mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository

Media referenced by URL with `?raw=true` suffix not loading. #260

Closed Digital-Grinnell closed 3 years ago

Digital-Grinnell commented 3 years ago

Working on ICG's test of I8 and we have a Google Sheet with media/file references like this:

https://github.com/Islandora-Collaboration-Group/islandora-sample-objects/blob/master/VIDEO/Video_03/Video_03.mp4?raw=true

This was for an object to which I assigned a unique ID of 1100118. The “intermediate” file left behind in my input_data folder was correspondingly named 1100118/Video_03.mp4?raw=true. That intermediate file was a viable .mp4 video because I was able to play it locally once I removed the ?raw=true suffix from the filename. Unfortunately, Workbench was subsequently unable to upload the viable "intermediate" file, presumably because the filename still had the ?raw=true suffix.

So, I changed the entry in the Google Sheet to read https://github.com/Islandora-Collaboration-Group/islandora-sample-objects/blob/master/VIDEO/Video_03/Video_03.mp4 thinking that would solve the problem. It did not. That URL returned an HTML response and I subsequently got the following in my log file:

28-Apr-21 15:15:19 - INFO - Node for Olivia's Arrival (record 1100118) created at https://icg-islandora.williams.edu/node/63.
28-Apr-21 15:15:21 - ERROR - Media not created, PUT request to "https://icg-islandora.williams.edu/node/63/media/video/16" returned an HTTP status code of "403".

mjordan commented 3 years ago

@Digital-Grinnell As a generic (non-Github specific) solution, I wonder if just having Workbench strip everything including and after a ? in the filenames of downloaded files would be sufficient? Using your example, if the remote filename is Video_03.mp4?raw=true, the version that Workbench saves would be Video_03.mp4.
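A minimal sketch of that idea, not Workbench's actual code: the function names `filename_from_url` and `strip_query_from_filename` are hypothetical, and the sketch assumes that the saved filename should simply be the last path segment of the URL.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string or fragment.

    Hypothetical helper for illustration only.
    """
    # urlsplit() separates the path from the "?query" and "#fragment" parts.
    path = urlsplit(url).path
    # Decode percent-escapes (e.g. %20) so the saved name is readable.
    return posixpath.basename(unquote(path))


def strip_query_from_filename(filename):
    """Drop everything from the first "?" onward in an already-saved filename."""
    return filename.split("?", 1)[0]


url = (
    "https://github.com/Islandora-Collaboration-Group/islandora-sample-objects"
    "/blob/master/VIDEO/Video_03/Video_03.mp4?raw=true"
)
print(filename_from_url(url))
print(strip_query_from_filename("Video_03.mp4?raw=true"))
```

Deriving the name from the parsed URL path, rather than splitting the saved name on `?`, also drops any `#fragment`, but either approach would turn `Video_03.mp4?raw=true` into `Video_03.mp4`.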

Digital-Grinnell commented 3 years ago

Yes, that was what I was thinking too. Based on my experience, I think it would work.


mjordan commented 3 years ago

@Digital-Grinnell can you test the issue-260 branch to see if it resolves the problem?

Edit: hold off, I broke something else.

mjordan commented 3 years ago

@Digital-Grinnell OK, the issue-260 branch is ready to test, if you can.

McFateM commented 3 years ago

I pulled the issue-260 branch and subsequently ran a brief test last evening, an ingest of video content that featured three different file references in the following forms:

All three ingested successfully this time! The output from my test run follows.

╭─markmcfate@MAC02NX13MG5RP ~/GitHub/islandora_workbench ‹ruby-2.3.0› ‹issue-260*›
╰─$ ./workbench --config icg_testing-video.yml
OK, connection to Drupal at https://icg-islandora.williams.edu verified.
Warning: Media creation in your version of Drupal (8.9.14) is less reliable than in Drupal 9.2 or higher.
Node for "A Brief History of Acceleration" (record 20116) created at https://icg-islandora.williams.edu/node/69.
+ Media for https://github.com/Islandora-Collaboration-Group/islandora-sample-objects/raw/master/VIDEO/Video_01/Video_01.mp4 created.
Node for "Blues for C.M." (record 20117) created at https://icg-islandora.williams.edu/node/70.
+ Media for https://github.com/Islandora-Collaboration-Group/islandora-sample-objects/blob/master/VIDEO/Video_02/Video_02.mp4?raw=true created.
Node for "Olivia's Arrival" (record 20118) created at https://icg-islandora.williams.edu/node/71.
+ Media for Video_03.mp4 created.

If I have time this afternoon I'll run a similar test using PDFs. Looking good.

McFateM commented 3 years ago

Just did a PDF ingest test using issue-260 branch of Workbench and had mixed results. It looks like the file handling works properly as I tested this time with references like this:

All three objects were created but I got NO media again. However, this time the media errors in the logs are of the form:

11-May-21 09:06:43 - INFO - Node for A search for antigens common to fetal and tumor cells (record 21001) created at https://icg-islandora.williams.edu/node/76.
11-May-21 09:06:44 - ERROR - Media not created, PUT request to "https://icg-islandora.williams.edu/node/76/media/document/16" returned an HTTP status code of "404".

The two URL file references left behind viable intermediate directories and PDF documents, so that's a good sign. Another member of our ICG testing team is bringing this to DKC's attention now.

Unlike earlier tests that returned 403 errors, this ingest was performed using an admin account that had sufficient privileges to successfully create media for other content types.

mjordan commented 3 years ago

Can you run curl -v -uadmin:islandora "https://icg-islandora.williams.edu/islandora_workbench_integration/core_version" replacing the credentials with the same ones used in your config file and let me know what comes back? Should be something like {"core_version":"9.3.0-dev"}.

The 404 is being generated by the Islandora media REST endpoint, but that endpoint has to exist, otherwise you'd see media not being created in general. Not sure what's going on yet.
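For reference, a rough sketch of how a response body of that shape could be interpreted. This is an illustration only, not Workbench's actual logic: the function names are hypothetical, and the 9.2 threshold is taken from the warning message in the test output earlier in this thread.

```python
import json
import re


def parse_core_version(response_body):
    """Extract (major, minor) from a body like {"core_version":"9.3.0-dev"}.

    Hypothetical helper for illustration only.
    """
    version = json.loads(response_body)["core_version"]
    # Only the leading "major.minor" matters; suffixes such as "-dev" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized core version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def media_creation_is_reliable(response_body, minimum=(9, 2)):
    """True if the reported Drupal core version is at least `minimum`."""
    # Tuples compare element by element, so (9, 3) >= (9, 2) and (8, 9) < (9, 2).
    return parse_core_version(response_body) >= minimum


print(media_creation_is_reliable('{"core_version":"9.3.0-dev"}'))
print(media_creation_is_reliable('{"core_version":"8.9.14"}'))
```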

Digital-Grinnell commented 3 years ago

Here are the results...

╭─markmcfate@MAC02NX13MG5RP ~/GitHub/islandora_workbench ‹ruby-2.3.0› ‹issue-260*›
╰─$ curl -v -uadmin:xxxxxxxxxxxxx "https://icg-islandora.williams.edu/islandora_workbench_integration/core_version"


mjordan commented 3 years ago

OK, thanks. Can you pull in the latest updates to Islandora Workbench from the main branch and try to ingest those PDFs again?

seth-shaw-unlv commented 3 years ago

@McFateM , can you confirm that you have a "Document" media type configured (/admin/structure/media/manage/document)?

mjordan commented 3 years ago

If that's the issue here, we can make Workbench confirm that the media type exists during its --check phase.
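A rough sketch of what such a check might look like, under loud assumptions: `media_type_exists` and `get_status` are hypothetical names, the path is the admin URL mentioned in the previous comment, and the sketch assumes Drupal answers 200 for an existing media type and 404 for a missing one when queried by a sufficiently privileged user. The HTTP call is passed in as a function so the logic can be tried without a live site.

```python
def media_type_exists(base_url, media_type, get_status):
    """Return True/False depending on whether the media type's admin page exists.

    `get_status` is any callable that takes a URL and returns an HTTP status
    code (for example, a thin wrapper around an authenticated GET request).
    Hypothetical helper for illustration only.
    """
    url = f"{base_url.rstrip('/')}/admin/structure/media/manage/{media_type}"
    status = get_status(url)
    if status == 200:
        return True
    if status == 404:
        return False
    # A 403 or other code says nothing about existence, so do not guess.
    raise RuntimeError(
        f"Cannot tell whether media type {media_type!r} exists: "
        f"HTTP {status} from {url}"
    )


# Stand-in for a real HTTP client: pretend only "video" is configured.
def fake_get_status(url):
    return 200 if url.endswith("/video") else 404


site = "https://icg-islandora.williams.edu"
print(media_type_exists(site, "video", fake_get_status))
print(media_type_exists(site, "document", fake_get_status))
```

Treating anything other than 200 or 404 as an error keeps a permissions problem (403) from being misreported as a missing media type.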

mjordan commented 3 years ago

@Digital-Grinnell OK to close this issue, since you've been able to ingest files whose URLs have a ? query string? I've opened #269 to address verifying media types exist.

McFateM commented 3 years ago

Yes, by all means. Thanks!


