scientist-softserv / adventist_knapsack

🐛 Related Url imports fail #685

Open ShanaLMoore opened 1 month ago

ShanaLMoore commented 1 month ago

Summary

When a related_url contains + characters, Bulkrax replaces them with spaces, which makes the URL invalid.

Importer with related_url fails: https://adl.b2.adventistdigitallibrary.org/importers/380?locale=en

The related URL in this import's CSV looks like:

https://adl-ebstore-redirect.s3.amazonaws.com/CAR/P+Collection/BOX+010/P006171.tif

In the spike ticket, Kirk determined the following:

"This seems to be the issue of the CSV's URL's using + which gets stripped out and replaced with a (space) character. We've actually seen this before in this thread. The workaround right now is just to replace the + with %20 instead."
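To illustrate the behavior Kirk describes, here is a minimal Python sketch of how form-style URL decoding turns + into a space while plain percent-decoding leaves it alone. This is an illustration only: Bulkrax is a Ruby gem, and the thread does not confirm which decoding call is responsible, so treating form-style decoding as the cause is an assumption.

```python
from urllib.parse import unquote, unquote_plus

# Sample related URL taken from the importer's CSV.
url = "https://adl-ebstore-redirect.s3.amazonaws.com/CAR/P+Collection/BOX+010/P006171.tif"

# Form-style decoding treats "+" as an encoded space (assumed to mirror the bug).
form_decoded = unquote_plus(url)

# Plain percent-decoding leaves "+" untouched.
path_decoded = unquote(url)

# The workaround: write the space as %20 so either decoder yields the same result.
escaped = url.replace("+", "%20")

print(form_decoded)
print(path_decoded)
print(unquote(escaped))
```

With %20 in place of +, both decoding styles produce the same string, which is why the workaround sidesteps the ambiguity.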

Until Bulkrax is fixed, the client is aware that, as a workaround, they can update their CSV by replacing + with %20.
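The replacement can be scripted rather than done by hand. The following is a rough sketch, assuming the URLs live in a column named related_url; the function name and the source_identifier column in the usage example are hypothetical, so adjust them to match the real CSV headers.

```python
import csv
import io


def replace_plus_in_column(csv_text, column="related_url"):
    """Return csv_text with '+' replaced by '%20' in one column.

    The column name is an assumption; other columns are left unchanged.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row.get(column):
            row[column] = row[column].replace("+", "%20")
        writer.writerow(row)
    return out.getvalue()


# Usage with a one-row sample (source_identifier is a hypothetical header).
sample = (
    "source_identifier,related_url\n"
    "P006171,https://adl-ebstore-redirect.s3.amazonaws.com/CAR/P+Collection/BOX+010/P006171.tif\n"
)
result = replace_plus_in_column(sample)
print(result)
```

Note that this replaces every + in the column, so it is only appropriate when each + in the URL stands for a space.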

original/spike ticket:

Acceptance Criteria

Testing Instructions

  1. Create a Bulkrax CSV importer for the sample file
  2. Verify the completion and success of the import

Notes

KatharineV commented 1 month ago

@kirkkwang I replaced the + signs with %20 and tried this importer again, but it still fails with the "missing identifier" error. Is there something else wrong with my CSV that I'm just not seeing because my eyes have glazed over? Is it possible that I need to create a new/fresh importer rather than editing the existing one? Could that possibly be affecting things?

I'm a little anxious and want to fix the works on this importer because we have ~750 records in the repository without files, due to my previous attempt with the + URLs that failed to attach files. I really want to get the files attached to these records before any users complain.

Thanks for looking into this!

Attached: "corrected" CSV with %20 instead of +

BOX-010-corrected-e.csv

KatharineV commented 1 month ago

Update: I ran a test import (number 384 on ADL prod) with related URLs that never had spaces in them, and the importer worked and the files attached.

https://adl.b2.adventistdigitallibrary.org/importers/384?locale=en

So perhaps what I really need your help with is a fix for the ~750 records that are caught (?) in importer 381 that I can't fix, since the cloud storage URLs have the + issue...

jillpe commented 1 month ago

Hi @KatharineV, do we want to make a new ticket to fix the records in 381 and close this one? When do you need these fixed by?

KatharineV commented 1 month ago

@jillpe I'm not sure about a new ticket, but I'm open to the idea. Is it possible that there's a larger issue at play and Bulkrax is no longer responding to Related URLs with %20 as well as +? Let me test a different importer with %20 in the related url field before I determine that the issue is isolated to a single messed-up importer. I also probably need to wait until after Valkyrie to do this testing, because I don't want Fedora instabilities to be a factor in any way.

We need these records fixed by October, which is the deadline the team knows we have in mind for public launch. So there's time to wait and decide about this ticket post-Valkyrie (if work wraps in July as planned).

jillpe commented 1 month ago

@KatharineV leaving this ticket open to verify after Valkyrie, and to check whether this is an isolated importer, sounds good! Mainly I wanted to make sure you didn't need these records in for a demo or something more urgent.

KatharineV commented 1 month ago

I ran an importer yesterday with three works using AWS S3 related_urls to attach files. The URLs originally had + in them. I replaced each + with %20 before ever creating and running the importer. The importer is stuck in pending. Two other importers created yesterday, with no %20 in their URLs, ran just fine. I'm adding this note to keep track of Bulkrax related_url issues prior to Valkyrie, for comparison after the upgrades are complete.

https://adl.b2.adventistdigitallibrary.org/importers/386?locale=en

jillpe commented 1 week ago

Verify whether this is still a problem after Valkyrie.