scientist-softserv / adventist_knapsack

Apache License 2.0

Spike: CSV import tests with diacritics are failing #329

Open KatharineV opened 1 year ago

KatharineV commented 1 year ago

Hi team. We need to use diacritics in our metadata, so I'm testing whether it's possible to import UTF-8 CSVs via Bulkrax. Based on my testing, the imports fail, but the reported error does not look accurate. Bulkrax tells me that the error is "StandardError - Missing at least one required element, missing element(s) are: identifier." However, the CSV does include this required field.

Here are two importers side-by-side. The first one was successful. The file I used was a CSV with no diacritics. The identifier and identifier.ark fields were filled with unique IDs. For the second test, I changed only three things: I added diacritics to the Title, I entered new unique IDs, and I saved the file as UTF-8. The second test failed for missing identifiers.

Could you let me know whether UTF-8 files with diacritics are supposed to work, or whether failure is expected? Is the UTF-8 / diacritics combo actually causing the "missing identifier" error here? If UTF-8 CSVs aren't the answer, is there another way I can prepare a CSV for upload while keeping diacritics intact? Thanks for advising.
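
One possible explanation, not confirmed anywhere in this thread, is that some spreadsheet tools write a byte-order mark (BOM) at the start of a file saved as UTF-8. If the parser does not strip it, the first header can end up as an invisible character followed by `identifier`, which would not match the required field name. The sketch below is a minimal illustration in Python, not Bulkrax's own code; the sample data and the `first_header` helper are made up for the example.

```python
import codecs
import csv
import io


def first_header(raw: bytes, encoding: str) -> str:
    """Return the first column name of CSV bytes decoded with `encoding`."""
    text = raw.decode(encoding)
    return next(csv.reader(io.StringIO(text)))[0]


# Hypothetical two-column file saved with a leading UTF-8 byte-order mark.
sample = codecs.BOM_UTF8 + "identifier,title\n001,Épître\n".encode("utf-8")

# True when the file starts with the three BOM bytes EF BB BF.
has_bom = sample.startswith(codecs.BOM_UTF8)

# Decoding as plain "utf-8" keeps the BOM attached to the first header.
plain = first_header(sample, "utf-8")

# Decoding as "utf-8-sig" drops the BOM, so the header matches again.
stripped = first_header(sample, "utf-8-sig")

print(has_bom, repr(plain), repr(stripped))
```

If a check like this shows a BOM on the failing file and none on the working one, re-saving the CSV as UTF-8 without a BOM would be a reasonable thing to try before digging further.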

jillpe commented 11 months ago

The goal is to identify why this is happening and whether UTF-8 CSVs with diacritics can be imported without failing.

kirkkwang commented 11 months ago

@KatharineV I re-ran the importer with what appears to be the same CSV, and it worked.

2023-08-30_identifier_test_UTF-8.csv

KatharineV commented 11 months ago

I'm a little discouraged because I can't get the importers to work like you did. I just ran importers on staging and production with a CSV (UTF-8, like before) that contains a single diacritic on line 14. Both failed with this error: "CSV::MalformedCSVError - Invalid byte sequence in UTF-8 in line 14."

Importer on staging: https://sdapi.s2.adventistdigitallibrary.org/importers/68?locale=en

Importer on production: https://sdapi.b2.adventistdigitallibrary.org/importers/70?locale=en

Zip file I used for the recent examples is attached.

Record_SPD_2023_07_01 (2).zip

I could use some help. :(

EDITED TO ADD: Ugh, I'm doing so much testing in so many windows. Maybe this CSV was not UTF-8. Stand by. I will update with a new comment after I try again.
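
An "Invalid byte sequence in UTF-8" error generally means that some bytes in the file are not valid UTF-8, which is what happens when a file containing diacritics was saved in a legacy encoding. A minimal Python sketch for checking a file before upload is below. The function names and the sample data are hypothetical, and `cp1252` is only a guess at the original encoding (a common default for Windows spreadsheet exports).

```python
def invalid_utf8_lines(raw: bytes) -> list[int]:
    """Return 1-based numbers of physical lines that are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: it never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def reencode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes as UTF-8, assuming they were saved as `source_encoding`."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical CSV saved as Windows-1252 with one diacritic on line 3.
sample = "identifier,title\n001,Plain\n002,Señor\n".encode("cp1252")

print(invalid_utf8_lines(sample))                    # lines that fail to decode
print(invalid_utf8_lines(reencode_to_utf8(sample)))  # after re-encoding
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. The reported numbers are physical lines, so they can differ from the CSV parser's line count if a quoted field contains a line break.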

KatharineV commented 11 months ago

Update! Here are the most recent importers with (confirmed, ugh) UTF-8 CSVs.

Staging was successful: https://sdapi.s2.adventistdigitallibrary.org/importers/69?locale=en

Production failed: https://sdapi.b2.adventistdigitallibrary.org/importers/71?locale=en

I used the same zipped files and CSV for both importers. Only the environment was different. So maybe the diacritics aren't the problem, and this is related to scientist-softserv/adventist_knapsack#334 instead. If you agree, then this ticket is done and I'd close it...but I'd like confirmation from someone else that I'm reading the situation correctly.

KatharineV commented 10 months ago

Could scientist-softserv/adventist_knapsack#229 be related? If so, the issue could be UTF-8 handling, which we do need to work because our metadata fairly regularly includes diacritics.