Open ShanaLMoore opened 7 months ago
"identifier" == "identifier" => false
The bytes of the two strings don't match, which seems to be why that comparison evaluates to false.
keys.map! do |key|
key.encode("UTF-8").gsub("\uFEFF", "")
end
resolves it. But why? What's going on?
problematic CSV: SDAPI-SDAOI-CSV-clean-up-2023-11-07-csv-for-Staging.csv
When I tested a different CSV, it worked: identifier.csv
The problematic CSV had an invisible character at the beginning of it known as the Byte Order Mark (BOM), the code point U+FEFF, which is stored as the bytes EF BB BF in UTF-8. The BOM is like a secret handshake that tells programs what kind of text encoding the file uses, in this case UTF-8. While it is useful for some programs to understand how to read the file, it can cause trouble when the file is read by a program that doesn't expect it. The BOM isn't part of the actual data we want to work with, but it got treated as if it were: the first header was read as the BOM followed by "identifier" rather than plain "identifier", which led to the mismatch.
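To make the mismatch concrete, here is a minimal sketch that reproduces it without any CSV file. The variable names are made up for illustration; the only assumption is that the first header arrives with the BOM still attached, as described above.

```ruby
# The BOM is the code point U+FEFF; in UTF-8 it is the three bytes EF BB BF.
BOM = "\uFEFF"

# Hypothetical values: the first header as read from the file, and the name we expect.
header_from_file = "#{BOM}identifier"
expected_header  = "identifier"

# The two strings print identically but differ by one invisible character.
headers_match = header_from_file == expected_header

# The leading bytes of the header read from the file are the BOM bytes.
leading_bytes = header_from_file.bytes.first(3)

# Removing the BOM makes the comparison succeed.
cleaned_header = header_from_file.delete_prefix(BOM)

puts "match before cleaning: #{headers_match}"
puts "leading bytes: #{leading_bytes.inspect}"
puts "match after cleaning: #{cleaned_header == expected_header}"
```

This is the same effect the `gsub("\uFEFF", "")` call in the workaround has on each key.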
Thank you SO MUCH for this help! I created this CSV by exporting data from our ILS, Sierra. Then I imported the data into OpenRefine for clean up. Exported out of OpenRefine as a CSV, opened in Excel for final edits, and saved as UTF-8 (because we have Russian characters in the metadata). I expect to use that process again in the future...do you see a step along the way where I could avoid the invisible character or I could manually remove it myself?
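On removing it manually: a likely source is the final Excel step, since Excel's UTF-8 CSV save option typically writes a BOM, though that is an assumption about this particular workflow. One possible way to strip it after the fact is a small Ruby helper like the sketch below. `strip_bom_bytes` is a hypothetical name, not part of Bulkrax, and it only removes a BOM at the very start of the data.

```ruby
# The UTF-8 BOM as raw bytes (EF BB BF).
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns a binary copy of `data` without a leading UTF-8 BOM.
# All other bytes, including Cyrillic text, are left untouched.
def strip_bom_bytes(data)
  bytes = data.b
  bytes.start_with?(UTF8_BOM) ? bytes.byteslice(3..) : bytes
end
```

To rewrite a file in place, something like `File.binwrite(path, strip_bom_bytes(File.binread(path)))` should work, where `path` is the CSV saved from Excel. Keeping a copy of the original file first is a sensible precaution.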
Maybe scientist-softserv/adventist_knapsack#329 is related to this...and the work for that ticket had temporarily resolved the issue when I was testing on 9/21.
We addressed the BOM for header columns in the CSV: https://github.com/samvera/bulkrax/pull/689
It sounds like BOM characters are also part of the data fields, which will require a different approach.
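As a rough sketch of what handling both cases could look like (this is not the approach taken in the Bulkrax PR, and `parse_without_bom` is a hypothetical helper), the following strips U+FEFF from every header and every field after parsing:

```ruby
require "csv"

BOM = "\uFEFF"

# Hypothetical helper: parses CSV text and returns an array of hashes with
# U+FEFF removed from every header and every field value.
def parse_without_bom(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    row.to_h
       .transform_keys { |key| key.to_s.gsub(BOM, "") }
       .transform_values { |value| value&.gsub(BOM, "") } # empty fields are nil
  end
end
```

For example, `parse_without_bom(File.read(path, encoding: "UTF-8"))` would return rows whose keys compare equal to plain "identifier", assuming `path` points at the exported CSV.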
Summary
From Katharine: :thread:
"Could someone take a look at my work and give me advice? While prod is busy, I'm hoping to run a test upload on staging, but my import failed, and I'm stumped as to why. The error message says 'StandardError - Missing at least one required element, missing element(s) are: identifier' but I have an identifier and identifier.ark column in the CSV I used. I can certainly make this a ticket, but I thought I'd ask quickly on Slack in case you easily see what I did wrong."
Acceptance Criteria
SDAPI-SDAOI-CSV-clean-up-2023-11-07-csv-for-Staging.csv
Screenshots or Video
Testing Instructions
Import the CSV with Bulkrax
Notes
Shana's sleuthing: https://assaydepot.slack.com/archives/C0313NJV9PE/p1699395723234479