scientist-softserv / adventist_knapsack

Apache License 2.0

🐛 invisible BOM character breaks import #229

Open ShanaLMoore opened 7 months ago

ShanaLMoore commented 7 months ago

Summary

From Katharine:

"Could someone take a look at my work and give me advice? While prod is busy, I'm hoping to run a test upload on staging, but my import failed, and I'm stumped as to why. The error message says "StandardError - Missing at least one required element, missing element(s) are: identifier" but I have an identifier and identifier.ark column in the CSV I used. I can certainly make this a ticket, but I thought I'd ask quickly on Slack in case you easily see what I did wrong."

Acceptance Criteria

SDAPI-SDAOI-CSV-clean-up-2023-11-07-csv-for-Staging.csv

Screenshots or Video

Image

Testing Instructions

Import the CSV with Bulkrax

Notes

Shana's sleuthing: https://assaydepot.slack.com/archives/C0313NJV9PE/p1699395723234479

ShanaLMoore commented 7 months ago

"identifier" == "identifier" => false

The bytes of the two strings don't match, which is why that comparison evaluates to false.
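A minimal reproduction of the mismatch: the two headers render identically, but one carries an invisible U+FEFF prefix, which in UTF-8 is the three bytes EF BB BF.

```ruby
# A BOM-prefixed header looks identical on screen but starts with
# three extra bytes: EF BB BF, the UTF-8 encoding of U+FEFF.
plain = "identifier"
bom   = "\uFEFFidentifier"

puts plain == bom               # => false
puts plain.bytes.inspect
puts bom.bytes.first(3).inspect # => [239, 187, 191]
```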

keys.map! do |key|
  # Normalize to UTF-8 and strip any invisible U+FEFF (BOM) characters
  key.encode("UTF-8").gsub("\uFEFF", "")
end

resolves it. But why? What's going on?

problematic CSV: SDAPI-SDAOI-CSV-clean-up-2023-11-07-csv-for-Staging.csv

When I tested a different CSV it works: identifier.csv

The problematic CSV had an invisible character at the beginning of it known as the Byte Order Mark (BOM). The BOM is like a secret handshake that tells programs what kind of text encoding the file uses, in this case UTF-8. While the BOM helps some programs figure out how to read the file, it causes trouble when the file is read by a program that doesn't expect it. The BOM isn't part of the actual data we want to work with, but it got treated as if it were, becoming part of the first header name and leading to the "identifier" mismatch.
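One way to avoid the problem at read time: Ruby's IO layer can consume a leading BOM itself if the file is opened with the `bom|utf-8` external encoding, so the BOM never reaches the parsed headers. A small sketch (the file contents here are made up for illustration):

```ruby
require "csv"
require "tempfile"

# Write a CSV whose first header carries a UTF-8 BOM, the way Excel
# does when saving as "CSV UTF-8".
file = Tempfile.new(["bom_example", ".csv"])
file.write("\uFEFFidentifier,title\nark:/123,Example\n")
file.close

# "bom|utf-8" tells Ruby to consume a leading BOM before parsing,
# so the first header comes back as a clean "identifier".
rows = CSV.read(file.path, headers: true, encoding: "bom|utf-8")
puts rows.headers.inspect # => ["identifier", "title"]
```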

KatharineV commented 7 months ago

Thank you SO MUCH for this help! I created this CSV by exporting data from our ILS, Sierra. Then I imported the data into OpenRefine for clean up. Exported out of OpenRefine as a CSV, opened in Excel for final edits, and saved as UTF-8 (because we have Russian characters in the metadata). I expect to use that process again in the future...do you see a step along the way where I could avoid the invisible character or I could manually remove it myself?
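For manually removing the character before upload, one option is a small script that rewrites the file without its first three bytes when they are the BOM. This is just a sketch; the file here is a stand-in for the real export:

```ruby
require "tempfile"

# Simulate a CSV saved by Excel as "CSV UTF-8": the file starts with
# the three BOM bytes EF BB BF.
file = Tempfile.new(["sierra_export", ".csv"])
file.binmode
file.write("\xEF\xBB\xBFidentifier,title\n".b)
file.close

# Rewrite the file in place without the BOM, if one is present.
data = File.binread(file.path)
bom  = "\xEF\xBB\xBF".b
if data.start_with?(bom)
  File.binwrite(file.path, data.byteslice(3, data.bytesize - 3))
end

puts File.binread(file.path) # => "identifier,title\n"
```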

KatharineV commented 7 months ago

Maybe scientist-softserv/adventist_knapsack#329 is related to this...and the work for that ticket had temporarily resolved the issue when I was testing on 9/21.

jeremyf commented 7 months ago

We addressed the BOM for header columns in the CSV: https://github.com/samvera/bulkrax/pull/689

It sounds like BOM characters also appear inside the data fields themselves, which will require a different approach.
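Handling BOMs inside cell values would mean scrubbing each value rather than just the headers. A sketch of one possible approach, not the Bulkrax implementation, with a made-up row hash:

```ruby
# Strip stray U+FEFF characters from a single cell value; non-string
# values (nil, numbers) pass through untouched.
def strip_bom(value)
  return value unless value.is_a?(String)

  value.encode("UTF-8").delete("\uFEFF")
end

# Hypothetical parsed row whose identifier value carries a BOM.
row   = { "identifier" => "\uFEFFark:/123", "title" => "Example" }
clean = row.transform_values { |v| strip_bom(v) }

puts clean.inspect # => {"identifier"=>"ark:/123", "title"=>"Example"}
```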