Closed cassws closed 4 years ago
Quick addition: The workflow that generated the extra <U+FEFF> (BOM) character was: download Google Sheet to Excel format, open in Excel and make some edits, save to .csv (on Mac). (Then reopened the .csv in Excel and saved again.) So I'm not sure when the extra character was introduced, but it wasn't visible when viewing the files in plain text editors. Had to use command line "less" and "more" to detect the error. Because this issue was tricky to diagnose and also a bit tricky to fix, it would be especially nice for the code base to be able to detect and/or handle this extra character in a csv.
Hey @zoews and @amzoss, I'm definitely interested in your ideas here. As some context, wax actually does pre-parsing and format validation (see: utils.rb#L52) , and should throw an InvalidSource error if the file is invalid (see: utils.rb#L17).
It sounds like your case (unfortunately) opens as a valid CSV and doesn't trigger that error, even though it is ultimately formatted incorrectly and therefore hits another error down the chain (MissingPid).
When teaching wax, I usually just tell folks "Please please don't use Excel!!" because it doesn't respect character encodings and frequently does irreversible damage to a file (by adding its own line break codes, denaturing diacritics, etc.) I've mulled over making wax do more here, but have decided not to try correcting encoding errors in the interest of transparency / avoiding "black box" changes. Plus it's just really hard to do 🙃
Some options that seem like an OK balance between transparency + intervention:
<U+FEFF>
) and throw a warning (instead of an error, in case someone means to have them).Do any of these seem reasonable/helpful to you? Thanks for taking the time to share and think about this issue!
Thank you @mnyrop ! And ah ha I see how this may have slipped past that initial layer of CSV validation!
Hm. I definitely hear what you say re: the scope/black box issues with this kind of change, especially given minimalist ethos. It makes sense not to ask Wax to do too much of the heavy lifting around formatting/parsing. I think of what you described, what jumps out to me is no. 2 + no 3. How would you feel about, instead of
WaxTasks::Error::MissingPid: Collection is missing pid for item 0.
Using
WaxTasks::Error::MissingPid: Collection is missing pid for item 0. Hint: review common .csv formatting issues (such as hidden characters) in the documentation: https://minicomp.github.io/wiki/wax/preparing-your-collection-data/metadata/
And then amending the wiki with references to both Excel issues and the linting? That is, if an error message reference to the documentation does not feel superfluous? I remember two students I worked with in summer 2019 encountered a similar barrier around interpreting a pid error, so it may be a worthwhile intervention in terms of removing an early roadblack.
If that seems of interest, I can go ahead and start working on a PR for the error message and documentation changes :) (I'll have to investigate how the rake process works and throws errors, so that might take a bit.)
@zoews That sounds great! Feel free to ping me with questions.
Hi there! @amzoss and I encountered a similar scenario to #65 -- in our case, an Excel encoding issue with our source .csv inserted an extra byte into the head of the csv which provoked the error.
Would it be possible to either:
(1) replace language around "MissingPid" with more generic language referring to issues reading the .csv and direct user to the style guide,
(2) add more specific detection and communication of errors, so opposite direction, or
(3) incorporate some parsing/input handling in the rake tooling, if that's possible? @amzoss used vim to strip out extra bytes as per this, don't know if that'd be relevant.
I'm happy to contribute a PR towards whatever approach makes the most sense! Thank you ^_^