mdaeron / D47crunch

Python library for processing and standardizing carbonate clumped-isotope analyses, from low-level data out of a dual-inlet mass spectrometer to final, “absolute” Δ47, Δ48, and Δ49 values with fully propagated analytical error estimates.
MIT License

UTF-8 encoding of csv #17

Open · timpol opened this issue 1 year ago

timpol commented 1 year ago

Just a suggestion regarding the documentation...

It seems that `D47data.read()` will fail if the input csv file is in UTF-8 encoding. This is not a problem with D47crunch itself, but it may trip up new users if they produce their csv file using the default csv format of certain spreadsheet programs.

Would it be worth putting a warning about this in section 1.2 of the documentation just in case?

mdaeron commented 1 year ago

Thanks for reporting the issue.

> Would it be worth putting a warning about this in section 1.2 of the documentation just in case?

Would you mind sharing a CSV that fails? The best option may be to correct this behavior.

timpol commented 1 year ago

I should have mentioned that whether or not it fails depends on the ordering of the columns...

If you take the rawdata.csv example file, re-order the columns so that any column other than UID or Session comes first, and then save it as a .csv in UTF-8 encoding from the spreadsheet program, it fails with `KeyError: 'Sample'` (or a similar key error) on the line `for s in sorted({r['Sample'] for r in self})` when `self.refresh_samples()` is called.

I think this is because the spreadsheet program puts a UTF-8 byte order mark (BOM) at the start of the file, so D47crunch can't find the column headings it expects to see.
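
For illustration, here is a minimal sketch (using the standard-library csv module, not D47crunch's actual reader) of how the BOM ends up glued to the first column name:

```python
import csv
import io

# Simulated spreadsheet export: UTF-8 with a leading byte order mark,
# and a column other than UID or Session in first position.
raw = '\ufeffSample,UID,Session\nETH-1,A01,Session01\n'

reader = csv.DictReader(io.StringIO(raw))
print(reader.fieldnames)  # ['\ufeffSample', 'UID', 'Session']

row = next(reader)
try:
    row['Sample']
except KeyError as e:
    print('KeyError:', e)  # the real key is '\ufeffSample'
```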

mdaeron commented 1 year ago

In that case, the best-case scenario is that another field gets a spurious BOM in its name, which is likely to cause cryptic problems down the line. If you send me an example file, I can try to debug.

The longer-term solution, of course, is to use a more robust csv parsing implementation. There are probably many out there.
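
For instance (just a sketch, and an assumption about how the file would be opened, not the current implementation), opening input files with Python's utf-8-sig codec would strip a leading BOM transparently:

```python
# 'utf-8-sig' silently consumes an optional leading BOM and is a
# no-op for plain UTF-8 or ASCII input, so it is safe as a default.
with open('rawdata.csv', encoding='utf-8-sig') as fid:
    txt = fid.read()
```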

On the other hand, you could argue that producing clean csv files is Excel's job, not D47crunch's. Implementing corrective measures in every library likely to encounter this format is expensive. Converting “quasi-CSV” to “bare-CSV” between Excel and these libraries would probably be less expensive. YMMV.
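
As a sketch of that intermediate cleanup step (file names hypothetical), the conversion could be a three-line script:

```python
# Rewrite an Excel-exported CSV as plain UTF-8 without the BOM.
with open('rawdata_dirty.csv', encoding='utf-8-sig') as src:
    clean = src.read()
with open('rawdata_clean.csv', 'w', encoding='utf-8') as dst:
    dst.write(clean)
```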

timpol commented 1 year ago

I agree 100% that it is an Excel problem, not a D47crunch problem... I just thought it might be discouraging for a relatively new Python user trying to get their D47crunch / D47calib script working.

A simple error message informing the user that they need to clean up their filthy csv file could do the job?
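
Something along these lines, maybe (check_for_bom is purely hypothetical, not an existing D47crunch function):

```python
def check_for_bom(fieldnames):
    # Fail early with a readable message instead of a cryptic KeyError later.
    if fieldnames and fieldnames[0].startswith('\ufeff'):
        raise ValueError(
            'Input CSV starts with a UTF-8 byte order mark (BOM). '
            'Please re-save it as plain CSV (without BOM) before importing.'
        )
```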

See file attached: rawdata_dirty.csv