openthc / ccrs

Tracking for the WSLCB CCRS
https://lcb.wa.gov/ccrs
MIT License

CCRS PRR files include \t in fields, row widths vary #49

Open ripnyt-ripnyt opened 1 year ago

ripnyt-ripnyt commented 1 year ago

The name column in Stains_0.csv includes \t, which is the same as the file's delimiter. This makes reading the data difficult.

djbusby commented 1 year ago

This is an issue with the CCRS specification. The LCB is aware of it and has no plans to fix it. Because the LCB will NOT support CSV files with quoted columns (which are part of the CSV specification), it's recommended to strip "exotic" characters.
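As a rough illustration of stripping such characters before a row is written, here is a minimal Python sketch. It assumes "exotic" means ASCII control characters (tab, CR, LF and similar); the names `clean_field` and `make_row` are hypothetical and are not part of this repository.

```python
import re

# Assumption: the problem characters are ASCII control characters,
# including the tab that doubles as the CCRS field delimiter.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]+")


def clean_field(value: str) -> str:
    """Replace each run of control characters with one space, then trim."""
    return _CONTROL_CHARS.sub(" ", value).strip()


def make_row(fields) -> str:
    """Join cleaned fields with a tab, so the delimiter only appears between fields."""
    return "\t".join(clean_field(f) for f in fields)
```

Since quoting is not available, removing the delimiter from the data itself is the only way to keep every row the same width.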

When it comes to reading garbage data out of the CCRS files it's a little more work. When the scripts run, they should produce some barf when there are problems with the lines. Capture that output (e.g. with tee) and then one can manually identify those records for re-import.
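If you are reading the files with your own tooling rather than these scripts, the same idea can be sketched in Python: set aside any row whose field count differs from the header so it can be reviewed by hand. This assumes the first line is a header and that the text has already been decoded; `split_rows` is a hypothetical helper, not something in this repository.

```python
def split_rows(lines, delimiter="\t"):
    """Return (good_rows, bad_rows).

    good_rows is a list of field lists (header included).
    bad_rows is a list of (line_number, raw_line) for rows whose width
    differs from the first line, which is assumed to be the header.
    """
    good, bad = [], []
    expected = None
    for number, line in enumerate(lines, start=1):
        raw = line.rstrip("\r\n")
        # Plain split on purpose: the files have no quoting to honor.
        fields = raw.split(delimiter)
        if expected is None:
            expected = len(fields)
        if len(fields) == expected:
            good.append(fields)
        else:
            bad.append((number, raw))
    return good, bad
```

Note that this only catches rows where a stray tab changes the width; it cannot tell which field the extra tab belonged to, so the flagged rows still need manual repair.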

You could also bring this issue to the LCB; however, you should know it's been raised numerous times, by numerous vendors, since at least 2021-12-04 and the LCB response has been 'WONTFIX'.

I'll leave this open for conversation (if needed) for a few days; but will close out in ~14 days.

ripnyt-ripnyt commented 1 year ago

Thanks for the well-informed response. I've also found the encoding of the raw files a little confusing. Is it utf-16-le or Latin1? Yes, the data is very garbage and I've been fighting it for a week now.

djbusby commented 1 year ago

They are UTF-16; take a look at the code here: https://github.com/openthc/data/tree/master/bin/ccrs Specifically, see inflate.php, which calls iconv to convert from UTF-16 to UTF-8.
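For anyone not using the PHP scripts, a rough Python equivalent of that conversion step might look like the sketch below. It is not a port of inflate.php. It assumes the files begin with a byte order mark, which Python's "utf-16" codec uses to pick the byte order; if a file has no BOM, "utf-16-le" may be the codec to try instead. Both function names are hypothetical.

```python
def utf16_to_utf8_bytes(data: bytes) -> bytes:
    """Decode UTF-16 bytes (BOM assumed) and re-encode them as UTF-8."""
    return data.decode("utf-16").encode("utf-8")


def utf16_to_utf8_file(src_path: str, dst_path: str) -> None:
    """Stream a UTF-16 text file to a UTF-8 copy, line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```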

ripnyt-ripnyt commented 1 year ago

Thank you.

djbusby commented 1 year ago

Related to #43