move-coop / parsons

A python library of connectors for the progressive community.
https://www.parsonsproject.org/

Issue Passing Non utf-8 files Through Box.get_table_by_file_id Method #500

Closed MJones36 closed 2 years ago

MJones36 commented 3 years ago

While trying to load some files from Box into Redshift using the box.get_table_by_file_id method, we got the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5561: invalid continuation byte. We realized that we were only getting this error on CSV files that a third party had saved from Excel in a format other than CSV UTF-8.

Using the chardet package, we tried to detect the encoding of the Excel-saved CSVs. The package reported, with a confidence of 1.0, that the files were ASCII, but when we passed that encoding to parsons.from_csv and petl.fromcsv we still got a UnicodeDecodeError, implying that ASCII is not correct either.
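One possible explanation for an ASCII result on a file that is not ASCII, offered here as an assumption and not something confirmed in this thread, is that detection only saw the start of the file. If every sampled byte is below 0x80, the sample is valid ASCII even when a later byte is not. A minimal standard-library sketch with made-up data (`data` and `sample_size` are hypothetical and not taken from chardet or from the real files):

```python
# Hypothetical file contents: 5561 plain ASCII bytes, then one 0xed byte.
data = b"a" * 5561 + b"\xed" + b"\n"

# Hypothetical sample size; how much of a file gets inspected depends on
# how the detector is called.
sample_size = 1024
sample = data[:sample_size]

# The sampled prefix on its own is valid ASCII.
sample_is_ascii = sample.isascii()

# Decoding the whole file as UTF-8 fails at the first byte that does not fit.
try:
    data.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

print(sample_is_ascii, error_position)
```

If this is what happened, running detection on the full file contents instead of a prefix would be the first thing to check.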

We would love to have a way to reliably detect, anticipate, or at least attempt typical Excel encodings, and to pass the result through to methods that use .fromcsv.
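As a rough illustration of the "at least attempt typical encodings" idea, here is a minimal sketch that tries a short list of encodings in order and keeps the first one that decodes cleanly. It uses only the standard library. The helper name `decode_with_fallback`, the candidate list, and the sample bytes are all assumptions made for illustration; none of this is part of parsons or petl.

```python
import csv
import io

# Candidate encodings in the order they are tried. The list is an assumption:
# utf-8-sig handles UTF-8 with or without a byte order mark, cp1252 is a
# common Windows default for Western European locales, and latin-1 maps
# every byte to a character, so it acts as a last resort that never fails.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical CSV bytes containing single-byte accented characters.
raw = b"name,city\nJos\xe9,Bogot\xe1\nMar\xeda,Lima\n"

text, used = decode_with_fallback(raw)
rows = list(csv.reader(io.StringIO(text)))
print(used, rows)
```

A fallback like this only shows that an encoding can decode the bytes, not that it is the right one, so the result is worth spot-checking. The encoding it reports could then be passed as the encoding argument to the CSV loaders mentioned above.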

dannyboy15 commented 3 years ago

I think you may be using the wrong encoding. The 0xed byte appears to be the character í in ISO-8859-1 (Latin-1). I would recommend trying the iso-8859-1 encoding.
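A quick way to check that reading of the byte, using only the standard library (the variable names are just for illustration):

```python
import unicodedata

byte = b"\xed"

# In ISO-8859-1 (and in Windows-1252) the single byte 0xed is one character.
as_latin1 = byte.decode("iso-8859-1")
as_cp1252 = byte.decode("cp1252")
char_name = unicodedata.name(as_latin1)

# In UTF-8 the same character takes two bytes, so a lone 0xed is invalid.
utf8_bytes = as_latin1.encode("utf-8")

print(as_latin1, char_name, utf8_bytes)
```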

Additionally, here's a gist for how I would try to load those files. It requires the method added by #510.

Hope this helps.

neverett commented 3 years ago

@MJones36 does this address the issue or should we leave this open and mark as a bug?