scraperwiki / spreadsheet-upload-tool

A ScraperWiki tool for uploading structured data from a CSV or Excel spreadsheet
3 stars 5 forks source link

UnicodeDecodeError #18

Open zarino opened 10 years ago

zarino commented 10 years ago

Extracting this CSV:

http://gcloud.civilservice.gov.uk/files/2013/09/G-Cloud-Total-Spend-11-09-13.csv

Results in this error:

UnicodeDecodeError 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)

Running the file through file on my Mac says:

$ file G-Cloud-Total-Spend-11-09-13.csv 
G-Cloud-Total-Spend-11-09-13.csv: Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

We shouldn't be farting around with text encodings. This stuff should just work!!

zarino commented 10 years ago

To add insult to injury, if I open the CSV in Excel, save it as an XLS file, and then upload that, I end up with the "ragged_rows" error from #16.

Grrrr!!

drj11 commented 10 years ago

I agree that "it should just work", but we cannot tell what encoding a CSV file uses. What do you suggest we do?

We could use something like chardet (https://pypi.python.org/pypi/chardet) to guess the encoding:

$ chardet *.csv
G-Cloud-Total-Spend-11-09-13.csv: ISO-8859-2 (confidence: 0.66)

(by inspecting the file it seems to be in ISO-8859-1, so chardet gets it wrong in this particular case sigh).

The best I can suggest is that we have a list of encodings, and go through them one at a time. ISO-8859-1 should be last (because it will always work).

drj11 commented 10 years ago

For bonus points, can you tell what character appears after the string "Memstore" on line 348 of this CSV? The byte at this position is 0x99 which is not a valid character in any ISO-8859-* encoding which is why file says "non-ISO extended-ASCII" I think.

The non-ASCII characters that appear in this CSV file are:

0xa0 - NBSP (non-breaking space, weird, but fine)
0xa3 - £ (pound sign, fine)
0xae - ® (registered trademark, fine)
0x92
0x96
0x99

Aha! I suddenly realise that the encoding is Windows-1252 which uses positions 0x80 to 0x9f for printable characters.

The wikipedia page http://en.wikipedia.org/wiki/Windows-1252 says that it is a superset of iso-8859-1 and that the HTML 5 standard requires that documents advertising themselves as iso-8859-1 should in fact be parsed as if they were Windows-1252 (because it's common for Windows-1252 documents to be mislabelled as iso-8859-1).

drj11 commented 10 years ago

So I now think we could reasonably go: