Open zarino opened 10 years ago
To add insult to injury, if I open the CSV in Excel, save it as an XLS file, and then upload that, I end up with the "ragged_rows" error from #16.
Grrrr!!
I agree that "it should just work", but we cannot tell what encoding a CSV file uses. What do you suggest we do?
We could use something like chardet (https://pypi.python.org/pypi/chardet) to guess the encoding:
$ chardet *.csv
G-Cloud-Total-Spend-11-09-13.csv: ISO-8859-2 (confidence: 0.66)
(by inspecting the file it seems to be in ISO-8859-1, so chardet gets it wrong in this particular case sigh).
The best I can suggest is that we have a list of encodings, and go through them one at a time. ISO-8859-1 should be last (because it will always work).
For bonus points, can you tell what character appears after the string "Memstore" on line 348 of this CSV? The byte at this position is 0x99 which is not a valid character in any ISO-8859-* encoding which is why file
says "non-ISO extended-ASCII" I think.
The non-ASCII characters that appear in this CSV file are:
0xa0 - NBSP (non-breaking space, weird, but fine)
0xa3 - £ (pound sign, fine)
0xae - ® (registered trademark, fine)
0x92
0x96
0x99
Aha! I suddenly realise that the encoding is Windows-1252 which uses positions 0x80 to 0x9f for printable characters.
The wikipedia page http://en.wikipedia.org/wiki/Windows-1252 says that it is a superset of iso-8859-1 and that the HTML 5 standard requires that documents advertising themselves as iso-8859-1 should in fact be parsed as if they were Windows-1252 (because it's common for Windows-1252 documents to be mislabelled as iso-8859-1).
So I now think we could reasonably go:
Extracting this CSV:
http://gcloud.civilservice.gov.uk/files/2013/09/G-Cloud-Total-Spend-11-09-13.csv
Results in this error:
Running the file through
file
on my Mac says:We shouldn't be farting around with text encodings. This stuff should just work!!