cphsolutionslab opened this issue 10 years ago
I've knocked together a similar case but can't reproduce the bug:
saved_file:

    cat,dog,mouseß
    æï,ñø,åÅ
code:

    from messytables import any_tableset, headers_guess

    for i in (1,):
        f = open('saved_file', 'rb')
        table_sets = any_tableset(f)
        row_set = table_sets.tables[0]
        print list(row_set)
        offset, headers = headers_guess(row_set.sample)
        print offset, headers
        for row in row_set:
            for cell in row:
                print cell.value
output:

    [[<Cell(String:u'cat'>, <Cell(String:u'dog'>, <Cell(String:u'mouse\xdf'>], [<Cell(String:u'\xe6\xef'>, <Cell(String:u'\xf1\xf8'>, <Cell(String:u'\xe5\xc5'>]]
    0 [u'cat', u'dog', u'mouse\xdf']
    cat
    dog
    mouseß
    æï
    ñø
    åÅ
which seems right to me. (This happens regardless of whether the file is CSV or XLS)
1) Is it stripping characters or replacing them with gibberish?
2) Can you provide the values of content_type and resource['format'].lower()?
3) Could you run this code with your file and let us know whether it produces the expected strings?
4) If your spreadsheet isn't sensitive, could you let us see it?
Dave.
On Tue, Oct 22, 2013 at 8:57 AM, bu1g notifications@github.com wrote:
The code below removes international characters, or so it seems. When I open the saved_file, which is UTF-8, and print out the content, all my international characters are shown correctly. After any_tableset is run and I print out all the rows, all the international characters are removed.
    f = open(result['saved_file'], 'rb')
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )
        # only first sheet in xls for time being
        row_set = table_sets.tables[0]
        offset, headers = headers_guess(row_set.sample)
It's currently an issue with DataStorer in CKAN: http://data.kk.dk/dataset/betalingszoner/resource/cde21ea2-6f87-46e1-be1f-f7a0d2cfc985
Debugging showed the characters being removed (stripped, not turned into gibberish) just after:
    table_sets = any_tableset(
        f,
        mimetype=content_type,
        extension=resource['format'].lower()
    )
Can you 100% confirm that your file is utf-8? Does chardet think your file is utf-8?
This is the output:

    {'confidence': 0.99, 'encoding': 'utf-8'}

after running this:

    f = open(result['saved_file'], 'rb')
    print chardet.detect(f.read())
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )
I don't think the file is UTF-8:
(python 2)
    >>> f = open("ows.csv", "r").read()
    >>> f[-100:]
    '03651972, 12.579757761196365 55.66998698041308))",Gr\xf8n,Gr\xf8n betalingszone,2010-02-22,2013-07-17,21\r\n'
Given that this is a bytestring, valid UTF-8 couldn't contain a lone non-ASCII byte like \xf8: every UTF-8 character that isn't ASCII is encoded as a multi-byte sequence.
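A quick way to check (a minimal Python 2 sketch; 'Gr\xf8n' is the fragment from the tail of the file):

    # A lone \xf8 byte is invalid UTF-8, but every byte is valid latin-1.
    data = 'Gr\xf8n'
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print 'not UTF-8:', e
    print repr(data.decode('latin-1'))  # u'Gr\xf8n', i.e. u'Grøn'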
'file' agrees:

    $ file ows.csv
    ows.csv: ISO-8859 text, with very long lines, with CRLF line terminators
Strange that chardet thinks it's UTF-8, given that UTF-8 is one of the easiest things to prove some text isn't.
commas.py, line 24:

    self.reader = codecs.getreader(encoding)(f, 'ignore')

I'm not sure silently ignoring characters that don't decode will ever be the correct behaviour.
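That 'ignore' is exactly what produces the reported symptom: decoding this file's ISO-8859 bytes as UTF-8 with errors='ignore' silently drops the undecodable bytes rather than raising. A minimal sketch of what that line does to a fragment of this file:

    import codecs
    import io

    # Mirror commas.py line 24 on a fragment of ISO-8859-1 bytes.
    raw = io.BytesIO('Gr\xf8n betalingszone\n')
    reader = codecs.getreader('utf-8')(raw, 'ignore')
    print repr(reader.read())  # u'Grn betalingszone\n' -- the \xf8 silently vanishes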
Also, only checking the start of the file causes misdetection in chardet, because the big polygon strings that dominate the first 2000 bytes are pure ASCII:
    >>> chardet.detect(open("/home/dragon/ows.csv").read())
    {'confidence': 0.766658867395801, 'encoding': 'ISO-8859-2'}
    >>> chardet.detect(open("/home/dragon/ows.csv").read(2000))
    {'confidence': 1.0, 'encoding': 'ascii'}
We could fall back to a possibly-mangling latin-1 instead of an always-wrong "UTF-8 minus the bad bits". It'd be good to warn when this fallback occurs.
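A sketch of that fallback (decode_with_fallback is a hypothetical helper, not messytables API):

    import warnings

    def decode_with_fallback(data):
        # Try strict UTF-8 first; fall back to latin-1 on failure.
        try:
            return data.decode('utf-8')
        except UnicodeDecodeError:
            # latin-1 maps every byte to a code point, so this always succeeds,
            # though it may mangle text that was really in some other encoding.
            warnings.warn('not valid UTF-8; falling back to latin-1')
            return data.decode('latin-1')

    print repr(decode_with_fallback('Gr\xf8n'))  # u'Gr\xf8n', with a warning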
My result of the following:

    f = open(result['saved_file'], 'rb')
    f_test = f.read()
    print f_test[-100:]

gives me this:

    651972, 12.579757761196365 55.66998698041308))",Grøn,Grøn betalingszone,2010-02-22,2013-07-17,21
But yes, it does say ascii when using read(2000)...
Ah; I was in an interactive Python session, so

    f[-100:]

is equivalent to

    print repr(f[-100:])

repr shows non-ASCII bytes as escapes like \xf8, whereas print writes the raw bytes straight to the terminal.
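For example, in an interactive session (assuming a terminal that renders the bytes as ISO-8859-1):

    >>> s = 'Gr\xf8n'
    >>> s          # the prompt echoes repr(s)
    'Gr\xf8n'
    >>> print s    # print writes the raw bytes to the terminal
    Grøn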
When I try to open the file as UTF-8:

    f = codecs.open(result['saved_file'], 'rb', 'utf-8')

I get the error:
    /usr/lib/ckan/default/local/lib/python2.7/site-packages/chardet/universaldetector.py:90: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if aBuf[:len(chunk)] == chunk:
    2013-10-29 11:09:22,057 ERROR [root] 'ascii' codec can't encode character u'\xd8' in position 1456: ordinal not in range(128)
    ...
Any idea on how to fix this? Isn't UTF-8 a superset of ASCII? I think I got lost somewhere... Should messytables be able to handle ISO-8859-1 encoded files? Should messytables be able to handle ASCII characters?
I'm kinda lost as to whether I should fix the file, messytables, or...?
The error message "'ascii' codec can't encode..." is caused by something trying to convert a character outside ASCII (e.g. a byte above 127 in ISO-8859, or a Unicode code point beyond 127) to ASCII; ASCII simply has no way to represent it.
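You can reproduce it in one line (u'\xd8' is Ø, the character from your traceback):

    >>> u'\xd8'.encode('ascii')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xd8' in position 0: ordinal not in range(128)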
Yes, messytables should handle ISO-8859-1 files; however, it is not correctly detecting this file because the detection is truncated at 2k. I'm not sure why chardet is trying to coerce data into ASCII; it shouldn't be doing that. Your data isn't UTF-8, so attempting to decode it as UTF-8 is bound to fail.
I believe this would be fixed if (see the sketch below):

1) we didn't truncate the analysis at 2k
2) we fell back to ISO-8859-1, not UTF-8, if the file can't be decoded correctly
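A sketch of both changes together (hypothetical code, not the current messytables implementation; 'ows.csv' stands in for the real path):

    import chardet

    raw = open('ows.csv', 'rb').read()
    guess = chardet.detect(raw)            # (1) analyse the whole file, not read(2000)
    encoding = guess['encoding'] or 'iso-8859-1'
    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        text = raw.decode('iso-8859-1')    # (2) ISO-8859-1 always decodes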
Removing the 2K limit fixed the issue... Should this be a permanent solution?
And thank you for patience with me. I really do appreciate it.
Glad to hear the specific problem is fixed!
Making it permanent is a little more complicated; some users need to avoid reading the whole file into memory multiple times. We'll have to think about it.
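If loading the whole file at once is the worry, one possible compromise (a sketch using chardet's incremental UniversalDetector, not current messytables behaviour) is to feed the file in chunks, so detection covers the whole file without holding it all in memory:

    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    with open('ows.csv', 'rb') as f:
        # Feed 64 KB at a time; stop early once chardet is confident.
        for chunk in iter(lambda: f.read(65536), ''):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    print detector.result  # e.g. {'encoding': 'ISO-8859-2', 'confidence': ...}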
Are there plans to solve this problem? I am quite interested in solving this problem.