okfn / messytables

Tools for parsing messy tabular data. This is now superseded by https://github.com/frictionlessdata/tabulator-py
http://messytables.readthedocs.io/

International characters are removed. #99

Open cphsolutionslab opened 10 years ago

cphsolutionslab commented 10 years ago

The code below removes international characters, or so it seems. When I open the saved_file, which is UTF-8, and print out the content, all my international characters are shown correctly. After any_tableset runs and I print out all the rows, all the international characters have been removed.

    f = open(result['saved_file'], 'rb')
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )
        # only first sheet in xls for time being
        row_set = table_sets.tables[0]
        offset, headers = headers_guess(row_set.sample)
scraperdragon commented 10 years ago

I've knocked together a similar case but can't reproduce the bug:

saved_file:

    cat,dog,mouseß
    æï,ñø,åÅ

code:

    from messytables import any_tableset, headers_guess

    for i in (1,):
        f = open('saved_file', 'rb')
        table_sets = any_tableset(f)
        row_set = table_sets.tables[0]
        print list(row_set)
        offset, headers = headers_guess(row_set.sample)
        print offset, headers

    for row in row_set:
        for cell in row:
            print cell.value

output:

    [[<Cell(String:u'cat'>, <Cell(String:u'dog'>, <Cell(String:u'mouse\xdf'>],
     [<Cell(String:u'\xe6\xef'>, <Cell(String:u'\xf1\xf8'>, <Cell(String:u'\xe5\xc5'>]]
    0 [u'cat', u'dog', u'mouse\xdf']
    cat
    dog
    mouseß
    æï
    ñø
    åÅ

which seems right to me. (This happens regardless of whether the file is CSV or XLS)

1) Is it stripping characters or replacing them with gibberish?
2) Can you provide the values of content_type and resource['format'].lower()?
3) Could you run this code with your file and let us know whether it produces the expected strings?
4) If your spreadsheet isn't sensitive, could you let us see it?

Dave.


cphsolutionslab commented 10 years ago

It's currently an issue with DataStorer in CKAN: http://data.kk.dk/dataset/betalingszoner/resource/cde21ea2-6f87-46e1-be1f-f7a0d2cfc985

Debugging shows the characters being stripped (not replaced with gibberish) immediately after:

    table_sets = any_tableset(
        f,
        mimetype=content_type,
        extension=resource['format'].lower()
    )

rossjones commented 10 years ago

Can you 100% confirm that your file is UTF-8? Does chardet think your file is UTF-8?

cphsolutionslab commented 10 years ago

This is the output:

    {'confidence': 0.99, 'encoding': 'utf-8'}

after this:

    f = open(result['saved_file'], 'rb')
    print chardet.detect(f.read())
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )

scraperdragon commented 10 years ago

I don't think the file is UTF-8:

(Python 2)

    >>> f = open("ows.csv", "r").read()
    >>> f[-100:]
    '03651972, 12.579757761196365 55.66998698041308))",Gr\xf8n,Gr\xf8n betalingszone,2010-02-22,2013-07-17,21\r\n'

Given that this is a bytestring, it shouldn't contain any isolated non-ASCII bytes if it were UTF-8, since every non-ASCII character in UTF-8 is encoded as a multibyte sequence; yet \xf8 appears here on its own.

'file' agrees:

    $ file ows.csv
    ows.csv: ISO-8859 text, with very long lines, with CRLF line terminators

Strange that chardet thinks it's UTF-8, given that UTF-8 is one of the easiest things to prove some text isn't.
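To illustrate the point (Python 2): a lone \xf8 byte, ø in ISO-8859-1, is an invalid UTF-8 start byte, so a strict UTF-8 decode fails where a Latin-1 decode succeeds.

    >>> 'Gr\xf8n'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 2: invalid start byte
    >>> 'Gr\xf8n'.decode('iso-8859-1')
    u'Gr\xf8n'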


scraperdragon commented 10 years ago

commas.py, line 24:

    self.reader = codecs.getreader(encoding)(f, 'ignore')

... I'm not sure silently ignoring characters that don't decode will ever be the correct behaviour.
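For illustration only (Python 2; not messytables code, and the sample bytes are shortened from the file above), this is what that 'ignore' reader does when the encoding has been misdetected as UTF-8: the undecodable \xf8 bytes are silently dropped, which is exactly the "Grøn" to "Grn" stripping reported here.

    >>> import codecs, StringIO
    >>> f = StringIO.StringIO('Gr\xf8n,Gr\xf8n betalingszone\r\n')
    >>> reader = codecs.getreader('utf-8')(f, 'ignore')
    >>> reader.read()
    u'Grn,Grn betalingszone\r\n'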

Also, only checking the start of the file causes misdetection in chardet: the big polygon coordinates are pure ASCII, so a truncated sample never reaches the non-ASCII bytes.

    >>> chardet.detect(open("/home/dragon/ows.csv").read())
    {'confidence': 0.766658867395801, 'encoding': 'ISO-8859-2'}
    >>> chardet.detect(open("/home/dragon/ows.csv").read(2000))
    {'confidence': 1.0, 'encoding': 'ascii'}

We could fall back to a possibly-mangling latin-1 instead of an always-wrong UTF-8 minus the bad bits. It'd be good to warn when this had occurred.


cphsolutionslab commented 10 years ago

My result of the following:

    f = open(result['saved_file'], 'rb')
    f_test = f.read()
    print f_test[-100:]

gives me this:

    651972, 12.579757761196365 55.66998698041308))",Grøn,Grøn betalingszone,2010-02-22,2013-07-17,21

cphsolutionslab commented 10 years ago

But yes, it does say ascii when using read(2000)...

scraperdragon commented 10 years ago

Ah; I was in an interactive Python session, so f[-100:] is equivalent to print repr(f[-100:]). That's why my paste shows escape sequences where your print shows the characters themselves.
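A quick illustration of the difference (Python 2; how the last line renders depends on your terminal's encoding):

    >>> s = 'Gr\xf8n'
    >>> s            # the interactive echo shows repr(s)
    'Gr\xf8n'
    >>> print s      # print writes the raw bytes to the terminal
    Grøn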

cphsolutionslab commented 10 years ago

When I try to open the file as UTF-8:

    f = codecs.open(result['saved_file'], 'rb', 'utf-8')

I get the error:

    /usr/lib/ckan/default/local/lib/python2.7/site-packages/chardet/universaldetector.py:90: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if aBuf[:len(chunk)] == chunk:
    2013-10-29 11:09:22,057 ERROR [root] 'ascii' codec can't encode character u'\xd8' in position 1456: ordinal not in range(128)
    ...

Any idea on how to fix this? UTF-8 is a superset of ASCII.

cphsolutionslab commented 10 years ago

I think I got lost somewhere... Should messytables be able to handle ISO-8859-1 encoded files? Should it be able to handle non-ASCII characters?

I'm kinda lost as to whether I should fix the file, messytables, or something else.

scraperdragon commented 10 years ago

The error message "'ascii' codec can't encode..." is caused by something trying to convert a character outside ASCII (e.g. a byte bigger than 127 in ISO-8859, or a Unicode code point beyond 127) to ASCII, and ASCII simply has no character to represent it.

Yes, messytables should handle ISO-8859-1 files; however, it is not detecting this file correctly because detection is truncated at 2k. I'm not sure why chardet is trying to coerce data into ASCII. It shouldn't be trying to do that. Your data isn't UTF-8. Attempting to decode it as such is bound to fail.

I believe this would be fixed if:

1) we didn't truncate the analysis at 2k
2) we fell back to ISO-8859-1, not UTF-8, if the file can't be decoded correctly

(A sketch of that combined approach is below.)
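A minimal sketch of both fixes together, assuming chardet is available; the helper name detect_and_decode is hypothetical, not messytables' actual API:

    import warnings

    import chardet

    def detect_and_decode(path):
        # Fix 1: run detection over the whole file, not just the first
        # 2000 bytes, so the detector actually sees the non-ASCII bytes.
        raw = open(path, 'rb').read()
        guess = chardet.detect(raw)
        encoding = guess['encoding'] or 'iso-8859-1'
        try:
            # Decode strictly so bad bytes raise instead of being
            # silently dropped by an 'ignore' handler.
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # Fix 2: fall back to ISO-8859-1, which maps every byte to a
            # character, so nothing is removed; warn because the result
            # may still be mangled if the real encoding differs.
            warnings.warn('%s: decoding as %s failed, falling back to '
                          'ISO-8859-1' % (path, encoding))
            return raw.decode('iso-8859-1')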

cphsolutionslab commented 10 years ago

Removing the 2K limit fixed the issue... Should this be a permanent solution?

And thank you for your patience with me. I really do appreciate it.

scraperdragon commented 10 years ago

Glad to hear the specific problem is fixed!

Making it permanent is a little more complicated; some users need to avoid loading the whole file multiple times. We'll have to think about it.

guibos commented 3 years ago

Quoting scraperdragon above:

"I believe this would be fixed if: 1) we didn't truncate the analysis at 2k, 2) we fell back to ISO-8859-1, not UTF-8, if the file can't be decoded correctly."

There are plans to solve this problem, and I am quite interested in working on it.