nyuvis / patient-viz

Visualization of electronic medical records or other sequence event data.

invalid character in dictionary #35

Closed JosuaKrause closed 9 years ago

JosuaKrause commented 9 years ago

When trying to create the dictionary for a certain type, the Python JSON encoder chokes on an invalid character. This happens when creating the dictionary for patient 3106AE3A8FE383F4 and when converting the name of the 1977th column of the ./feature_extraction/extract.py output with standard parameters.

err_dict.txt in this case:

    enrichDict(info['output'], info['mid'])
  File "./build_dictionary.py", line 367, in enrichDict
    print(json.dumps(dict, indent=2), file=output)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 209, in encode
    chunks = list(chunks)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
    yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 1: invalid start byte
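The failure can be reproduced outside the tool. A minimal Python 2 sketch, using an illustrative dictionary entry (the key and value are stand-ins, not the actual output of build_dictionary.py) that contains the same raw 0x82 byte reported in the error above:

    import json

    # illustrative stand-in for a dictionary entry built by build_dictionary.py;
    # the byte string contains the raw 0x82 (and 0x8A) bytes that trip the encoder
    entry = {'diagnosis__38600': "M\x82ni\x8are's disease"}

    # on Python 2, json.dumps tries to decode byte strings as UTF-8 and fails with:
    # UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 1: invalid start byte
    print(json.dumps(entry, indent=2))
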
JosuaKrause commented 9 years ago

The offending column is diagnosis__38600. The crash can be reproduced with ./build_dictionary.py -c config.txt -f format.json --lookup diagnosis__38600

JosuaKrause commented 9 years ago

The corresponding name for diagnosis 386.0 is M‚niŠre's disease -- obviously mojibake (it should read Ménière's disease).

JosuaKrause commented 9 years ago

The file is apparently in ISO 8859-1: M‚niŠre's disease

A quick fix: cd code/icd9 && iconv -f ISO-8859-1 -t UTF-8 ucod.txt > ucod2.txt && mv ucod.txt ucod_old.txt && mv ucod2.txt ucod.txt
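
A rough Python equivalent of that iconv call, for reference (same path as in the command; backup handling omitted, and it assumes the file really is ISO-8859-1):

    import io

    # read the file assuming ISO-8859-1 and write it back out as UTF-8
    # (rough equivalent of the iconv call above; no backup copy is kept)
    with io.open('code/icd9/ucod.txt', 'r', encoding='iso-8859-1') as src:
        text = src.read()
    with io.open('code/icd9/ucod.txt', 'w', encoding='utf-8') as dst:
        dst.write(text)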

remram44 commented 9 years ago

You could open the file with io.open(..., 'r', encoding='iso-8859-1') instead of open(..., 'r') so that it gives you correctly decoded unicode instead of str, which you can then feed to the JSON encoder.

chardet can detect the encoding in most cases; of course, there's no way to be 100% sure with these things.
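
A minimal sketch of that approach, decoding at read time so json.dumps only ever sees unicode (the path and the assumed iso-8859-1 encoding are illustrative):

    import io
    import json

    # decode while reading instead of transcoding the file on disk;
    # chardet.detect() on the raw bytes can help guess the encoding,
    # but it cannot guarantee a correct answer
    with io.open('code/icd9/ucod.txt', 'r', encoding='iso-8859-1') as f:
        first_line = f.readline().rstrip('\n')

    # the value is now a unicode object, so the JSON encoder no longer
    # has to guess a byte encoding
    print(json.dumps({'first_line': first_line}, indent=2))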

remram44 commented 9 years ago

(from hangout)

Using the codecs lib instead of io.open() will allow you to use a non-strict error handler:

    import codecs
    from StringIO import StringIO

    def wrap(fp):
        # wrap a binary file object in a UTF-8 reader that replaces
        # undecodable bytes with U+FFFD instead of raising
        reader = codecs.getreader('utf-8')
        return reader(fp, 'replace')

    f1 = StringIO(b'r\xC3\xA9mi')  # be sure to open files in binary mode
    f2 = StringIO(b'r\xE9mi')
    #f3 = open('some/file.txt', 'rb')

    assert wrap(f1).read() == u'r\xE9mi'    # valid UTF-8 decodes normally
    assert wrap(f2).read() == u'r\uFFFDmi'  # the invalid byte becomes U+FFFD
JosuaKrause commented 9 years ago

I guess I will opt for the non-strict error handling. thanks :)

JosuaKrause commented 9 years ago

Okay, we figured out what encoding is used: somebody thought it was iso-8859-1 encoded and converted it to utf-8, but the encoding is in fact ibm437. So the correct quick fix is:

cd code/icd9 && iconv -f utf-8 -t iso-8859-1 ucod.txt | iconv -f ibm437 -t utf-8 > ucod2.txt && mv ucod.txt ucod_old.txt && mv ucod2.txt ucod.txt
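
The same repair in Python, mirroring the iconv pipeline above: undo the accidental iso-8859-1-to-utf-8 conversion to recover the original bytes, then decode them as ibm437 (path as in the command; backup handling omitted):

    import io

    # read the doubly-converted file back as UTF-8 text
    with io.open('code/icd9/ucod.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    # encoding as ISO-8859-1 restores the original byte values, which can
    # then be decoded with the correct codec, ibm437 (a.k.a. cp437)
    fixed = text.encode('iso-8859-1').decode('ibm437')

    with io.open('code/icd9/ucod.txt', 'w', encoding='utf-8') as f:
        f.write(fixed)
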
JosuaKrause commented 9 years ago

Okay -- the quick fix has finally landed :) I take back the bit about iso-8859-1 -- I had converted the file by accident (I'm the somebody from above...) -- ibm437 is the correct encoding

JosuaKrause commented 9 years ago

I guess that's all there is to do...