Closed JosuaKrause closed 9 years ago
The offending column is diagnosis__38600.
The crash can be reproduced with ./build_dictionary.py -c config.txt -f format.json --lookup diagnosis__38600
The name corresponding to diagnosis 386.0 is M‚niŠre's disease -- obviously mojibake
The file is apparently in ISO 8859: Mnire's disease
A quick fix: cd code/icd9 && iconv -f ISO-8859-1 -t UTF-8 ucod.txt > ucod2.txt && mv ucod.txt ucod_old.txt && mv ucod2.txt ucod.txt
You could open the file with io.open(..., 'r', encoding='iso-8859-1') instead of open(..., 'r'), so that it gives you correctly decoded unicode instead of str, which you can then feed to the JSON encoder.
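A minimal sketch of that suggestion, assuming a small demo file written on the fly (the helper name load_names and the sample line are hypothetical, not taken from the repository). It reads the file with an explicit encoding and passes the decoded text to the JSON encoder:

```python
import io
import json
import os
import tempfile


def load_names(path, encoding):
    """Read one entry per line, decoding the bytes with an explicit encoding."""
    with io.open(path, 'r', encoding=encoding) as f:
        return [line.rstrip(u'\n') for line in f]


# Hypothetical demo data: a single line stored as ISO-8859-1 bytes.
raw = b"386.0 M\xe9ni\xe8re's disease\n"
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(raw)
tmp.close()

names = load_names(tmp.name, 'iso-8859-1')
encoded = json.dumps(names)  # succeeds because the text is already decoded
os.remove(tmp.name)
```

This only helps if the stated encoding is the real one; a wrong guess decodes without error but yields the wrong characters.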
chardet can detect the encoding in most cases; of course, there's no way to be 100% sure with these things.
(from hangout)
Using the codecs lib instead of io.open() will allow you to use a non-strict error handler:
import codecs
from StringIO import StringIO

def wrap(fp):
    # decode as UTF-8, replacing invalid bytes with U+FFFD instead of raising
    reader = codecs.getreader('utf-8')
    return reader(fp, 'replace')

f1 = StringIO(b'r\xC3\xA9mi') # be sure to open files in binary mode
f2 = StringIO(b'r\xE9mi')
#f3 = open('some/file.txt', 'rb')
assert wrap(f1).read() == u'r\xE9mi'
assert wrap(f2).read() == u'r\uFFFDmi'
I guess I will opt for the non-strict error handling. thanks :)
Okay, we figured out what encoding is used: somebody thought it was iso-8859-1 encoded and converted it to utf-8, but the encoding is in fact ibm437. So the correct quick-fix is:
cd code/icd9 && iconv -f utf-8 -t iso-8859-1 ucod.txt | iconv -f ibm437 -t utf-8 > ucod2.txt && mv ucod.txt ucod_old.txt && mv ucod2.txt ucod.txt
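For reference, the same two-step repair can be sketched in Python. This is an illustration only: the function name repair_double_encoding is hypothetical, and it assumes the damage is exactly what is described above (ibm437 bytes misread as iso-8859-1 and then saved as utf-8).

```python
def repair_double_encoding(data):
    """Undo a mistaken iso-8859-1 -> utf-8 conversion of ibm437 bytes."""
    text = data.decode('utf-8')                  # undo the utf-8 encoding step
    original_bytes = text.encode('iso-8859-1')   # recover the original raw bytes
    return original_bytes.decode('cp437')        # decode with the real encoding


# Simulate the accident on a sample string.
original = u"M\xe9ni\xe8re's disease".encode('cp437')
damaged = original.decode('iso-8859-1').encode('utf-8')

repaired = repair_double_encoding(damaged)
```

The round trip through iso-8859-1 is lossless because that codec maps every byte value to exactly one code point, which is why the original bytes can be recovered.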
okay -- finally the quick fix has landed :) I take back the bit about iso-8859-1 -- I converted it by accident (I'm the somebody from above...) -- ibm437 is the correct encoding
I guess that's all there is to do...
When trying to create the dictionary for a certain type, the Python JSON encoder chokes on an invalid character. This happens when trying to create the dictionary for patient 3106AE3A8FE383F4 and when trying to convert the column name of the 1977th column of the ./feature_extraction/extract.py output with standard parameters.
err_dict.txt
in this case: