Open alischinsky opened 8 years ago
Texts are opened through saferead() in corpkit/process.py (line 861):
def saferead(path):
    """
    Read a file with detect encoding
    :returns: text and its encoding
    """
    import chardet
    import sys
    if sys.version_info.major == 3:
        enc = 'utf-8'
        with open(path, 'r', encoding=enc) as fo:
            data = fo.read()
        return data, enc
    else:
        with open(path, 'r') as fo:
            data = fo.read()
        try:
            enc = 'utf-8'
            data = data.decode(enc)
        except UnicodeDecodeError:
            enc = chardet.detect(data)['encoding']
            data = data.decode(enc, errors='ignore')
        return data, enc
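Note that the Python 3 branch above always opens the file as UTF-8 and never reaches the chardet fallback, so a file that is not valid UTF-8 would raise UnicodeDecodeError there. Whether that is the cause of this issue is not confirmed. Below is a minimal sketch of a version-independent variant that reads raw bytes first and decodes afterwards. The name saferead_bytes and the latin-1 fallback are assumptions made for illustration, not part of corpkit.

```python
def saferead_bytes(path, fallback='latin-1'):
    """
    Hypothetical variant of saferead(): read raw bytes, then decode.
    :returns: text and the encoding used to decode it
    """
    # Binary mode behaves the same on Python 2 and Python 3
    with open(path, 'rb') as fo:
        raw = fo.read()
    try:
        enc = 'utf-8'
        return raw.decode(enc), enc
    except UnicodeDecodeError:
        pass
    try:
        # chardet is optional in this sketch; detect() may report None
        import chardet
        enc = chardet.detect(raw)['encoding'] or fallback
    except ImportError:
        # Assumption: latin-1 as a last resort, since it can decode any byte
        enc = fallback
    return raw.decode(enc, errors='ignore'), enc
```

Because the decode step works on bytes in both branches, the same code path runs under either interpreter, which might make the failure easier to reproduce.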
I'll be able to get around to these at some point hopefully, but feel free to submit a PR as well! :)
I tried a quick fix, but it wasn't really much, in f36ac38. Then I tried to reproduce the error, and couldn't. If your data isn't particularly sacred, could I get a copy and try it out? (Or as I said, feel free to submit a PR yourself)
make_corpus() fails when chunking UTF-8 files while parsing. There may be a "decode('utf-8')" missing somewhere.
This is true in both Python 2 (log) and Python 3 (log).