interrogator / corpkit

A toolkit for corpus linguistics

make_corpus fails with UnicodeDecodeError/TypeError #39

Open alischinsky opened 8 years ago

alischinsky commented 8 years ago

make_corpus() fails when chunking UTF-8 files while parsing. There may be a "decode('utf-8')" missing somewhere.

This happens in both Python 2 (log) and Python 3 (log).

interrogator commented 8 years ago

Texts are opened through saferead() in corpkit/process.py (line 861):


def saferead(path):
    """
    Read a file and report the encoding used to decode it
    :returns: text and its encoding
    """
    import chardet
    import sys
    if sys.version_info.major == 3:
        # Python 3: the file is assumed to be UTF-8; chardet is not consulted
        enc = 'utf-8'
        with open(path, 'r', encoding=enc) as fo:
            data = fo.read()
        return data, enc
    else:
        # Python 2: open() returns a byte string, which is decoded by hand
        with open(path, 'r') as fo:
            data = fo.read()
        try:
            enc = 'utf-8'
            data = data.decode(enc)
        except UnicodeDecodeError:
            # Not valid UTF-8: fall back to chardet's guess
            enc = chardet.detect(data)['encoding']
            data = data.decode(enc, errors='ignore')
        return data, enc
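
For comparison, here is a minimal sketch (not corpkit's actual code) of a reader that behaves the same way on Python 2 and 3 by always reading raw bytes and decoding by hand. The name `read_with_fallback` and the `fallbacks` parameter are hypothetical, and the fixed fallback list stands in for chardet so that the example needs only the standard library. One reason a guard like this may matter: `chardet.detect()` can report `None` as the encoding when it cannot make a guess, and passing `None` to `decode()` is a plausible source of a TypeError, though the logs would need to confirm that.

```python
import io


def read_with_fallback(path, fallbacks=('cp1252', 'latin-1')):
    """
    Hypothetical variant of saferead(): read bytes, try UTF-8 first,
    then each fallback encoding in order.
    :returns: text and the encoding that decoded it
    """
    # Binary mode yields bytes on both Python 2 and Python 3
    with io.open(path, 'rb') as fo:
        raw = fo.read()
    for enc in ('utf-8',) + tuple(fallbacks):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed: substitute undecodable bytes
    return raw.decode('utf-8', 'replace'), 'utf-8'
```

Because the encoding passed to `decode()` always comes from a known list, it can never be `None`, and a non-UTF-8 file no longer raises on Python 3.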

I'll be able to get around to these at some point hopefully, but feel free to submit a PR as well! :)

interrogator commented 8 years ago

I tried a quick fix, but it wasn't really much, in f36ac38. Then I tried to reproduce the error, and couldn't. If your data isn't particularly sacred, could I get a copy and try it out? (Or as I said, feel free to submit a PR yourself)
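
If sharing the data is not an option, a small diagnostic can help narrow the problem down. The sketch below is a hypothetical helper, not part of corpkit: `find_utf8_errors` and its `limit` parameter are made up for illustration. It lists the byte offsets at which a file fails to decode as UTF-8, which is usually enough to build a tiny reproducing file.

```python
import io


def find_utf8_errors(path, limit=5):
    """
    Hypothetical helper: locate byte sequences that are not valid UTF-8.
    :returns: list of (offset, offending_bytes), at most `limit` entries
    """
    with io.open(path, 'rb') as fo:
        raw = fo.read()
    problems = []
    pos = 0
    while pos < len(raw) and len(problems) < limit:
        try:
            raw[pos:].decode('utf-8')
            break  # the rest of the file decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded
            start, end = pos + err.start, pos + err.end
            problems.append((start, raw[start:end]))
            pos = end  # resume scanning after the invalid bytes
    return problems
```

An empty list means the file is valid UTF-8, in which case the failure is more likely to come from a later step than from reading the file.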