Handle unicode text - Githubissues

feynmanliang commented 8 years ago

Attempting to use theanets.recurrent.Text on a UTF-8 encoded corpus used to give the following error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/home/fl350/bachbot/scripts/theanet/theanet.py in <module>()
     24 with codecs.open(path, 'r', 'utf-8') as handle:
     25     file_data = handle.read().lower()
---> 26     text = theanets.recurrent.Text(file_data[:int(VAL_FRACTION*len(file_data))])
     27     text_val = theanets.recurrent.Text(file_data[int(VAL_FRACTION*len(file_data)):])
     28

/home/fl350/theanets/theanets/recurrent.py in __init__(self, text, alpha, min_count, unknown)
     89                 collections.Counter(text).items()
     90                 if char != unknown and count >= min_count)))
---> 91         print type(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'))
     92         self.text = re.sub(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'), unknown, text)
     93         assert unknown not in self.alpha

UnicodeEncodeError: 'ascii' codec can't encode character u'\x83' in position 85: ordinal not in range(128)

This is fixed by this PR.
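For readers following the traceback, here is a minimal standalone sketch of the preprocessing step it shows: count characters, keep those seen at least min_count times, and replace everything else with an "unknown" marker. It is not the theanets implementation or the code in this PR. The function name clean_text and its default values are hypothetical. The one point it illustrates is building the regular expression from a unicode pattern, so that Python 2 does not implicitly encode the alphabet as ASCII inside .format().

```python
import collections
import re


def clean_text(text, min_count=2, unknown=u'\0'):
    """Hypothetical helper: return (alphabet, text with rare chars replaced).

    `text` is assumed to be a unicode string (unicode on Py2, str on Py3).
    """
    counts = collections.Counter(text)
    alpha = u''.join(sorted(
        char for char, count in counts.items()
        if char != unknown and count >= min_count))
    if not alpha:
        # An empty character class would be an invalid regex, so
        # treat every character as unknown.
        return alpha, unknown * len(text)
    # The u'' prefix keeps the pattern unicode, so .format() never has to
    # squeeze non-ASCII characters through the ASCII codec on Python 2.
    pattern = u'[^{}]'.format(re.escape(alpha))
    return alpha, re.sub(pattern, unknown, text)


alpha, cleaned = clean_text(u'h\u00e9llo h\u00e9llo w\u00f6rld',
                            min_count=2, unknown=u'?')
```

In this example the characters that occur only once (w, ö, r, d) are replaced by the marker, while é stays in the alphabet because it occurs twice.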

coveralls commented 8 years ago

Coverage decreased (-0.1%) to 94.768% when pulling eaca4337d972edfe1d44a93e2d93701dbab98766 on feynmanliang:text-handle-utf into b637b01bc4f1ef69fda9a23f5637462a1188ebdb on lmjohns3:master.

lmjohns3 commented 8 years ago

This can get pretty tricky with text encodings. My preference is to always operate with unicode, because then iterating over a string is guaranteed to iterate over a "letter" instead of iterating over parts of multi-byte characters. That said, I haven't been very careful about enforcing this!
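To make the point about iteration concrete, here is a small sketch that compares iterating over a unicode string with iterating over its UTF-8 encoding. It is illustrative only and does not touch theanets. The helper name count_units is hypothetical, and the code is written to run on both Python 2 and Python 3.

```python
def count_units(text):
    """Return (number of characters, number of UTF-8 bytes) for unicode text."""
    encoded = text.encode('utf-8')
    return len(text), len(encoded)


word = u'na\u00efve'  # "naive" with a diaeresis on the i
n_chars, n_bytes = count_units(word)

# Iterating over the unicode string yields whole characters.
chars = list(word)

# Iterating over the encoded bytes splits U+00EF into two byte values.
# bytearray yields integers on both Python 2 and Python 3.
raw = list(bytearray(word.encode('utf-8')))
```

Here chars has five entries while raw has six, because the single non-ASCII character occupies two bytes in UTF-8.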

This is additionally complicated by the fact that Py2 and Py3 have different defaults for handling strings. I personally use Py3 but I try to test everything with Py2 as well (see the Travis config).

Which version of Python are you using? Can you try using a "unicode" object instead of a UTF-8 encoded byte sequence to see if this problem persists? Can you add a test to run a unicode object through the recurrent infrastructure and add it to this PR? Also, this PR breaks an existing test; please fix it.

feynmanliang commented 8 years ago

Thanks for taking a look. I will push some changes soon to address the issues.

feynmanliang commented 8 years ago

I'm using Python 2.7.3.
I can reproduce this with the following code (assuming path points to a UTF-8 encoded file):

import codecs
import theanets

with codecs.open(path, 'r', 'utf-8') as handle:
    file_data = handle.read().lower()  # codecs.open decodes to unicode
split = int(TRAIN_FRACTION * len(file_data))
text = theanets.recurrent.Text(file_data[:split])
text_val = theanets.recurrent.Text(file_data[split:])

or, using a unicode object:

import theanets

with open(path, 'r') as handle:
    # Python 2 only: decode the raw bytes into a unicode object.
    file_data = unicode(handle.read(), 'utf-8').lower()
split = int(TRAIN_FRACTION * len(file_data))
text = theanets.recurrent.Text(file_data[:split])
text_val = theanets.recurrent.Text(file_data[split:])

lmjohns3 / theanets

Handle unicode text #131