Open feynmanliang opened 8 years ago
This can get pretty tricky with text encodings. My preference is to always operate with unicode, because then iterating over a string is guaranteed to iterate over a "letter" instead of iterating over parts of multi-byte characters. That said, I haven't been very careful about enforcing this!
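To illustrate the point about iteration, here is a minimal stdlib-only sketch (the sample string is hypothetical, not from the theanets codebase) showing that iterating over decoded text yields whole characters, while iterating over the UTF-8 byte encoding splits multi-byte characters apart:

```python
# -*- coding: utf-8 -*-
# Decoded text iterates one element per character; the UTF-8 byte
# encoding iterates one element per byte, so a 2-byte character
# like 'é' shows up as two separate items.
text = u"héllo"                 # hypothetical sample string
encoded = text.encode("utf-8")  # the same text as raw UTF-8 bytes

print(len(list(text)))     # 5 "letters"
print(len(list(encoded)))  # 6 bytes, because 'é' occupies two
```

Under Python 2, iterating the encoded value would yield six one-byte `str` objects; under Python 3 it yields six integers. Either way, the byte-level view no longer corresponds to letters.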
This is additionally complicated by the fact that Py2 and Py3 have different defaults for handling strings. I personally use Py3 but I try to test everything with Py2 as well (see the Travis config).
Which version of Python are you using? Can you try using a "unicode" object instead of a UTF-8 encoded byte sequence to see whether this problem persists? Can you also add a test that runs a unicode object through the recurrent infrastructure and include it in this PR? Finally, this PR breaks an existing test; please fix it.
Thanks for taking a look; I will push some changes soon to address these issues.
Using `codecs.open` (where `path` points to a file with utf8 encoded strings):

```python
with codecs.open(path, 'r', 'utf-8') as handle:
    file_data = handle.read().lower()

text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION * len(file_data))])
text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION * len(file_data)):])
```
or using a `unicode` object:

```python
with open(path, 'r') as handle:
    file_data = unicode(handle.read(), 'utf-8').lower()

text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION * len(file_data))])
text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION * len(file_data)):])
```
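A self-contained version of the first pattern can be run against a temporary file (the sample text is hypothetical) to confirm that `codecs.open` hands back whole characters rather than raw bytes:

```python
import codecs
import os
import tempfile

# Write a UTF-8 encoded corpus to a temporary file, then read it
# back decoded so slicing and iteration operate on characters.
sample = u"Naïve Café"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

with codecs.open(path, "r", "utf-8") as handle:
    file_data = handle.read().lower()
os.remove(path)

print(file_data)       # naïve café
print(len(file_data))  # 10 characters (the file itself holds 12 bytes)
```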
Attempting to use `theanets.recurrent.Text` on a UTF-8 encoded corpus used to give an error. This is fixed by this PR.