Closed: marikgoldstein closed this issue 7 years ago
The fix commit changes several calls from open() to io.open() in torchtext/datasets/translation.py. This makes it possible to specify UTF-8 encoding explicitly when reading, which Python 2 does not do by default.
Same thing happened to me.
```python
def __init__(self, path, exts, fields, **kwargs):
    """Create a TranslationDataset given paths and fields.

    Arguments:
        path: Common prefix of paths to the data files for both languages.
        exts: A tuple containing the extension to path for each language.
        fields: A tuple containing the fields that will be used for data
            in each language.
        Remaining keyword arguments: Passed to the constructor of
            data.Dataset.
    """
    if not isinstance(fields[0], (tuple, list)):
        fields = [('src', fields[0]), ('trg', fields[1])]

    src_path, trg_path = tuple(os.path.expanduser(path + x) for x in exts)

    examples = []
    with open(src_path, encoding='utf-8') as src_file, open(trg_path, encoding='utf-8') as trg_file:
        for src_line, trg_line in zip(src_file, trg_file):
            src_line, trg_line = src_line.strip(), trg_line.strip()
            if src_line != '' and trg_line != '':
                examples.append(data.Example.fromlist(
                    [src_line, trg_line], fields))

    super(TranslationDataset, self).__init__(examples, fields, **kwargs)
```
I changed the code in translation.py from `with open(src_path) as src_file, open(trg_path) as trg_file:` to `with open(src_path, encoding='utf-8') as src_file, open(trg_path, encoding='utf-8') as trg_file:`
@righ120 when #426 lands, this should be fixed. Please let me know if it isn't.
@nelson-liu: I incorrectly brought this up in pull #52, so I'm opening a new issue here.
When trying to load splits for IWSLT (French, German, etc.), the loading process fails with an ASCII encoding/decoding error:
These are my library versions:
Here is the code that I was using, from test/translation.py:
The following fixed it for me, in torchtext/datasets/translation.py: replace the open() calls with io.open() calls that specify encoding='utf-8', which is what Python 2 needs. It's worth noting that a friend on Python 3 did not have this problem.
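For anyone reading later, here is a minimal sketch of the idea rather than the actual patch. The helper name `read_parallel_lines` is made up for illustration and is not part of torchtext; the only thing it relies on is that `io.open` accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io


def read_parallel_lines(src_path, trg_path, encoding='utf-8'):
    """Hypothetical helper (not torchtext API): return non-empty line pairs.

    Mirrors the reading loop in TranslationDataset.__init__, but opens the
    files with io.open so the encoding is explicit on Python 2 as well.
    """
    pairs = []
    # io.open takes `encoding` on Python 2 and 3; the Python 2 built-in
    # open() does not, which is why the files are opened this way.
    with io.open(src_path, encoding=encoding) as src_file, \
            io.open(trg_path, encoding=encoding) as trg_file:
        for src_line, trg_line in zip(src_file, trg_file):
            src_line, trg_line = src_line.strip(), trg_line.strip()
            # Skip a pair if either side is blank, as the dataset code does.
            if src_line != '' and trg_line != '':
                pairs.append((src_line, trg_line))
    return pairs
```

On Python 3 the built-in `open` is the same function as `io.open`, so passing `encoding='utf-8'` to either one behaves identically there.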
@jekbradbury, you were correct in pull #52 that I didn't need the middle block that explicitly encodes/decodes (not shown here), since the file is already opened as UTF-8.