pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

ascii vs. utf-8 in torchtext/datasets/translation.py #131

Closed: marikgoldstein closed this issue 7 years ago

marikgoldstein commented 7 years ago

@nelson-liu: I incorrectly brought this up in pull #52; opening a new issue here.

When trying to load the IWSLT splits (French, German, etc.), loading fails with an ascii encoding error:

.data/iwslt/de-en/IWSLT16.TED.dev2010.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.de.xml
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN))
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 116, in splits
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 136, in clean
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 60: ordinal not in range(128)
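
For context, here is a minimal sketch of the underlying Python 2 behavior (the file name is hypothetical): plain open() returns a byte stream, so writing a unicode string makes Python 2 implicitly encode it with the default ascii codec, which fails on any non-ASCII character such as u'\xe4'.

import io

text = u'M\xe4dchen'  # contains U+00E4 ("ä"), outside the ASCII range

# Fails on Python 2: file.write() implicitly encodes unicode with 'ascii'.
try:
    with open('demo.txt', 'w') as fd:
        fd.write(text)
except UnicodeEncodeError as e:
    print(e)  # 'ascii' codec can't encode character u'\xe4' ...

# Works on both Python 2 and 3: io.open() takes an explicit encoding.
with io.open('demo.txt', mode='w', encoding='utf-8') as fd:
    fd.write(text)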

These are my library versions:

numpy==1.13.3
regex==2017.9.23
spacy==1.9.0
torch==0.2.0.post4
torchtext==0.2.0b0 (cloned a few minutes before hitting the error)
torchvision==0.1.9

Here is the code that I was using, from test/translation.py:

from torchtext import data
from torchtext import datasets

import re
import spacy
import sys

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

url = re.compile('(<url>.*</url>)')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(url.sub('@URL@', text))]

# Testing IWSLT
DE = data.Field(tokenize=tokenize_de)
EN = data.Field(tokenize=tokenize_en)
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN))

The following fixed it for me in torchtext/datasets/translation.py: replace the plain open() calls with io.open(), specifying utf-8 explicitly, for Python 2. It's worth noting that a friend running Python 3 did not have this problem.

127     @staticmethod
128     def clean(path):
129         for f_xml in glob.iglob(os.path.join(path, '*.xml')):
130             print(f_xml)
131             f_txt = os.path.splitext(f_xml)[0]
132             import io
133             with io.open(f_txt, mode="w", encoding="utf-8") as fd_txt: <--- INSERT
134             #with open(f_txt, 'w') as fd_txt: <--- COMMENT
135                 root = ET.parse(f_xml).getroot()[0]
136                 for doc in root.findall('doc'):
137                     for e in doc.findall('seg'):
138                         fd_txt.write(e.text.strip() + '\n')
139         xml_tags = ['<url', '<keywords', '<talkid', '<description',
140                     '<reviewer', '<translator', '<title', '<speaker']
141         for f_orig in glob.iglob(os.path.join(path, 'train.tags*')):
142             print(f_orig)
143             f_txt = f_orig.replace('.tags', '')
144             with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt, io.open(f_orig, mode='r', encoding='utf-8') as fd_orig: <--- INSERT
145             #with open(f_txt, 'w') as fd_txt, open(f_orig) as fd_orig: <--- COMMENT
146                 for l in fd_orig:
147                     if not any(tag in l for tag in xml_tags):
148                         fd_txt.write(l.strip() + '\n')
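
For reference, here is the whole patched clean() as a self-contained sketch, with import io hoisted to the top and shown as a plain function rather than the @staticmethod it is in translation.py (just the snippet above tidied up; the actual file layout may differ):

import glob
import io
import os
import xml.etree.ElementTree as ET

def clean(path):
    # Extract plain text from the IWSLT .xml dev/test files.
    for f_xml in glob.iglob(os.path.join(path, '*.xml')):
        print(f_xml)
        f_txt = os.path.splitext(f_xml)[0]
        with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt:
            root = ET.parse(f_xml).getroot()[0]
            for doc in root.findall('doc'):
                for e in doc.findall('seg'):
                    fd_txt.write(e.text.strip() + '\n')
    # Strip the metadata tags from the train.tags* files.
    xml_tags = ['<url', '<keywords', '<talkid', '<description',
                '<reviewer', '<translator', '<title', '<speaker']
    for f_orig in glob.iglob(os.path.join(path, 'train.tags*')):
        print(f_orig)
        f_txt = f_orig.replace('.tags', '')
        with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt, \
                io.open(f_orig, mode='r', encoding='utf-8') as fd_orig:
            for l in fd_orig:
                if not any(tag in l for tag in xml_tags):
                    fd_txt.write(l.strip() + '\n')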

@jekbradbury, you were correct in pull #52 that I didn't need the middle block's explicit encoding/decoding (not shown here), since the file is already opened as utf-8.

marikgoldstein commented 7 years ago

The fix commit changes several calls to open() to io.open() in torchtext/datasets/translation.py. This makes it possible to specify utf-8 encoding explicitly when reading, which Python 2's built-in open() does not support.
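
As a sketch of the reading side (hypothetical file name): io.open() decodes to unicode with the given encoding on both Python versions, whereas Python 2's open() hands back raw bytes.

import io

# Under Python 2, open('train.de') would yield undecoded byte strings;
# io.open() decodes each line to unicode, matching Python 3's open().
with io.open('train.de', mode='r', encoding='utf-8') as f:
    first_line = next(f)  # a unicode string, non-ASCII characters intact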

righ120 commented 6 years ago

Same thing happened to me.


def __init__(self, path, exts, fields, **kwargs):
    """Create a TranslationDataset given paths and fields.

    Arguments:
        path: Common prefix of paths to the data files for both languages.
        exts: A tuple containing the extension to path for each language.
        fields: A tuple containing the fields that will be used for data
            in each language.
        Remaining keyword arguments: Passed to the constructor of
            data.Dataset.
    """
    if not isinstance(fields[0], (tuple, list)):
        fields = [('src', fields[0]), ('trg', fields[1])]

    src_path, trg_path = tuple(os.path.expanduser(path + x) for x in exts)

    examples = []
    with open(src_path, encoding='utf-8') as src_file, open(trg_path, encoding='utf-8') as trg_file:
        for src_line, trg_line in zip(src_file, trg_file):
            src_line, trg_line = src_line.strip(), trg_line.strip()
            if src_line != '' and trg_line != '':
                examples.append(data.Example.fromlist(
                    [src_line, trg_line], fields))

    super(TranslationDataset, self).__init__(examples, fields, **kwargs)


I changed the code in translation.py from

with open(src_path) as src_file, open(trg_path) as trg_file:

to

with open(src_path, encoding='utf-8') as src_file, open(trg_path, encoding='utf-8') as trg_file:
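
One caveat: the built-in open() only accepts an encoding argument on Python 3; on Python 2 the same call raises a TypeError. A portable spelling of the same change would use io.open() (sketch with hypothetical paths):

import io

src_path, trg_path = 'train.de', 'train.en'  # hypothetical paths

# io.open() accepts encoding= on both Python 2 and 3; on Python 3 it is
# an alias for the built-in open().
examples = []
with io.open(src_path, encoding='utf-8') as src_file, \
        io.open(trg_path, encoding='utf-8') as trg_file:
    for src_line, trg_line in zip(src_file, trg_file):
        src_line, trg_line = src_line.strip(), trg_line.strip()
        if src_line != '' and trg_line != '':
            examples.append((src_line, trg_line))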

mttk commented 6 years ago

@righ120 when #426 lands, this should be fixed. Please let me know if it isn't.