pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Unicode Error in Using IWSLT dataset: TypeError: write() argument 1 must be unicode, not str #289

Closed ustctf-zz closed 6 years ago

ustctf-zz commented 6 years ago

Hi,

I'm using the IWSLT translation dataset in torchtext, but I'm running into the following encoding error. The code snippet is:

MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
        len(vars(x)['trg']) <= MAX_LEN)

The snippet comes from a third-party PyTorch implementation of the Transformer: https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb

The error is:

.data/iwslt/de-en/IWSLT16.TED.tst2011.de-en.en.xml
Traceback (most recent call last):
  File "The+Annotated+Transformer.py", line 773, in <module>
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
  File "/usr/anaconda2/lib/python2.7/site-packages/torchtext/datasets/translation.py", line 140, in splits
    cls.clean(path)
  File "/usr/anaconda2/lib/python2.7/site-packages/torchtext/datasets/translation.py", line 160, in clean
    fd_txt.write(e.text.strip() + '\n')
TypeError: write() argument 1 must be unicode, not str

Would you help to check the issue? Thanks!

System config: Ubuntu 14.04, Python 2.7.
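For reference, the failure reproduces outside torchtext. Below is a minimal Python 2 sketch of the behaviour the traceback points at, assuming translation.py opens the extracted text file with io.open(..., encoding='utf-8'), which is consistent with the error message but not confirmed here; out.txt is just an example path:

# Python 2 only: ElementTree hands back a byte str for ASCII-only segment text,
# while a file opened through io.open(..., encoding='utf-8') accepts unicode only.
import io
import xml.etree.ElementTree as ET

root = ET.fromstring('<doc><seg>hello world</seg></doc>')
seg = root.find('seg').text             # <type 'str'> on Python 2 for ASCII-only text

with io.open('out.txt', mode='w', encoding='utf-8') as fd:
    # fd.write(seg + '\n')              # TypeError: write() argument 1 must be unicode, not str
    fd.write(seg.decode('utf-8') + u'\n')   # decoding to unicode first succeeds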

Best, Fei

domaala commented 6 years ago

Can you provide the full code? Some variables in your snippet are not defined. Also, please use code formatting ("<>") for better readability.

suwangcompling commented 6 years ago

I have exactly the same issue. The full code is from a tutorial, as follows:

# For data loading.
from torchtext import data, datasets

if True:
    import spacy
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize_de(text):
        return [tok.text for tok in spacy_de.tokenizer(text)]

    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]

    BOS_WORD = u'<s>'
    EOS_WORD = u'</s>'
    BLANK_WORD = u"<blank>"
    SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
    TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, 
                     eos_token = EOS_WORD, pad_token=BLANK_WORD)

    MAX_LEN = 100

    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT), 
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
            len(vars(x)['trg']) <= MAX_LEN)
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

The error that results is:

.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.en.xml
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-d787c2f8b289> in <module>()
     24     train, val, test = datasets.IWSLT.splits(
     25         exts=('.de', '.en'), fields=(SRC, TGT),
---> 26         filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
     27             len(vars(x)['trg']) <= MAX_LEN)
     28     MIN_FREQ = 2

/usr/local/lib/python2.7/dist-packages/torchtext/datasets/translation.pyc in splits(cls, exts, fields, root, train, validation, test, **kwargs)
    138 
    139         if not os.path.exists(os.path.join(path, train) + exts[0]):
--> 140             cls.clean(path)
    141 
    142         train_data = None if train is None else cls(

/usr/local/lib/python2.7/dist-packages/torchtext/datasets/translation.pyc in clean(path)
    158                 for doc in root.findall('doc'):
    159                     for e in doc.findall('seg'):
--> 160                         fd_txt.write(e.text.strip() + '\n')
    161 
    162         xml_tags = ['<url', '<keywords', '<talkid', '<description',

TypeError: write() argument 1 must be unicode, not str
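Until this is fixed upstream, here is a hedged sketch of what a Python 2-safe version of the failing write could look like. write_segment is a hypothetical helper, not torchtext API; e and fd_txt mirror the names shown in the traceback from translation.py's clean():

# Hypothetical helper: decode the byte str that ElementTree returns for
# ASCII-only <seg> text on Python 2 into unicode before writing it out.
def write_segment(fd_txt, e):
    text = e.text.strip()
    if isinstance(text, bytes):        # bytes is an alias for str on Python 2
        text = text.decode('utf-8')
    fd_txt.write(text + u'\n')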

skondrashov commented 6 years ago

Same error; very simple code to trigger it:

from torchtext import data
from torchtext import datasets

inputs = data.Field(lower=True, include_lengths=True, batch_first=True)
train, dev, test = datasets.IWSLT.splits(root='.data', exts=['.en', '.de'], fields=[inputs, inputs])
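A quick way to confirm the environment when reporting this, since the str vs. unicode distinction only exists on Python 2:

# Print the interpreter version and the installed torchtext release.
import sys
import pkg_resources

print(sys.version)
print(pkg_resources.get_distribution('torchtext').version)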

mttk commented 6 years ago

@tkondrashov python 2.7, right?