Closed ustctf-zz closed 6 years ago
Can you provide the full code? Some variables are not defined in your snippet. Also, please use code formatting (the `<>` button) for better readability.
I have exactly the same issue. The full code is from a tutorial, as follows:
```python
# For data loading.
from torchtext import data, datasets

if True:
    import spacy
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize_de(text):
        return [tok.text for tok in spacy_de.tokenizer(text)]

    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]

    BOS_WORD = u'<s>'
    EOS_WORD = u'</s>'
    BLANK_WORD = u"<blank>"
    SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
    TGT = data.Field(tokenize=tokenize_en, init_token=BOS_WORD,
                     eos_token=EOS_WORD, pad_token=BLANK_WORD)

    MAX_LEN = 100
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT),
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
            len(vars(x)['trg']) <= MAX_LEN)
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
```
The error that results is
```
.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.en.xml
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-d787c2f8b289> in <module>()
     24 train, val, test = datasets.IWSLT.splits(
     25     exts=('.de', '.en'), fields=(SRC, TGT),
---> 26     filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
     27         len(vars(x)['trg']) <= MAX_LEN)
     28 MIN_FREQ = 2

/usr/local/lib/python2.7/dist-packages/torchtext/datasets/translation.pyc in splits(cls, exts, fields, root, train, validation, test, **kwargs)
    138
    139         if not os.path.exists(os.path.join(path, train) + exts[0]):
--> 140             cls.clean(path)
    141
    142         train_data = None if train is None else cls(

/usr/local/lib/python2.7/dist-packages/torchtext/datasets/translation.pyc in clean(path)
    158         for doc in root.findall('doc'):
    159             for e in doc.findall('seg'):
--> 160                 fd_txt.write(e.text.strip() + '\n')
    161
    162         xml_tags = ['<url', '<keywords', '<talkid', '<description',

TypeError: write() argument 1 must be unicode, not str
```
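For what it's worth, the failure is a Python 2 text/bytes mismatch: `clean()` writes to a file presumably opened in text mode (`io.open` with an encoding), which only accepts `unicode`, while ElementTree's `e.text` can come back as a plain `str`. A minimal, self-contained sketch of the same failure mode (the temp-file path here is just for illustration, not torchtext's actual layout):

```python
import io
import os
import tempfile

# A text-mode file (as presumably opened by torchtext's clean()) only
# accepts unicode in Python 2 (str in Python 3); raw byte strings are
# rejected with exactly this TypeError.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with io.open(path, mode='w', encoding='utf-8') as fd_txt:
    try:
        fd_txt.write(b'some segment text\n')   # Py2 str / bytes -> TypeError
        byte_write_ok = True
    except TypeError:
        byte_write_ok = False
    fd_txt.write(u'some segment text\n')       # unicode text is accepted

print('byte write accepted:', byte_write_ok)   # False
```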
Same error, very simple code to trigger it:

```python
from torchtext import data
from torchtext import datasets

inputs = data.Field(lower=True, include_lengths=True, batch_first=True)
train, dev, test = datasets.IWSLT.splits(root='.data', exts=['.en', '.de'],
                                         fields=[inputs, inputs])
```
@tkondrashov python 2.7, right?
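Until this is fixed upstream (or you switch to Python 3, where ElementTree always returns `str`), one workaround is to coerce the value to text before writing. A minimal sketch, assuming the offending values are UTF-8 encoded; `to_text` is a hypothetical helper, not part of torchtext, which you would apply at the failing line in `translation.py`, e.g. `fd_txt.write(to_text(e.text).strip() + u'\n')`:

```python
def to_text(value, encoding='utf-8'):
    """Return `value` as text, decoding byte strings if necessary."""
    if isinstance(value, bytes):          # Py2 str / Py3 bytes
        return value.decode(encoding)
    return value                          # already text: pass through

# Both byte strings and already-decoded text come out as text:
print(to_text(b'Guten Tag'))   # Guten Tag
print(to_text(u'Guten Tag'))   # Guten Tag
```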
Hi,
I'm using the IWSLT translation dataset in torchtext, but I ran into the following encoding error. The code snippet is:

```python
MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
        len(vars(x)['trg']) <= MAX_LEN)
```
It comes from a third-party PyTorch implementation of the Transformer: https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb
The error is:

```
.data/iwslt/de-en/IWSLT16.TED.tst2011.de-en.en.xml
Traceback (most recent call last):
  File "The+Annotated+Transformer.py", line 773, in <module>
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
  File "/usr/anaconda2/lib/python2.7/site-packages/torchtext/datasets/translation.py", line 140, in splits
    cls.clean(path)
  File "/usr/anaconda2/lib/python2.7/site-packages/torchtext/datasets/translation.py", line 160, in clean
    fd_txt.write(e.text.strip() + '\n')
TypeError: write() argument 1 must be unicode, not str
```
Could you help look into this issue? Thanks!
System config: Ubuntu 14.04, Python 2.7.
Best, Fei