I think this got broken in #208, where the additional argument path was added to the splits method of TranslationDataset, but the Multi30k and WMT14 super calls to that method were not updated to accommodate the change.
@jekbradbury Want to take a look?
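For context, a rough, untested sketch of what such a fix might look like; the exact signature here is my assumption, the point is only that the super call now has to supply path:

# Hypothetical sketch (not the actual patch): Multi30k.splits downloads the
# data itself and forwards the new `path` keyword to TranslationDataset.splits.
@classmethod
def splits(cls, exts, fields, root='.data',
           train='train', validation='val', test='test2016', **kwargs):
    path = cls.download(root)  # directory where the files actually live
    return super(Multi30k, cls).splits(
        exts, fields, path=path, root=root,
        train=train, validation=validation, test=test, **kwargs)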
I got around this quite easily by downloading with Multi30k.download(DATAROOT) and then using TranslationDataset.splits instead of Multi30k.splits, passing the root path to the path argument instead of the root argument:
from torchtext.data import Field
from torchtext.datasets import TranslationDataset, Multi30k

ROOT = '~/Python/DATASETS/Multi30k/'
Multi30k.download(ROOT)

srcfield = Field()  # configure tokenization, special tokens, etc. as needed
tgtfield = Field()

trnset, valset, testset = TranslationDataset.splits(
    path=ROOT,
    exts=['.en', '.de'],
    fields=[('src', srcfield), ('trg', tgtfield)],
    test='test2016')
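This sidesteps the broken super call entirely, since TranslationDataset.splits accepts the path keyword directly and nothing needs to be forwarded. One caveat (depending on your torchtext version): Multi30k.download may extract the files into a multi30k subfolder of the root, in which case path should point at that subfolder.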
I use this function (after downloading) to preprocess the data and get the iterators:
import spacy
from torchtext.data import interleave_keys, Field
from torchtext.datasets import TranslationDataset
from onmt.inputters import OrderedIterator

def prep_torchtext_multi30k(
        dataroot='~/Python/DATASETS/Multi30k/',
        maxwords=12000,
        bsize=32,
        langs=('de', 'en'),
        exts=('.de', '.en'),  # must line up with the fields below: src is German, trg is English
):
    # modifies dataset loader from https://github.com/A-Jacobson/minimal-nmt
    try:
        de, en = [prep_torchtext_multi30k.nlp.get(lang) for lang in langs]
    except AttributeError:  # first call: spacy models not cached yet
        de, en = [spacy.load(lang, disable=['tagger', 'parser', 'ner']) for lang in langs]
    prep_torchtext_multi30k.nlp = {'en': en, 'de': de}  # repeatedly loading spacy models can use lots of mem

    def tok_src(text):
        return [tok.text for tok in de.tokenizer(text) if not tok.is_space]

    def tok_tgt(text):
        return [tok.text for tok in en.tokenizer(text) if not tok.is_space]

    SRC = Field(tokenize=tok_src, init_token='<s>', eos_token='</s>')
    TGT = Field(tokenize=tok_tgt, init_token='<s>', eos_token='</s>')

    trnset, valset, testset = TranslationDataset.splits(
        path=dataroot,
        exts=exts,
        fields=[('src', SRC), ('trg', TGT)],
        train='train',
        validation='val',
        test='test2016')

    for (nm, field) in [('src', SRC), ('trg', TGT)]:
        trnsubset = getattr(trnset, nm)
        field.build_vocab(trnsubset, max_size=maxwords)

    # ONMT's OrderedIterator subclasses BucketIterator but is better at packing batches together.
    # torchtext's interleave_keys minimizes padding on both the src and trg sides.
    trniter, valiter, tstiter = OrderedIterator.splits(
        datasets=[trnset, valset, testset],
        batch_size=bsize,
        sort_key=lambda ex: interleave_keys(len(ex.src), len(ex.trg)),
        device='cuda')

    return (trnset, valset, testset), (trniter, valiter, tstiter), (SRC.vocab, TGT.vocab)
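For what it's worth, a typical call looks like this (a minimal sketch, assuming the Multi30k files are already on disk, e.g. via Multi30k.download as above):

# Build datasets, iterators and vocabs in one call.
(trn, val, tst), (trniter, valiter, tstiter), (src_vocab, tgt_vocab) = \
    prep_torchtext_multi30k(bsize=64)

batch = next(iter(trniter))
print(batch.src.shape, batch.trg.shape)  # (seq_len, batch_size) tensors, since batch_first=False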
This problem breaks part of the library's functionality, it's about 3 months old, and (correct me if I am wrong) all it takes to fix it is to add the path argument to both splits() and super(Multi30k, cls).splits(). How is this issue not fixed yet (and why isn't there even a PR)?
If no one else wants to, I can submit a PR.
I am not able to download the IMDB dataset. Any idea why this is happening? Following is my code:

import torch
from torchtext import data
from torchtext import datasets
from nltk import word_tokenize

# TEXT and LABEL fields are assumed to be defined earlier (not shown in the original post)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
print(len(train_data))

len(train_data) shows 0. Please help.
Using the following code:
from torchtext import data, datasets
TEXT = data.Field()
LABEL = data.Field()
train, test = datasets.IMDB.splits(TEXT, LABEL)
print(len(train))
25000

Everything seems to work fine. I'm running this on the current pip install of torchtext.
@mttk I figured out that I had to add the root argument to the splits function, so I modified the line of code to

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL, root='data')  # the data will be downloaded in the root dir

and then the data got downloaded in the specified root directory. Thanks anyway :D
I have one more doubt though: when I run this code, the data is tokenized every time, which takes about 2 minutes. This is irritating, so I tried pickling the output and loading it the next time I run the code, but it doesn't seem to be working. Can somebody please help me with this? I am sharing the code below:

import torch
from torchtext import data
from torchtext import datasets
from nltk import word_tokenize
import time, pickle, os

def tokenizer(text):  # create a tokenizer function
    return word_tokenize(text)

TEXT = data.Field(tokenize=tokenizer)
LABEL = data.LabelField(dtype=torch.float)

pkl_name = 'train_test_data.pickle'

if not os.path.exists(pkl_name):
    print('downloading or tokenizing the text...')
    start = time.time()
    train_data, test_data = datasets.IMDB.splits(TEXT, LABEL, root='data')  # the data will be downloaded in the root dir
    print('tokenizer took {} secs'.format(time.time() - start))
    with open(pkl_name, 'wb') as f:
        pickle.dump([train_data, test_data], f, protocol=pickle.HIGHEST_PROTOCOL)
    print('pickle dumped !!!', '\n')
else:
    print('loading the pickle')
    with open(pkl_name, 'rb') as f:
        train_data, test_data = pickle.load(f)
    print('pickle loaded !!!', '\n')

print('len(train_data): ', len(train_data))
print('len(test_data): ', len(test_data), '\n')
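Not an authoritative answer, but a workaround I've seen (an untested sketch under the assumption that the Dataset itself is what fails to pickle): pickle only the examples list, then rebuild the Dataset objects around the same fields on load.

# Pickle only the raw (already tokenized) examples; the Dataset objects hold
# references to Fields and their tokenizer callables, which can make pickling fragile.
with open(pkl_name, 'wb') as f:
    pickle.dump((train_data.examples, test_data.examples), f,
                protocol=pickle.HIGHEST_PROTOCOL)

# On a later run, rebuild Datasets from the pickled examples and fresh Fields:
with open(pkl_name, 'rb') as f:
    train_ex, test_ex = pickle.load(f)
fields = [('text', TEXT), ('label', LABEL)]
train_data = data.Dataset(train_ex, fields)
test_data = data.Dataset(test_ex, fields)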
It just doesn't seem to automatically download the data for either the Multi30k or the WMT14 dataset.
PyTorch version: 0.3.1, TorchText version: 0.2.3
EDIT
I have downgraded my TorchText to version 0.2.1 and I no longer get the error. I had a quick look at the commits between 0.2.1 and 0.2.3 but couldn't figure out which commit introduced the break.