pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Translation datasets not automatically downloading #312

Closed: bentrevett closed this issue 6 years ago

bentrevett commented 6 years ago

Code:


from torchtext.data import Field
from torchtext.datasets import Multi30k

DE = Field(init_token='<sos>', eos_token='<eos>')
EN = Field(init_token='<sos>', eos_token='<eos>')

train, val, test = Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))

Error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-637d49b65435> in <module>()
----> 1 train, val, test = Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))

~/miniconda3/envs/pytorch/lib/python3.6/site-packages/torchtext/datasets/translation.py in splits(cls, exts, fields, root, train, validation, test, **kwargs)
     99         """
    100         return super(Multi30k, cls).splits(
--> 101             exts, fields, root, train, validation, test, **kwargs)
    102 
    103 

~/miniconda3/envs/pytorch/lib/python3.6/site-packages/torchtext/datasets/translation.py in splits(cls, exts, fields, path, root, train, validation, test, **kwargs)
     62 
     63         train_data = None if train is None else cls(
---> 64             os.path.join(path, train), exts, fields, **kwargs)
     65         val_data = None if validation is None else cls(
     66             os.path.join(path, validation), exts, fields, **kwargs)

~/miniconda3/envs/pytorch/lib/python3.6/site-packages/torchtext/datasets/translation.py in __init__(self, path, exts, fields, **kwargs)
     31 
     32         examples = []
---> 33         with open(src_path) as src_file, open(trg_path) as trg_file:
     34             for src_line, trg_line in zip(src_file, trg_file):
     35                 src_line, trg_line = src_line.strip(), trg_line.strip()

FileNotFoundError: [Errno 2] No such file or directory: '.data/val.de'

It just doesn't seem to automatically download the data for either the Multi30k or the WMT14 dataset.

PyTorch version: 0.3.1, TorchText version: 0.2.3

EDIT

I have downgraded my TorchText to version 0.2.1 and I no longer get the error. I had a quick look at the commits between 0.2.1 and 0.2.3 but couldn't figure out which commit introduced the break.

domaala commented 6 years ago

I think this got broken in #208, where an additional path argument was added to the splits method of TranslationDataset, but the Multi30k and WMT14 super calls to that method were not updated to accommodate the change.
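For reference, here is roughly what the mismatch looks like, judging only from the two frames in the traceback above (a sketch, not the actual source):

# TranslationDataset.splits now takes
#     splits(cls, exts, fields, path, root, train, validation, test, **kwargs)
# but Multi30k.splits still forwards its arguments positionally:
#     super(Multi30k, cls).splits(exts, fields, root, train, validation, test, **kwargs)
# so root is received as path, train as root, validation as train, and so on.
# That is why the loader ends up opening '.data/val.de' as the training file
# and, presumably, why the automatic download never triggers (it would only
# run when path is left unset).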

@jekbradbury Want to take a look?

aa1607 commented 6 years ago

I got around this quite easily by downloading with Multi30k.download(DATAROOT) and then just using TranslationDataset.splits instead of Multi30k.splits. Pass the root path to the path argument instead of the root argument:

from torchtext.data import Field
from torchtext.datasets import TranslationDataset, Multi30k

# the fields referenced below need to be defined somewhere; any Field configuration works
srcfield = Field(init_token='<s>', eos_token='</s>')
tgtfield = Field(init_token='<s>', eos_token='</s>')

ROOT = '~/Python/DATASETS/Multi30k/'
Multi30k.download(ROOT)

(trnset, valset, testset) = TranslationDataset.splits(
                                      path       = ROOT,
                                      exts       = ['.en', '.de'],
                                      fields     = [('src', srcfield), ('trg', tgtfield)],
                                      test       = 'test2016'
                                      )

I use this function (after downloading) to preprocess the data and get the iterators:

import spacy
from torchtext.data import BucketIterator, interleave_keys, Field
from torchtext.datasets import TranslationDataset
from onmt.inputters import OrderedIterator

def prep_torchtext_multi30k(
                          dataroot = '~/Python/DATASETS/Multi30k/',
                          maxwords = 12000,
                          bsize = 32,
                          langs = ['de','en'],
                          exts = ['.de','.en'],   # German ('.de') as source, English ('.en') as target, to match tok_src/tok_tgt below
                          ):

    # modifies dataset loader from https://github.com/A-Jacobson/minimal-nmt

    # cache the spacy models on the function object; reloading them repeatedly uses a lot of memory
    try:
        de, en = [ prep_torchtext_multi30k.nlp.get(lang) for lang in langs ]
    except AttributeError:
        de, en = [ spacy.load(lang, disable=['tagger', 'parser', 'ner']) for lang in langs ]
    prep_torchtext_multi30k.nlp = {'en': en, 'de': de}

    def tok_src(text): return [tok.text for tok in de.tokenizer(text) if not tok.is_space]
    def tok_tgt(text): return [tok.text for tok in en.tokenizer(text) if not tok.is_space]

    SRC = Field( tokenize = tok_src, init_token='<s>',  eos_token='</s>' )
    TGT = Field( tokenize = tok_tgt, init_token='<s>',  eos_token='</s>' )

    trnset, valset, testset = TranslationDataset.splits(   
                                      path       = dataroot,  
                                      exts       = exts,   
                                      fields     = [('src', SRC), ('trg',TGT)],
                                      train      = 'train', 
                                      validation = 'val', 
                                      test       = 'test2016')

    for (nm, field) in [('src', SRC), ('trg',TGT)]:  
        trnsubset = getattr(trnset, nm) 
        field.build_vocab( trnsubset, max_size = maxwords)

    # ONMT's OrderedIterator subclasses BucketIterator but is better at packing batches together.
    # Also use torchtext's interleave_keys as the sort key to minimize padding on both src and trg sides.

    trniter, valiter, tstiter = OrderedIterator.splits(   
                                   datasets = [trnset, valset, testset], 
                                   batch_size = bsize, 
                                   sort_key = lambda ex: interleave_keys(len(ex.src), len(ex.trg)),
                                   device='cuda' )

    return (trnset, valset, testset), (trniter, valiter, tstiter), (SRC.vocab, TGT.vocab)
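For reference, a minimal usage sketch of the function above (it assumes the Multi30k files are already under dataroot, that OpenNMT-py is installed for OrderedIterator, and that a CUDA device is available since the function requests device='cuda'):

(trnset, valset, testset), (trniter, valiter, tstiter), (src_vocab, tgt_vocab) = prep_torchtext_multi30k()

batch = next(iter(trniter))
print(batch.src.size(), batch.trg.size())   # (src_len, batch_size) tensors of token indices
print(len(src_vocab), len(tgt_vocab))       # vocab sizes, capped at maxwords plus special tokens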

rumaak commented 6 years ago

This problem breaks the functionality of part of the library, it's about 3 months old, and (correct me if I am wrong) all it takes to fix it is to add a path argument to both splits() and the super(Multi30k, cls).splits() call. How is this issue not fixed yet (and why isn't there even a PR)?

If no one else wants to, I can submit a PR.
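A sketch of what that fix might look like (hypothetical; written from the signatures visible in the traceback above, not from an actual patch):

# Hypothetical patch for Multi30k.splits: accept path explicitly and forward
# everything by keyword so the arguments stop shifting by one position.
@classmethod
def splits(cls, exts, fields, path=None, root='.data',
           train='train', validation='val', test='test2016', **kwargs):
    # assumes TranslationDataset.splits downloads into root when path is None
    return super(Multi30k, cls).splits(
        exts, fields, path=path, root=root,
        train=train, validation=validation, test=test, **kwargs)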

n0obcoder commented 5 years ago

I am not able to download the IMDB dataset. Any idea why this is happening? Following is my code:

import torch
from torchtext import data
from torchtext import datasets
from nltk import word_tokenize

train_data , test_data = datasets.IMDB.splits(TEXT, LABEL)
print(len(train_data))

len(train_data) shows 0. Please help.

mttk commented 5 years ago

Using the following code:

from torchtext import data, datasets
TEXT = data.Field()
LABEL = data.Field()
train, test = datasets.IMDB.splits(TEXT, LABEL)
print(len(train))

25000

Everything seems to work fine. I'm running this on the current pip install of torchtext.

n0obcoder commented 5 years ago

@mttk I figured out that I had to add the 'root' argument in the splits function, so I modified the line of code to

train_data , test_data = datasets.IMDB.splits(TEXT, LABEL, root = 'data')  # the data will be downloaded in the root dir

and then the data got downloaded in the specified root directory. Thanks anyway :D

n0obcoder commented 5 years ago

I have one more doubt though. When I run this code, the data is tokenized every time, which takes about 2 minutes. This is irritating, so I tried pickling the output and loading it the next time I run the code, but it doesn't seem to be working. Can somebody please help me with this? I am sharing the code below:

import torch
from torchtext import data
from torchtext import datasets
from nltk import word_tokenize
import time, pickle, os

def tokenizer(text):  # create a tokenizer function
    return word_tokenize(text)

TEXT = data.Field(tokenize = tokenizer)
LABEL = data.LabelField(dtype = torch.float)

pkl_name = 'train_test_data.pickle'

if not os.path.exists(pkl_name):
    print('downloading or tokenizing the text...')
    start = time.time()
    train_data , test_data = datasets.IMDB.splits(TEXT, LABEL, root = 'data')  # the data will be downloaded in the root dir
    print('tokenizer took {} secs'.format(time.time() - start))

    with open(pkl_name, 'wb') as f:
        pickle.dump([train_data, test_data], f, protocol = pickle.HIGHEST_PROTOCOL)
    print('pickle dumped !!!', '\n')

else:
    print('loading the pickle')
    with open(pkl_name, 'rb') as f:
        train_data, test_data = pickle.load(f)
    print('pickle loaded !!!', '\n')

print('len(train_data): ', len(train_data))
print('len(test_data): ', len(test_data), '\n')