pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.5k stars 815 forks source link

IMDB data with spaCy loads much slower than other datasets #481

Open · bentrevett opened 5 years ago

bentrevett commented 5 years ago

The IMDB dataset takes a considerable amount of time to load (>5 minutes) when using the spaCy tokenizer, compared to other datasets.

The following takes >5 minutes to run:

from torchtext import data
from torchtext import datasets

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
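
Timings here are rough wall-clock numbers; a minimal way to reproduce the measurement:

import time

from torchtext import data
from torchtext import datasets

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()

# Time the full download/extract/tokenize pipeline.
start = time.time()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
print(f'IMDB load time: {time.time() - start:.1f}s')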

Originally, I thought this was a problem with spaCy itself being slow, since with a basic whitespace tokenizer the following takes ~2 seconds to run:

from torchtext import data
from torchtext import datasets

def tokenize(s):
    return s.split(' ')

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

However, when using spaCy with a different dataset (here, the Multi30k translation dataset), loading takes a reasonable amount of time (~10 seconds):

from torchtext import data
from torchtext import datasets

import spacy

spacy_de = spacy.load('de')  # 'de_core_news_sm' on newer spaCy versions
spacy_en = spacy.load('en')  # 'en_core_web_sm' on newer spaCy versions

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = data.Field(tokenize=tokenize_de)
TRG = data.Field(tokenize=tokenize_en)

train_data, valid_data, test_data = datasets.Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

I am not really sure what is causing the issue here. Could it be due to the way the IMDB dataset is stored, with every example in its own .txt file? If so, should there be some processing after downloading to get it into a format that's faster to read?
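
One way to check would be to time reading the raw files separately from tokenizing them. Something like this (the .data/imdb/aclImdb path is just where torchtext extracted the archive for me, so treat it as an assumption):

import glob
import time

import spacy

nlp = spacy.load('en')  # 'en_core_web_sm' on newer spaCy versions

# Time reading every training review into memory: I/O cost only.
start = time.time()
texts = []
for fname in glob.glob('.data/imdb/aclImdb/train/*/*.txt'):
    with open(fname, encoding='utf-8') as f:
        texts.append(f.read())
print(f'read {len(texts)} files in {time.time() - start:.1f}s')

# Then time spaCy tokenization alone on the preloaded texts.
start = time.time()
tokens = [[tok.text for tok in nlp.tokenizer(text)] for text in texts]
print(f'tokenized {len(tokens)} documents in {time.time() - start:.1f}s')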

mttk commented 5 years ago

It might have something to do with the average sentence length (IMDB has a large spread of lengths). I'm unable to check this right now, but could you perhaps test spaCy on the raw IMDB data (preloaded) and do a speed comparison?

There are some speed issues I've noticed as well.
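
For that comparison, spaCy's tokenizer can also be fed the preloaded texts in batches via tokenizer.pipe, which amortizes per-call overhead. A rough sketch, assuming texts holds the raw review strings:

import time

import spacy

nlp = spacy.load('en')  # 'en_core_web_sm' on newer spaCy versions

texts = ['A sample review ...']  # replace with the preloaded raw IMDB reviews

# Batched tokenization via the tokenizer's pipe method.
start = time.time()
docs = list(nlp.tokenizer.pipe(texts, batch_size=1000))
print(f'batched tokenization: {time.time() - start:.1f}s')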