pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.5k stars 815 forks source link

IMDB data with spaCy loads much slower than other datasets #481

Open · bentrevett opened 5 years ago

bentrevett commented 5 years ago

The IMDB dataset takes a considerable amount of time to load (>5 minutes) when using the spaCy tokenizer, compared to other datasets.

The following takes >5 minutes to run:

from torchtext import data
from torchtext import datasets

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
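
Timings here are rough wall-clock numbers; a minimal way to reproduce the measurement:

import time

from torchtext import data
from torchtext import datasets

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()

# Time the full download/extract/tokenize pipeline.
start = time.time()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
print(f'IMDB load time: {time.time() - start:.1f}s')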

Originally, I thought this was a problem with spaCy itself being slow, since with a basic whitespace tokenizer the following takes ~2 seconds to run:

from torchtext import data
from torchtext import datasets

def tokenize(s):
    return s.split(' ')

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

However, when using spaCy with a different dataset (here, the Multi30k translation dataset), loading takes a reasonable amount of time (~10 seconds):

from torchtext import data
from torchtext import datasets

import spacy

spacy_de = spacy.load('de')  # 'de_core_news_sm' on newer spaCy versions
spacy_en = spacy.load('en')  # 'en_core_web_sm' on newer spaCy versions

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = data.Field(tokenize=tokenize_de)
TRG = data.Field(tokenize=tokenize_en)

train_data, valid_data, test_data = datasets.Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

I am not really sure what is causing the issue here. Could it be due to the way the IMDB dataset is stored, with every example in its own .txt file? If so, should there be some processing after downloading to get it into a format that's faster to read?
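
One way to check would be to time reading the raw files separately from tokenizing them. Something like this (the .data/imdb/aclImdb path is just where torchtext extracted the archive for me, so treat it as an assumption):

import glob
import time

import spacy

nlp = spacy.load('en')  # 'en_core_web_sm' on newer spaCy versions

# Time reading every training review into memory: I/O cost only.
start = time.time()
texts = []
for fname in glob.glob('.data/imdb/aclImdb/train/*/*.txt'):
    with open(fname, encoding='utf-8') as f:
        texts.append(f.read())
print(f'read {len(texts)} files in {time.time() - start:.1f}s')

# Then time spaCy tokenization alone on the preloaded texts.
start = time.time()
tokens = [[tok.text for tok in nlp.tokenizer(text)] for text in texts]
print(f'tokenized {len(tokens)} documents in {time.time() - start:.1f}s')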

mttk commented 5 years ago

It might have something to do with the average sentence length (IMDB has a large spread of lengths). I'm unable to check this right now, but could you perhaps test spaCy on the raw IMDB data (preloaded) and do a speed comparison?

There are some speed issues I've noticed as well.
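
For that comparison, spaCy's tokenizer can also be fed the preloaded texts in batches via tokenizer.pipe, which amortizes per-call overhead. A rough sketch, assuming texts holds the raw review strings:

import time

import spacy

nlp = spacy.load('en')  # 'en_core_web_sm' on newer spaCy versions

texts = ['A sample review ...']  # replace with the preloaded raw IMDB reviews

# Batched tokenization via the tokenizer's pipe method.
start = time.time()
docs = list(nlp.tokenizer.pipe(texts, batch_size=1000))
print(f'batched tokenization: {time.time() - start:.1f}s')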