bentrevett opened this issue 5 years ago
There are some speed issues I've noticed as well.
Loading the IMDB dataset with the spaCy tokenizer takes a considerable amount of time (>5 minutes) compared to other datasets.
The following takes >5 minutes to run:
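(As a minimal sketch, assuming torchtext's legacy `Field`/`IMDB.splits` API and an installed spaCy English model; the exact snippet may differ slightly.)

```python
from torchtext import data, datasets

# tokenize='spacy' makes torchtext run spaCy's tokenizer on every example
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()

# downloads/extracts aclImdb and tokenizes all 50,000 reviews up front
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
```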
Originally, I thought this was a problem with spaCy itself being slow, since with a basic tokenizer the same loading takes ~2 seconds:
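(Again a sketch; the default `Field` tokenizer is just a plain `str.split`.)

```python
from torchtext import data, datasets

# no tokenize argument: the default tokenizer splits on whitespace
TEXT = data.Field()
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
```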
However, when using spaCy with a different dataset (the Multi30k translation dataset), loading takes a reasonable amount of time (~10 seconds):
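(Roughly as below; this assumes `Field`'s `tokenizer_language` argument and spaCy's `de`/`en` models, so details may differ.)

```python
from torchtext import data, datasets

# same tokenize='spacy' path, but on Multi30k instead of IMDB
SRC = data.Field(tokenize='spacy', tokenizer_language='de')
TRG = data.Field(tokenize='spacy', tokenizer_language='en')

train_data, valid_data, test_data = datasets.Multi30k.splits(
    exts=('.de', '.en'), fields=(SRC, TRG))
```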
I am not really sure what is causing the issue here. Could it be the way the IMDB dataset is stored, with every example in its own .txt file? If so, should there be some processing after downloading to get it into a format that is faster to read?

---

It might have something to do with the average sentence length (IMDB has a large spread of lengths). I'm unable to check this right now, but could you perhaps test spaCy on raw IMDB data (preloaded) and do some speed comparison?
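For reference, a minimal sketch of that comparison (hypothetical path; it assumes the aclImdb archive is already extracted and spaCy's `en` model is installed), preloading the raw reviews so only tokenization is timed:

```python
import time
from pathlib import Path

import spacy

nlp = spacy.load('en')  # spaCy 2.x English model

# preload the raw reviews so file I/O is excluded from the timing
texts = [p.read_text(encoding='utf-8')
         for p in Path('aclImdb/train/pos').glob('*.txt')]  # hypothetical path

start = time.time()
tokenized = [[tok.text for tok in nlp.tokenizer(text)] for text in texts]
print(f'tokenized {len(texts)} reviews in {time.time() - start:.1f}s')
```

Timing the tokenizer alone, separately from reading the per-example .txt files, would show whether the bottleneck is spaCy or the dataset's on-disk layout.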