piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Add web corpus and pre-trained models #6

piskvorky opened this issue 6 years ago · Status: Open

piskvorky commented 6 years ago

E.g. from Amazon's official Common Crawl dataset: https://aws.amazon.com/public-datasets/common-crawl/

By the way, the "official" pre-trained GloVe vectors were trained on this. It would be interesting to compare them to other models trained on the same dataset (the "official" word2vec was trained on Google News, a different corpus, using completely different preprocessing, so it's not directly comparable).
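
For context, a minimal sketch of what such a comparison might look like through the downloader API this repository backs. The model names below are ones already published in gensim-data; a Common Crawl-trained GloVe model would need its own entry:

```python
# Sketch: load two pre-trained models via the gensim downloader API and
# compare their nearest neighbours. Model names are illustrative (both
# already exist in gensim-data); a Common Crawl GloVe model is not among them.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # GloVe, Wikipedia + Gigaword
w2v = api.load("word2vec-google-news-300")    # word2vec, Google News

for name, model in [("glove", glove), ("word2vec", w2v)]:
    print(name, model.most_similar("king", topn=3))
```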

menshikh-iv commented 6 years ago

This is super-large; we'd need a new store for it. See http://commoncrawl.org/2017/02/january-2017-crawl-archive-now-available/ for the January 2017 dump sizes.

piskvorky commented 6 years ago

What does "super-large" mean, can you be more specific?

EDIT: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57,800 files in WET (plaintext) format. Is that right?

menshikh-iv commented 6 years ago

@piskvorky not quite: 8.97 TiB for 57,800 *compressed* WET files. Moreover, those figures are for a dump that is a year old by now (the current dump is bigger).
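
For scale, the per-file arithmetic implied by those numbers:

```python
# Back-of-the-envelope: average compressed WET file size implied by the
# January 2017 figures quoted above.
total_tib = 8.97
n_files = 57_800
mib_per_file = total_tib * 1024**2 / n_files   # 1 TiB = 1024**2 MiB
print(f"~{mib_per_file:.0f} MiB per compressed WET file")   # ~163 MiB
```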

"Super-large" means significantly more than the current wiki dump: we can add something up to ~10 GB, but anything bigger is really problematic.

In addition to needing a different repository for "super-large" files, we would also have to implement resumable downloads, which is rather difficult.
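
Resuming a download is mostly a matter of HTTP Range requests; a minimal sketch, assuming the server honours byte ranges (Common Crawl's S3 hosting does):

```python
# Sketch of a resumable download using an HTTP Range request: pick up
# from however many bytes are already on disk.
import os
import requests

def resume_download(url, path, chunk_size=1 << 20):
    done = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={done}-"} if done else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        # 206 = server honoured the range; anything else means start over.
        mode = "ab" if r.status_code == 206 else "wb"
        with open(path, mode) as f:
            for chunk in r.iter_content(chunk_size):
                f.write(chunk)
```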

piskvorky commented 6 years ago

OK, this one seems to be a challenge :-)

Maybe subsample?

menshikh-iv commented 6 years ago

@piskvorky maybe that's a good idea, but what size should we choose for the sample, and how do we mark explicitly that it is a "sample"? Probably a "sample" prefix in the dataset name?

piskvorky commented 6 years ago

Yes. Size: probably a few GB of bz2 plaintext or JSON.
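
A rough sketch of what producing such a sample could look like. The WET handling here is deliberately naive (it samples raw text lines rather than whole WARC records), and the file names, keep rate, and size budget are placeholders, not part of any actual gensim-data layout:

```python
# Sketch: subsample plain text out of compressed Common Crawl WET files
# into a single bz2 archive, stopping at a size budget. All paths and
# constants are illustrative.
import bz2
import gzip
import random

BUDGET = 3 * 1024**3   # stop after ~3 GB of (uncompressed) sampled text

def sample_wet(wet_paths, out_path="common-crawl-sample.txt.bz2", keep=0.01):
    written = 0
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for path in wet_paths:
            with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
                for line in f:
                    if random.random() < keep:
                        out.write(line)
                        written += len(line)   # approximate: chars, not bytes
                        if written >= BUDGET:
                            return
```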