piskvorky opened 6 years ago
This is super-large, so we need a new store for it: http://commoncrawl.org/2017/02/january-2017-crawl-archive-now-available/ (this is the January 2017 dump size).
What does "super-large" mean, can you be more specific?
EDIT: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57800 files in WET (plaintext) format. Is that right?
@piskvorky Not quite: 8.97 TiB is for the 57800 *compressed* WET files. On top of that, these numbers are for a one-year-old dump (the current dump is bigger).
Super-large: significantly bigger than the current wiki dump. We can add something <=10 GB, but anything larger is really problematic.
Besides needing a different store for "super-large" files, we would also have to implement resumable downloads (which is rather difficult).
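For reference, a minimal sketch of what resumable downloading could look like using plain HTTP Range requests (assuming the server honours them; the function name, chunk size, and lack of retry logic are just illustrative):

```python
import os
import requests


def download_with_resume(url, path, chunk_size=1024 * 1024):
    """Download `url` to `path`, resuming from a partial file if one exists."""
    pos = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": "bytes=%d-" % pos} if pos else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as response:
        response.raise_for_status()
        # 206 means the server honoured the Range header and we can append;
        # 200 means it sent the whole file, so start over from scratch.
        mode = "ab" if response.status_code == 206 else "wb"
        with open(path, mode) as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```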
OK, this one seems to be a challenge :-)
Maybe subsample?
@piskvorky That might be a good idea, but what size should we choose for the sample, and how do we mark explicitly that it is a sample? Probably a "sample" prefix in the dataset name?
Yes. Size: probably a few GBs of bz2 plaintext or JSON.
E.g. from Amazon's official Common Crawl dataset: https://aws.amazon.com/public-datasets/common-crawl/
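A rough sketch of how such a sample could be produced from the WET files (assuming the warcio package for WARC/WET parsing; the keep fraction, paths, and output format are only placeholders, not a decided spec):

```python
import bz2
import random

from warcio.archiveiterator import ArchiveIterator  # pip install warcio


def sample_wet_to_bz2(wet_paths, out_path, keep_fraction=0.01, seed=42):
    """Write a random ~1% sample of plaintext WET records into one .bz2 file."""
    rng = random.Random(seed)
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for wet_path in wet_paths:
            # ArchiveIterator auto-detects the gzip compression of .wet.gz files.
            with open(wet_path, "rb") as stream:
                for record in ArchiveIterator(stream):
                    if record.rec_type != "conversion":  # skip warcinfo headers
                        continue
                    if rng.random() > keep_fraction:
                        continue
                    text = record.content_stream().read().decode("utf-8", errors="replace")
                    out.write(text.strip() + "\n\n")
```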
By the way, the "official" pre-trained GloVe vectors were trained on this. It would be interesting to compare with other models trained on the same dataset (the "official" word2vec was trained on Google News, a different corpus, using completely different preprocessing, so it's not directly comparable).