stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.82k stars 1.51k forks

Is the Common Crawl data available? #37

Closed: tokestermw closed 7 years ago

tokestermw commented 8 years ago

Is either the Common Crawl data or the script to get the data available anywhere?

Thanks.

ghost commented 8 years ago

Unfortunately, I don't personally have any tools for working with Common Crawl, and the lab seems to have misplaced the original scripts used for GloVe. Others have asked me this question too, so it would be nice to have a more useful answer. You can of course look through the website itself: http://commoncrawl.org/. But getting up to speed is a somewhat painful process: you have to learn both the S3 tools and the WARC format, and scripts run slowly over gigabytes of data. If anyone knows of a tutorial that shows the easiest way to get started with Common Crawl in Python, Java, or C++, please reply here. Otherwise, I can't offer much help.
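
For anyone trying to get a feel for the WARC side of this, here is a minimal sketch in Python of how records in a gzip-compressed WARC-style file can be walked using only the standard library. This is not the script used for GloVe (that one is lost, as noted above). The helper names `iter_warc_records`, `make_record`, and `extract_texts` are made up for illustration, and the sample data is synthetic so the snippet runs without downloading anything. It assumes plain-text records are marked `WARC-Type: conversion`, which is my understanding of how Common Crawl's extracted-text (WET) files are laid out. For real crawl files a dedicated library such as `warcio` is probably the safer choice.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def make_record(record_type, body, uri=None):
    """Build one synthetic WARC record (hypothetical helper for this demo)."""
    payload = body.encode("utf-8")
    lines = ["WARC/1.0", "WARC-Type: " + record_type]
    if uri is not None:
        lines.append("WARC-Target-URI: " + uri)
    lines.append("Content-Length: %d" % len(payload))
    head = ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")
    return head + payload + b"\r\n\r\n"


def extract_texts(stream):
    """Collect (uri, text) pairs from the plain-text ('conversion') records."""
    return [
        (headers.get("warc-target-uri"), payload.decode("utf-8", "replace"))
        for headers, payload in iter_warc_records(stream)
        if headers.get("warc-type") == "conversion"
    ]


# Synthetic sample: each record is its own gzip member, concatenated together.
sample = gzip.compress(make_record("warcinfo", "software: example")) + gzip.compress(
    make_record("conversion", "Hello world.\nSecond line.", uri="http://example.com/")
)

with gzip.GzipFile(fileobj=io.BytesIO(sample)) as f:
    texts = extract_texts(f)

for uri, text in texts:
    print(uri, len(text.split()), "tokens")
```

To try it on a real file, replace the `io.BytesIO(sample)` part with `gzip.open(path, "rb")` on a downloaded file; the text from `extract_texts` would still need tokenizing before it could be fed to the GloVe tools.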