Closed by tokestermw 7 years ago
The unfortunate answer is that I personally don't have any tools for Common Crawl, and the lab seems to have misplaced the original scripts used for GloVe. I have gotten this question from others, though, so it would be nice to have a nontrivial answer. You can of course look through the website itself: http://commoncrawl.org/. But getting up to speed is a somewhat painful process, as you have to learn both the S3 tools and the WARC format, and the scripts are slow because they operate over gigabytes of data. If anyone is aware of a tutorial that shows the easiest way to get started with Common Crawl in Python/Java/C++, please reply here. Otherwise, I can't give much help.
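For anyone who wants a feel for the WARC side of this, here is a rough sketch of reading records with only the Python standard library. To be clear, this is not the original GloVe preprocessing script (that one is still lost), and it does not cover fetching files from S3. The function names (`iter_warc_records`, `iter_wet_text`, `_record`) and the in-memory `SAMPLE` are made up for illustration. It assumes the input is a gzipped WARC/WET file where each record has a `Content-Length` header, and that the extracted-text records you want are the ones with `WARC-Type: conversion`, as in Common Crawl's WET files.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The payload is read by length, so blank lines inside it are safe.
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def iter_wet_text(stream):
    """Yield (url, text) for the extracted-text ("conversion") records."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") == "conversion":
            yield headers.get("warc-target-uri"), payload.decode("utf-8", "replace")


def _record(warc_type, body, uri=None):
    """Build one tiny WARC record (hypothetical helper, for the demo only)."""
    lines = ["WARC/1.0", "WARC-Type: " + warc_type]
    if uri:
        lines.append("WARC-Target-URI: " + uri)
    lines.append("Content-Length: %d" % len(body))
    return "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"


# Made-up sample: two records, each gzipped as its own member, then concatenated.
SAMPLE = gzip.compress(_record("warcinfo", b"software: example")) + gzip.compress(
    _record("conversion", b"Hello world.\n\nSecond paragraph.", "http://example.com/")
)

if __name__ == "__main__":
    # For a real download, pass the local file path to gzip.open instead.
    with gzip.open(io.BytesIO(SAMPLE)) as stream:
        for url, text in iter_wet_text(stream):
            print(url, len(text))
```

Python's `gzip` module reads concatenated gzip members transparently, which is why a single `gzip.open` call is enough here. For anything beyond a quick experiment, a dedicated WARC library is probably the safer choice, since this sketch does no validation beyond the version line.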
Is either the Common Crawl data or the script to get the data available anywhere?
Thanks.