Common Crawl - Githubissues

ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS

https://awesome.ipfs.io/datasets

Common Crawl #162

Status: Open (opened by ghost 7 years ago)

ghost commented 7 years ago

https://commoncrawl.org/

"We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."

I'm not sure how much data it is, but it's certainly at least a few TB.

ghost commented 7 years ago

Oh:

"The crawl archive for October 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-43/. It contains 3.65 billion web pages and over 300 TiB of uncompressed content."
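
For a rough sense of scale, here is a minimal Python sketch that works out the average uncompressed size per page from the figures quoted above and joins the bucket name and prefix into a single location string. The helper names `average_page_bytes` and `s3_uri` are hypothetical, and the `s3://` scheme is an assumption (the announcement only says "the commoncrawl bucket"). Since the announcement says "over 300 TiB", the result is a lower bound.

```python
TIB = 2**40  # bytes in one tebibyte


def average_page_bytes(total_tib: float, pages: float) -> float:
    """Average uncompressed bytes per page (hypothetical helper)."""
    return total_tib * TIB / pages


def s3_uri(bucket: str, prefix: str) -> str:
    """Join bucket and prefix; the s3:// scheme is an assumption."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


# Figures from the October 2017 announcement quoted above.
avg = average_page_bytes(300, 3.65e9)
print(f"at least {avg / 1024:.1f} KiB per page on average")
print(s3_uri("commoncrawl", "crawl-data/CC-MAIN-2017-43/"))
```

That works out to roughly 88 KiB of uncompressed content per page, so a single monthly crawl is about two orders of magnitude larger than "a few TB".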