Open ghost opened 7 years ago
https://commoncrawl.org/
We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
I'm not sure how much data it is, but certainly a few TB.
Oh:
The crawl archive for October 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-43/. It contains 3.65 billion web pages and over 300 TiB of uncompressed content.
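For a rough sense of scale, here is a minimal sketch that works only from the figures in the announcement above. The `s3://` URI form and the helper names `s3_prefix` and `avg_bytes_per_page` are assumptions made for illustration, not part of any Common Crawl tooling. Since the announcement says "over 300 TiB", the per-page average is a lower bound.

```python
TIB = 2**40  # bytes in one tebibyte


def s3_prefix(crawl_id: str, bucket: str = "commoncrawl") -> str:
    """Build the bucket prefix for a crawl (s3:// URI form is an assumption)."""
    return f"s3://{bucket}/crawl-data/{crawl_id}/"


def avg_bytes_per_page(total_tib: float, pages: float) -> float:
    """Average uncompressed bytes per page, given a total in TiB and a page count."""
    return total_tib * TIB / pages


# Figures quoted in the October 2017 announcement.
prefix = s3_prefix("CC-MAIN-2017-43")
avg = avg_bytes_per_page(300, 3.65e9)

print(prefix)
print(f"at least {avg / 1024:.1f} KiB of uncompressed content per page")
```

So "a few TB" undershoots by about two orders of magnitude, and anything built on this would probably need to work with a subset of the crawl.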