Open dantheta opened 6 years ago
This looks great :)
Could we escalate this import and load it into the Test scheduler?
Flagging this as a good next step
I think what we want is probably here, but it's 218GB: https://commoncrawl.s3.amazonaws.com/projects/url-index/url-index.1356128792
More information here: https://commoncrawl.org/2013/01/common-crawl-url-index/
I did import a batch of Common Crawl data (at domain level only, not individual pages).
It went in through the generic CSV importer after I pre-processed some of the index files. The process is repeatable as a manual exercise, but it is not automated or scripted.
The CC data I was using was at https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/cc-index.paths.gz
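For anyone wanting to script the pre-processing step, here is a rough sketch of how it might look. It is not the procedure actually used for the import above. It assumes the index files listed in cc-index.paths.gz are gzipped text with one `<SURT key> <timestamp> <JSON record>` entry per line, that the JSON record carries a `url` field, and that "domain level" can be approximated by the URL's hostname. The function names and the single `domain` CSV column are made up for illustration; the generic CSV importer may expect a different layout.

```python
import csv
import gzip
import json
from urllib.parse import urlsplit


def hostnames_from_cdx(lines):
    """Yield each distinct hostname in CDX-style index lines, in first-seen order."""
    seen = set()
    for line in lines:
        # Assumed line layout: "<SURT key> <timestamp> <JSON record>"
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose third field is not valid JSON
        if not isinstance(record, dict):
            continue
        host = urlsplit(record.get("url", "")).hostname
        if host and host not in seen:
            seen.add(host)
            yield host


def write_domain_csv(lines, out):
    """Write one hostname per row (hypothetical 'domain' header); return the row count."""
    writer = csv.writer(out)
    writer.writerow(["domain"])
    count = 0
    for host in hostnames_from_cdx(lines):
        writer.writerow([host])
        count += 1
    return count


def convert_index_file(index_path, csv_path):
    """Convert one downloaded, gzipped index file into a domain-level CSV."""
    with gzip.open(index_path, "rt", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        return write_domain_csv(src, dst)
```

Calling `convert_index_file` on a downloaded index file would give a CSV that could then be fed to the importer by hand. Note that the set of seen hostnames is held in memory, so a very large index file may need a different de-duplication approach.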
Is the dataset available for testing? I couldn't see it as available for test scheduling.
http://commoncrawl.org/ - searchable by ccTLD.