Pull test URLs from commoncrawl.org

openrightsgroup / blocked-org-uk

Template front-end code, markup, style-sheets, images and other assets for the Censorship Monitoring Project (blocked.org.uk)

https://www.blocked.org.uk/

GNU General Public License v3.0


Pull test URLs from commoncrawl.org #240

Open dantheta opened 6 years ago

dantheta commented 6 years ago

http://commoncrawl.org/ - searchable by ccTLD.

JimKillock commented 6 years ago

This looks great :)

JimKillock commented 5 years ago

Could we escalate this import and load it into the Test scheduler?

JimKillock commented 5 years ago

Flagging this as a good next step

edjw commented 5 years ago

I think what we want is probably here, but it's 218 GB: https://commoncrawl.s3.amazonaws.com/projects/url-index/url-index.1356128792

More information here: https://commoncrawl.org/2013/01/common-crawl-url-index/

dantheta commented 5 years ago

I did import a load of commoncrawl data (just at domain level, not individual page).

It was imported through the generic CSV importer, after pre-processing some of the index files. It's repeatable (as a manual exercise), but not automated or scripted.

The CC data I was using was at https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/cc-index.paths.gz
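
For anyone wanting to script that pre-processing step, here is a rough sketch of one way it could look. It is not the process actually used for the import. It assumes the index files are CDXJ-style text with one record per line (`<SURT key> <timestamp> <JSON record>`, with the page address in the JSON `url` field), and that the importer accepts a one-column CSV; the function names, the `url` header, and the sample lines are all made up for illustration.

```python
import csv
import io
import json
from urllib.parse import urlsplit


def domains_from_cdx(lines, suffix=".uk"):
    """Yield each distinct hostname ending in `suffix`, in first-seen order.

    Assumes CDXJ-style lines: "<SURT key> <timestamp> <JSON record>".
    Lines that do not fit that shape are skipped.
    """
    seen = set()
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        host = (urlsplit(record.get("url", "")).hostname or "").lower()
        if host.endswith(suffix) and host not in seen:
            seen.add(host)
            yield host


def write_domain_csv(hosts, out):
    """Write one URL per row; the 'url' header is a hypothetical column name."""
    writer = csv.writer(out)
    writer.writerow(["url"])
    for host in hosts:
        writer.writerow([f"http://{host}/"])


# Made-up sample records standing in for a decompressed index file.
sample = [
    'uk,co,example)/ 20181015000000 {"url": "http://example.co.uk/", "status": "200"}',
    'uk,co,example)/about 20181015000001 {"url": "http://example.co.uk/about", "status": "200"}',
    'com,example)/ 20181015000002 {"url": "http://example.com/", "status": "200"}',
    'uk,org,blocked)/ 20181015000003 {"url": "https://www.blocked.org.uk/", "status": "200"}',
]

buffer = io.StringIO()
write_domain_csv(domains_from_cdx(sample), buffer)
print(buffer.getvalue())
```

For a real index file, the same generator could be fed from `gzip.open(path, "rt")` instead of the sample list. Deduplicating at hostname level matches the domain-level import described above; keeping individual pages would mean yielding the full `url` value instead.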

JimKillock commented 5 years ago

Is the dataset available for testing? I couldn't see it as available for test scheduling.