Closed JimKillock closed 6 years ago
If pre-cleaning is the way to go, we should add in the EU probes and work through the Common Crawl data, maybe first, as it may contain court-order-blocked URLs we are not aware of
The import process already cleans out unresolvable domains. We could do with adding more processors to the robots.txt checker and the metadata retriever - I think that's where the time would go for this ticket. There are already some changes to allow the robots.txt checker to run on the trusted clean lines; it just needs some testing.
Sounds like a plan.
Can we add .com data to the scheduler?
Thought: is it worth running the .com data through the clean lines (including volunteer lines) to clean out non-functioning domains before doing the full test?
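For what it's worth, a pre-clean pass like that could be as simple as dropping URLs whose hostnames don't resolve before queuing the full test. A minimal sketch below - the helper names (`hostname_of`, `resolves`, `pre_clean`) are hypothetical, not the project's actual import-process API:

```python
import socket
from urllib.parse import urlparse

def hostname_of(url: str) -> str:
    """Extract the hostname from a URL, defaulting scheme-less input to http."""
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).hostname or ""

def resolves(domain: str, timeout: float = 5.0) -> bool:
    """Return True if the domain has at least one DNS address record."""
    try:
        socket.setdefaulttimeout(timeout)
        socket.getaddrinfo(domain, None)
        return True
    except OSError:
        return False

def pre_clean(urls):
    """Keep only URLs whose hostnames resolve, caching one lookup per domain."""
    cache = {}
    kept = []
    for url in urls:
        host = hostname_of(url)
        if host not in cache:
            cache[host] = resolves(host)
        if cache[host]:
            kept.append(url)
    return kept
```

Caching per domain matters here since blocklist data tends to have many URLs per domain, so we'd only pay one DNS lookup each.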