Closed JimKillock closed 6 years ago
If pre-cleaning is the way to go, we should add in the EU probes and work through the Common Crawl data, maybe first, as it may contain court-order-blocked URLs we are not aware of
The import process already cleans out unresolvable domains. We could do with adding more processors to the robots.txt checker and the metadata retriever - I think that's where the time would go for this ticket. There are already some changes to allow the robots.txt checker to run on the trusted clean lines; it just needs some testing.
Sounds like a plan.
Can we add .com data to the scheduler?
Thought: is it worth running the .com data through the clean lines (including volunteer lines) to clean out non-functioning domains before doing the full test?
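For what it's worth, a pre-clean pass like that could be as simple as dropping URLs whose hostnames don't resolve before queuing the full test. A minimal sketch below - the helper names (`hostname_of`, `resolves`, `pre_clean`) are hypothetical, not the project's actual import-process API:

```python
import socket
from urllib.parse import urlparse

def hostname_of(url: str) -> str:
    """Extract the hostname from a URL, defaulting scheme-less input to http."""
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).hostname or ""

def resolves(domain: str, timeout: float = 5.0) -> bool:
    """Return True if the domain has at least one DNS address record."""
    try:
        socket.setdefaulttimeout(timeout)
        socket.getaddrinfo(domain, None)
        return True
    except OSError:
        return False

def pre_clean(urls):
    """Keep only URLs whose hostnames resolve, caching one lookup per domain."""
    cache = {}
    kept = []
    for url in urls:
        host = hostname_of(url)
        if host not in cache:
            cache[host] = resolves(host)
        if cache[host]:
            kept.append(url)
    return kept
```

Caching per domain matters here since blocklist data tends to have many URLs per domain, so we'd only pay one DNS lookup each.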