ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Ensure quota resets work with server quotas #50

Open anjackson opened 4 years ago

anjackson commented 4 years ago

We switched to host quotas to make resetting easier, but because all DNS requests get allocated to a host label of dns:, this means we can run out of DNS quota!

So, it is better to switch back to server quotas, but this needs to address the earlier problems that arose because the seed and/or its pre-requisites redirect to a different server. For example, if we have a seed of http://example.org/, we may get a P http://example.org/robots.txt and then a PR https://example.org/robots.txt, which then gets blocked by the quota for example.org:443 long before even the example.org:80 quota gets reset.
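To make the mismatch concrete, here is a minimal sketch of how a host-level key and a server-level key differ for the two robots.txt URLs above. The class and method names (QuotaKeys, hostKey, serverKey) are hypothetical and the key derivation is an assumption for illustration only; it is not Heritrix's own code.

```java
import java.net.URI;

public class QuotaKeys {
    // Host-level key (illustrative): just the hostname, shared across schemes and ports.
    static String hostKey(String uri) {
        return URI.create(uri).getHost();
    }

    // Server-level key (illustrative): hostname plus the explicit or scheme-default port.
    static String serverKey(String uri) {
        URI u = URI.create(uri);
        int port = u.getPort();
        if (port == -1) {
            // No explicit port, so fall back to the default for the scheme.
            port = "https".equals(u.getScheme()) ? 443 : 80;
        }
        return u.getHost() + ":" + port;
    }

    public static void main(String[] args) {
        String seedRobots = "http://example.org/robots.txt";
        String redirected = "https://example.org/robots.txt";
        // Same host key for both URLs...
        System.out.println(hostKey(seedRobots) + " / " + hostKey(redirected));
        // ...but two different server keys, so two separate quotas.
        System.out.println(serverKey(seedRobots) + " / " + serverKey(redirected));
    }
}
```

Under these assumptions, resetting the quota tracked for the seed's server leaves the quota for the redirect target's server untouched, which is the situation described above.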

An idea would be to propagate the resetQuotas annotation to prerequisites including via redirects. But the preconditions system works differently than the usual link extraction, skipping the rest of the fetch chain, so it'll have to be handled elsewhere.

The fullVia is set when the candidate chain is run, but are pre-requisites passed through the candidate chain? Ah, yes, getPrerequisiteUri is only called in CandidatesProcessor which calls runCandidateChain on it.

So, the simplest approach is to add a processor to the candidate chain that checks the via and propagates any resetQuotas annotation to pre-requisites or redirects.
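The propagation rule could be sketched roughly as follows. This is a self-contained model, not a Heritrix processor: the Candidate class, its fields, and the propagate method are hypothetical stand-ins, and the only things taken from the discussion are the resetQuotas annotation name and the P (pre-requisite) and R (redirect) hop codes.

```java
import java.util.HashSet;
import java.util.Set;

public class ResetQuotasPropagation {
    static final String RESET_QUOTAS = "resetQuotas";

    // Hypothetical stand-in for a crawl URI: its URL, hop path, via and annotations.
    static class Candidate {
        final String uri;
        final String hopPath;   // e.g. "" for a seed, "P", "PR", "L"
        final Candidate via;    // the URI this candidate was discovered from
        final Set<String> annotations = new HashSet<>();

        Candidate(String uri, String hopPath, Candidate via) {
            this.uri = uri;
            this.hopPath = hopPath;
            this.via = via;
        }
    }

    // True when the last hop that led to this candidate was a pre-requisite or a redirect.
    static boolean isPrerequisiteOrRedirect(Candidate c) {
        if (c.hopPath.isEmpty()) {
            return false;
        }
        char lastHop = c.hopPath.charAt(c.hopPath.length() - 1);
        return lastHop == 'P' || lastHop == 'R';
    }

    // Copy resetQuotas from the via onto pre-requisites and redirects only.
    static void propagate(Candidate c) {
        if (c.via != null
                && c.via.annotations.contains(RESET_QUOTAS)
                && isPrerequisiteOrRedirect(c)) {
            c.annotations.add(RESET_QUOTAS);
        }
    }

    public static void main(String[] args) {
        Candidate seed = new Candidate("http://example.org/", "", null);
        seed.annotations.add(RESET_QUOTAS);
        Candidate robots = new Candidate("http://example.org/robots.txt", "P", seed);
        Candidate redirected = new Candidate("https://example.org/robots.txt", "PR", robots);
        Candidate link = new Candidate("http://example.org/about", "L", seed);

        // Candidates pass through the chain in discovery order.
        propagate(robots);
        propagate(redirected);
        propagate(link);

        System.out.println(robots.annotations.contains(RESET_QUOTAS));
        System.out.println(redirected.annotations.contains(RESET_QUOTAS));
        System.out.println(link.annotations.contains(RESET_QUOTAS));
    }
}
```

In this model the annotation travels hop by hop, so the PR robots.txt only inherits it because the P robots.txt was annotated first; ordinary links from the seed are left alone.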

anjackson commented 4 years ago

Okay, this needs verifying on the frequent crawl stream, but looks good in small-scale tests.