ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Ensure quotas are cleared properly #29

Closed: anjackson closed this issue 5 years ago

anjackson commented 5 years ago

The quota reset logic was not sufficient. It cleared the quota for a specific host, but failed to do the same for 'aliases'. Specifically, if the seed uses the http scheme but immediately redirects to https, the server quota for host:443 is not reset. This blocks the download of robots.txt, which in turn prevents anything else from being downloaded. Those URLs are silently discarded with a -61 status code; as we're enqueuing directly rather than via Kafka, they won't be noted as discarded unless we modify the Candidates Chain.
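To make the aliasing problem concrete, here is a minimal sketch of how a per-server quota key can differ between the http seed and its https redirect while a per-host key stays the same. It is illustrative Python, not Heritrix code: `host_key` and `server_key` are hypothetical helpers, and the rule that HTTPS keys carry a `:443` suffix is an assumption based on the `host:443` key mentioned above.

```python
from urllib.parse import urlsplit


def host_key(url: str) -> str:
    """Quota key shared by every scheme and port on the same host."""
    return urlsplit(url).hostname or ""


def server_key(url: str) -> str:
    """Per-server quota key (assumed rule: explicit port, else ':443' for https)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        return f"{host}:{parts.port}"
    # Assumption: default-port https URLs get a ':443' suffix, http URLs do not.
    return f"{host}:443" if parts.scheme == "https" else host


seed = "http://example.org/"
redirect = "https://example.org/"

# Clearing the quota under the seed's server key misses the redirect target,
# whereas a host-level key covers both.
print(server_key(seed), server_key(redirect))
print(host_key(seed), host_key(redirect))
```

Under these assumptions, a reset keyed on the seed's server key leaves the `:443` entry untouched, which matches the blocked robots.txt behaviour described above.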

Added logic in 2.3.5 both to clear the server quotas and to switch to host quotas.

NOTE that this does not cope with other aliases, e.g. www./www#./no-www, so perhaps we need to add some more logic? One option is to allow resetQuotas to be inherited through prerequisites or redirects from the seed. The simplest is perhaps just to report it clearly and use the aliases from W3ACT when we launch, which curators can then update as needed.

anjackson commented 5 years ago

It now seems clearing the FrontierGroup FetchStats was conflicting with the tallies, so I've taken that out. Only the server or host stats are cleared now.

anjackson commented 5 years ago

This appears to be working. The remaining issue is that, although we can force seeds to refresh, if the seed does e.g. an HTTP-to-HTTPS redirect, the HTTPS URL is not refreshed on the shorter time-scale but falls back on the sheet refresh frequency, which is still just an hour of delay tolerance.

If we make it much shorter, e.g. 12 hours, then pages will likely be revisited and the revisits will creep forward. Seed resets only reset the seed, not the rest. It's unclear how well this will work.

An alternative implementation would be to add a propagating annotation that is the launch date. If the discovered URL has not been visited since the launch date, it's accepted. For 'refresh' we could use the same mechanism but NOT inherit the launch date to derived URLs. This would constantly 'reset the clock'.
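The launch-date idea can be sketched roughly as follows. This is illustrative Python under stated assumptions, not the Heritrix implementation: `Candidate`, `should_accept` and `derive` are hypothetical names, and the fallback for URLs without a launch date is reduced to a simple rejection where the real crawler would apply its usual recrawl rule.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Candidate:
    url: str
    launch_date: Optional[datetime] = None  # hypothetical propagated value


def should_accept(candidate: Candidate, last_visited: Optional[datetime]) -> bool:
    """Accept a URL that has not been visited since the launch date."""
    if last_visited is None:
        return True  # never seen before
    if candidate.launch_date is None:
        return False  # placeholder for the normal recrawl-frequency check
    return last_visited < candidate.launch_date


def derive(parent: Candidate, url: str, inherit: bool = True) -> Candidate:
    """Create a discovered URL; 'launch' inherits the date, 'refresh' does not."""
    return Candidate(url, parent.launch_date if inherit else None)


launch = datetime(2019, 1, 2)
seed = Candidate("http://example.org/", launch)

# Launch: the redirect target inherits the date and is re-fetched if stale.
child = derive(seed, "https://example.org/")
print(should_accept(child, datetime(2019, 1, 1)))

# Refresh: the date stops at the seed, so derived URLs follow the usual rule.
refresh_child = derive(seed, "https://example.org/", inherit=False)
print(should_accept(refresh_child, datetime(2019, 1, 1)))
```

The `inherit` flag is the only difference between the two modes in this sketch, which is what lets a refresh reset the clock on the seed without dragging every derived URL along.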

anjackson commented 5 years ago

This can't be done as an annotation, but an inheritable property holding the launch date can be done, and is implemented in 695f5b0a2e99e88290bbfe65d11eb2839329212e.

anjackson commented 5 years ago

Getting this sorted out, along with preventing prerequisites from being blocked by quotas, should mean that we can run for a while and show that the quota resets are working now.

anjackson commented 5 years ago

Quota clearing appears to be okay now. To keep things simple, the clearing system avoids touching the FrontierGroup quotas (to avoid potential race conditions), and we've switched to using Host instead of Server quotas, as this avoids the need to clear quotas for both the http and https versions of a given URL.