Closed anjackson closed 5 years ago
It now seems clearing the FrontierGroup FetchStats was conflicting with the tallies, so I've taken that out. Only the server or host stats are cleared now.
This appears to be working. The remaining issue is that, although we can force seeds to refresh, if the seed does e.g. an HTTP-to-HTTPS redirect, the HTTPS URL is not refreshed on the shorter time-scale. Instead it falls back on the sheet refresh frequency, which is still just an hour of delay tolerance.
If we make it much shorter, e.g. 12 hours, then pages will likely be revisited and the revisits will creep forward. Seed resets only reset the seed, not the rest. It's unclear how well this will work.
An alternative implementation would be to add a propagating annotation that holds the launch date. If the discovered URL has not been visited since the launch date, it's accepted. For 'refresh' we could use the same mechanism but NOT pass the launch date on to derived URLs. This would constantly 'reset the clock'.
This can't be done as an annotation, but an inheritable launch-date property can be done and is implemented in 695f5b0a2e99e88290bbfe65d11eb2839329212e.
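The launch-date rule can be sketched roughly as follows. This is a minimal Python illustration only, not the code from the commit above: the names `Candidate`, `should_accept` and `derive` are hypothetical, and it assumes the last-visit time of a URL is available from some lookup. It also does not model the regular sheet refresh frequency.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Candidate:
    """A URL waiting to be considered for crawling (hypothetical type)."""
    url: str
    launch_date: Optional[datetime] = None  # inheritable property; None if unset


def should_accept(candidate: Candidate, last_visited: Optional[datetime]) -> bool:
    """Accept a URL that has not been visited since its launch date."""
    if last_visited is None:
        return True  # never visited, so always accept
    if candidate.launch_date is None:
        # Assumption: without a launch date, an already-visited URL is left
        # to the normal refresh rules, which this sketch does not model.
        return False
    return last_visited < candidate.launch_date


def derive(parent: Candidate, url: str, inherit: bool = True) -> Candidate:
    """Create a discovered URL, optionally inheriting the parent's launch date."""
    return Candidate(url, parent.launch_date if inherit else None)
```

With `inherit=True` a redirect target such as the HTTPS version of a seed carries the same launch date as the seed, so it is accepted if it was last visited before the launch. With `inherit=False` only the seed itself is affected.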
Getting this sorted out, along with preventing prerequisites from being blocked by quotas, should mean that we can run for a while and show that the quota resets are working now.
Quota clearing appears to be okay now. To keep things simple, the clearing system avoids touching the `FrontierGroup` quotas (to avoid potential race-conditions) and we've switched to using `Host` instead of `Server` quotas, as this avoids the need to clear quotas for both the http and https versions of a given URL.
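To illustrate why host-level quotas are simpler here, the sketch below derives a host key and a server key from a URL. It is an assumption modelled on the `host:443` form mentioned in this thread, not Heritrix's actual key logic, and the function names `host_key` and `server_key` are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def host_key(url: str) -> str:
    """Key shared by every scheme and port on the same host."""
    return urlsplit(url).hostname or ""


def server_key(url: str) -> str:
    """Key that distinguishes ports (assumed form: bare host for plain
    http on port 80, otherwise host:port)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None or (parts.scheme == "http" and port == 80):
        return host
    return f"{host}:{port}"
```

Under these assumptions `http://example.org/` and `https://example.org/` share one host key but have two different server keys, so a reset keyed on the server has to clear both.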
The quota reset logic was not sufficient. It cleared the quota for a specific host, but failed to do the same for 'aliases'. Specifically, if the seed uses the `http` scheme but immediately redirects, the server quota for `host:443` has not been reset. This blocks the download of the `robots.txt`, which in turn prevents anything else from being downloaded. The URLs are silently discarded with a `-61` status code; as we're enqueuing directly rather than via Kafka, they won't be noted as `discarded` unless we modify the Candidates Chain.

Added logic to 2.3.5 both to clear the server quotas and to switch to host quotas.
NOTE that this does not cope with other aliases, e.g. www./www#./no-www, so perhaps we need to add some more logic? One option is to allow `resetQuotas` to be inherited through pre-requisites or redirects from the seed. The simplest is perhaps just to clearly report it and use the aliases from W3ACT when we launch, which curators can then update as needed.
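As a rough sketch of the "report it and use the aliases" option, the Python below clears host tallies for a seed plus a supplied list of alias URLs and returns what was cleared so it can be reported. Everything here is hypothetical: the `tallies` dictionary stands in for the crawler's quota state, the function names are invented, and the aliases are assumed to arrive as full URLs from whatever source provides them.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def hosts_to_reset(seed_url: str, aliases: Iterable[str] = ()) -> List[str]:
    """Collect the distinct hosts of the seed and its alias URLs, in order."""
    hosts: List[str] = []
    for url in [seed_url, *aliases]:
        host = urlsplit(url).hostname  # None if the URL has no scheme/host
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def reset_quotas(tallies: Dict[str, int], seed_url: str,
                 aliases: Iterable[str] = ()) -> List[str]:
    """Remove the tallies for the seed host and its aliases.

    Returns the hosts whose tallies were actually cleared, for reporting.
    """
    cleared: List[str] = []
    for host in hosts_to_reset(seed_url, aliases):
        if tallies.pop(host, None) is not None:
            cleared.append(host)
    return cleared
```

For example, resetting `http://example.org/` with the alias `https://www.example.org/` clears the tallies for both `example.org` and `www.example.org` and leaves unrelated hosts untouched.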