ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Quota resets are not working because sheet association was broken for HTTPS #26

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

It seems the quota-clearing is not working. We see:

INFO: uk.bl.wap.crawler.frontier.KafkaUrlReceiver setSheetAssociations Setting sheets for https://(com,fourfourtwo,www,)/ to [recrawl-1day] [Sat Jan 19 09:00:43 GMT 2019]
INFO: uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageFrontierScheduler run Adding seed to crawl: https://www.fourfourtwo.com/ [Sat Jan 19 09:00:43 GMT 2019]
INFO: uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageFrontierScheduler resetQuotas Clearing down quota stats for https://www.fourfourtwo.com/ [Sat Jan 19 09:00:43 GMT 2019]

but then the crawl log shows the URL still being rejected with status -5003 (blocked by quota):

2019-01-19T09:00:43.851Z -5003          - https://www.fourfourtwo.com/ - https://www.fourfourtwo.com/ unknown #262 - - tid:65838:https://www.fourfourtwo.com/ Q:serverMaxSuccessKb {}

anjackson commented 5 years ago

This should not be a race condition, but a threading issue cannot be ruled out yet.

anjackson commented 5 years ago

A second problem arose: HTTPS URLs were not being consistently coerced to HTTP when setting up the sheets, so sheets were not being applied to HTTPS URLs (e.g. their recrawl frequency stayed annual).
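
As a rough illustration of the coercion described above, here is a minimal sketch in plain Java. It is not the actual ukwa-heritrix code: the class `SheetKeySketch` and the methods `coerceToHttp` and `toSheetKey` are hypothetical names, and the SURT-style formatting is hand-rolled to match the shape seen in the log line above rather than taken from the Heritrix libraries. The idea is only that both schemes should map to the same key before a sheet is associated.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SheetKeySketch {

    // Hypothetical helper: rewrite an https:// URL to http:// so that
    // both schemes end up with the same sheet association key.
    static String coerceToHttp(String url) {
        if (url.regionMatches(true, 0, "https://", 0, 8)) {
            return "http://" + url.substring(8);
        }
        return url;
    }

    // Hypothetical helper: build a SURT-style prefix such as
    // http://(com,example,www,)/ from a URL, after scheme coercion.
    static String toSheetKey(String url) {
        URI uri = URI.create(coerceToHttp(url));
        String host = uri.getHost().toLowerCase();
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels); // www.example.com -> com, example, www
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return uri.getScheme() + "://(" + String.join(",", labels) + ",)" + path;
    }

    public static void main(String[] args) {
        // Both schemes should print the same key.
        System.out.println(toSheetKey("https://www.fourfourtwo.com/"));
        System.out.println(toSheetKey("http://www.fourfourtwo.com/"));
    }
}
```

With a key built this way, a sheet such as `recrawl-1day` would be associated with the same prefix whether the seed arrived as HTTP or HTTPS.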

anjackson commented 5 years ago

Quota resets appear to have been working since commit fe2b35abd290616095d2ae1bc399c72854cbc8a1, but we need to review this again once the HTTPS/HTTP sheet assignment is fixed.

anjackson commented 5 years ago

Yes, this looks good now. The resetting was fine, but the sheet association was broken for HTTPS.