ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.
Quotas reset on restart #76

Open anjackson opened 2 years ago

anjackson commented 2 years ago

Because the higher quotas are set via launch events, they are lost when the crawler restarts. As a result, the crawler is currently discarding many URLs with status -5003 (blocked by quota). These would likely have been downloaded eventually otherwise.

This is an example of why crawl configuration should be handled differently. For example, the tocrawl topic should be compacted against a per-seed key, and the whole topic re-read each time the crawler starts, so that the configuration is always up to date.
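
To illustrate the idea, here is a minimal sketch of what re-reading a per-seed-keyed topic on startup could look like. It does not use the crawler's actual code or any messaging client: the record shape, the `quota_bytes` field, and the `rebuild_seed_config` helper are all hypothetical, and the topic is simulated with an in-memory list. The only point is that replaying every record and keeping the latest value per seed restores the raised quotas after a restart.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record shape: (seed URL used as the key, config or None).
# A None value acts as a tombstone that removes the seed's configuration.
Record = Tuple[str, Optional[dict]]


def rebuild_seed_config(records: Iterable[Record]) -> Dict[str, dict]:
    """Replay all records in order, keeping only the latest value per seed."""
    config: Dict[str, dict] = {}
    for seed, value in records:
        if value is None:
            config.pop(seed, None)  # tombstone: forget this seed
        else:
            config[seed] = value    # later records override earlier ones
    return config


# Simulated topic contents (assumed data, for illustration only).
topic = [
    ("https://example.org/", {"quota_bytes": 500_000_000}),
    ("https://example.com/", {"quota_bytes": 500_000_000}),
    ("https://example.org/", {"quota_bytes": 2_000_000_000}),  # raised quota
]

# On every start the whole topic is re-read, so the raised quota survives.
config_after_restart = rebuild_seed_config(topic)
```

Because the state is derived entirely from the topic, a restart needs no extra bookkeeping, and compaction only has to retain the most recent record for each seed key.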