ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Issues with launches causing NPE, and WebRender failures not falling back on H3? #52

Open anjackson opened 4 years ago

anjackson commented 4 years ago

Couple of issues.

Firstly, observing some:

java.lang.NullPointerException
        at uk.bl.wap.crawler.prefetch.QuotaResetPropagationProcessor.shouldProcess(QuotaResetPropagationProcessor.java:22)
        at org.archive.modules.Processor.process(Processor.java:140)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:176)
        at uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageFrontierScheduler.run(KafkaUrlReceiver.java:545)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

So presumably there's some basic bug there.

Also, seeing some 'skipped' days for the odd seed, correlated with failed WebRender events. This implies failed WebRender events are not falling back on H3 correctly.

Finally, it appears we have so many Twitter seeds that we're not getting through them all in a day! Need to use a sheet with parallel queues set up. See https://github.com/internetarchive/heritrix3/commit/bee53de8b123590e1163c5f338c4318f3349e8d2

anjackson commented 4 years ago

Some fixes in place, but can't test yet due to networking problems.

anjackson commented 4 years ago

Okay, built and home and the changes so far seem solid enough! Made some updates, and tagged a 2.7.0-BETA-3 to roll out on crawler05 before tomorrow morning.

anjackson commented 4 years ago

Note also that changes were made in webrender-api to make it try harder/wait longer when contacting Docker to run screenshotters. That was the root cause of many of the gaps, as the DNS + robots.txt + 3 x WebRender attempts meant the crawl ran out of retries (only 5 were allowed, now upped to 10).
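To make the retry arithmetic concrete, here is a minimal sketch, not the Heritrix implementation. It assumes each of the five steps listed above consumes one attempt from a single per-URL budget, and that a URL is dropped once the count reaches the limit. The class and method names (`RetryBudget`, `survives`) are hypothetical.

```java
import java.util.List;

// Hypothetical model of a per-URL retry budget; not actual Heritrix code.
final class RetryBudget {

    // Assumption: every step consumes one attempt, and the URL is
    // abandoned once the number of attempts reaches maxRetries.
    static boolean survives(List<String> consumedSteps, int maxRetries) {
        return consumedSteps.size() < maxRetries;
    }

    public static void main(String[] args) {
        // DNS + robots.txt + 3 x WebRender, as described in the comment above.
        List<String> steps = List.of(
                "dns", "robots.txt", "webrender", "webrender", "webrender");

        // With the old limit of 5 nothing is left for an H3 fallback fetch.
        System.out.println("limit 5:  " + survives(steps, 5));
        // With the new limit of 10 there are attempts to spare.
        System.out.println("limit 10: " + survives(steps, 10));
    }
}
```

Under these assumptions the old limit is exhausted exactly by the prerequisite and WebRender steps, which matches the gaps seen for the affected seeds.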

anjackson commented 4 years ago

Okay, looking solid. Tomorrow, I'll analyse today's crawl activity and check things have improved, but it's looking good.

anjackson commented 4 years ago

Definite improvement, but some other issues around launching came up (not all seeds marked as seeds, launch failures not caught). Twitter overload remains an issue, and we'll still need parallel queues at some point.

anjackson commented 4 years ago

Now seen that lodging a connect_failed result in OutbackCDX means that if WebRender fails entirely, the URL is never recrawled, because it's considered 'recently seen'. Modifying the OutbackCDX Persist Store hook to ignore transient HTTP errors.
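The kind of filter intended can be sketched as follows. This is an illustration only, not the actual hook: the class name `PersistFilter`, the method `shouldPersist`, and the choice of which status codes count as transient are all assumptions. The negative values follow the Heritrix fetch status convention, where -2 is a failed connection.

```java
import java.util.Set;

// Illustrative sketch; names and the transient set are assumptions,
// not the real OutbackCDX persist hook.
final class PersistFilter {

    // Assumed transient outcomes, using Heritrix-style negative status codes:
    // -2 connect failed, -3 connection lost, -4 timeout.
    static final Set<Integer> TRANSIENT_STATUSES = Set.of(-2, -3, -4);

    // Returns false for transient failures so that they are never recorded,
    // leaving the URL eligible for recrawl instead of 'recently seen'.
    static boolean shouldPersist(int fetchStatus) {
        return !TRANSIENT_STATUSES.contains(fetchStatus);
    }

    public static void main(String[] args) {
        System.out.println("200 persisted: " + shouldPersist(200));
        System.out.println("404 persisted: " + shouldPersist(404));
        System.out.println("-2 persisted:  " + shouldPersist(-2));
    }
}
```

Real responses such as 200 or 404 are still recorded, so the 'recently seen' check keeps working for URLs that were actually fetched.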