Open anjackson opened 4 years ago
Some fixes in place, but can't test yet due to networking problems.
Okay, built and home and the changes so far seem solid enough! Made some updates, and tagged a 2.7.0-BETA-3 to roll out on crawler05 before tomorrow morning.
Note also that changes were made in webrender-api to make it try harder/wait longer when contacting Docker to run screenshotters. That was the root cause of many of the gaps, as the DNS + robots.txt + 3 x WebRender meant the crawl ran out of retries (only 5 were allowed, now upped to 10).
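For anyone following along, here is a rough sketch of the "try harder/wait longer" idea. It is not the actual webrender-api code: `call_with_retries`, its parameters and the exponential back-off are assumptions made for illustration, and only the attempt limit of 10 comes from the comment above.

```python
import time


def call_with_retries(action, max_attempts=10, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is used up.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts, so later attempts give the backend longer to recover.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With a limit of 5, an action that fails six times in a row still surfaces as an error, whereas a limit of 10 gives it room to succeed on the seventh attempt.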
Okay, looking solid. Tomorrow, I'll analyse today's crawl activity and check things have improved, but it's looking good.
Definite improvement, but some other issues around launching came up (not all seeds marked as seeds, launch failures not caught). Twitter overload remains an issue, and we'll still need parallel queues at some point.
Now seen that logging connect_failed in OutbackCDX means that if the WebRender fails entirely, the URL is never recrawled because it's considered 'recently seen'. Modifying the OutbackCDX Persist Store hook to ignore HTTP transient errors.
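A minimal sketch of the intended behaviour, assuming an in-memory stand-in for the store. `SeenStore`, `TRANSIENT_OUTCOMES` and the outcome labels other than connect_failed are hypothetical names, not the real Persist Store hook or the OutbackCDX API.

```python
import time

# Hypothetical outcome labels; only "connect_failed" appears in this thread.
TRANSIENT_OUTCOMES = {"connect_failed", "connect_lost", "timeout"}


class SeenStore:
    """Toy stand-in for the 'recently seen' index."""

    def __init__(self):
        self._last_seen = {}

    def persist(self, url, outcome, now=None):
        """Record the URL unless the outcome was a transient failure."""
        if outcome in TRANSIENT_OUTCOMES:
            # Skip the record so the URL stays eligible for recrawl.
            return False
        self._last_seen[url] = time.time() if now is None else now
        return True

    def recently_seen(self, url, max_age_seconds, now=None):
        """True if the URL was recorded less than max_age_seconds ago."""
        now = time.time() if now is None else now
        recorded = self._last_seen.get(url)
        return recorded is not None and now - recorded < max_age_seconds
```

The point is that a failed render leaves no record behind, so the recrawl check still treats the URL as due.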
Couple of issues.
Firstly, observing some:
So presumably there's some basic bug there.
Also, seeing some 'skipped' days for the odd seed. Correlated with failed WebRender events. Implies failed WebRender events are not falling back on H3 correctly.
Finally, it appears we have so many Twitter seeds, we're not getting through them all in a day! Need to use a sheet with parallel queues setup. See https://github.com/internetarchive/heritrix3/commit/bee53de8b123590e1163c5f338c4318f3349e8d2
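To illustrate what parallel queues buy us, here is a sketch of the general idea only, not Heritrix's implementation from the commit above. `queue_key`, the `#bucket` suffix and the SHA-1 bucketing are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def queue_key(url, parallel_queues=1):
    """Return the name of the queue a URL would be assigned to.

    With parallel_queues <= 1 every URL for a host shares one queue.
    Otherwise URLs are spread over N sub-queues by hashing the full URL,
    so a host with many seeds is no longer worked through one at a time.
    """
    host = urlsplit(url).hostname or ""
    if parallel_queues <= 1:
        return host
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % parallel_queues
    return f"{host}#{bucket}"
```

Hashing keeps the assignment stable between launches, so the same seed always lands in the same sub-queue.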