internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

feat: Detect connection failures forwarded from warcprox and retry them with backoff #285

Open adam-miller opened 2 months ago

adam-miller commented 2 months ago

Brozzler had no real retry loop for failed pages. The one exception was a failed connection to warcprox, and that case retried in a tight loop with no delay. Warcprox connection failures and timeouts are returned to the browser as 502 and 504 status codes, so I now check for those and retry the page with backoff. This is accomplished by adding a retry_after field to the page in rethinkdb and adjusting the query for claiming a page accordingly. On its own, that change causes a tight loop on claiming the site, so I also add a delay there to avoid immediately attempting to claim a site that was just disclaimed.
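
For readers who want the idea without opening the diff, here is a minimal sketch of the retry_after mechanism. It is not the code in this PR: the helper names, the failed_attempts counter, the exponential schedule, and the base and cap values are all assumptions made for illustration, and a plain dict stands in for the page document in rethinkdb.

```python
import datetime

# Status codes treated as retryable, per the description above.
RETRYABLE_STATUSES = {502, 504}


def next_retry_after(failed_attempts, now, base_seconds=60, max_seconds=3600):
    """Earliest time the page may be claimed again.

    Assumption: exponential backoff, doubling per failure and capped at
    max_seconds. The description only says "backoff".
    """
    delay = min(base_seconds * 2 ** (failed_attempts - 1), max_seconds)
    return now + datetime.timedelta(seconds=delay)


def record_failure(page, status_code, now):
    """Set retry_after on the page if the status is retryable.

    Returns True when a retry was scheduled, False otherwise.
    failed_attempts is a hypothetical counter, not a confirmed field.
    """
    if status_code not in RETRYABLE_STATUSES:
        return False
    page["failed_attempts"] = page.get("failed_attempts", 0) + 1
    page["retry_after"] = next_retry_after(page["failed_attempts"], now)
    return True


def is_claimable(page, now):
    """Stand-in for the extra condition in the page-claiming query."""
    retry_after = page.get("retry_after")
    return retry_after is None or retry_after <= now
```

With this shape, a page that just failed is skipped by the claim check until its retry_after has passed, which is what keeps page retries out of a tight loop. The separate delay on claiming a just-disclaimed site is not shown here.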