internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

Limit the number of times we try to navigate_to_page, then let it go #183

Closed: danielbicho closed this issue 4 years ago

danielbicho commented 4 years ago

http://www.esf.org/esf_article.php?language=0&domain=1&activity=1&article=319&page=1059

When I try to brozzle the page above, the job never finishes: it hangs on that page.

After checking other issues and trying to figure it out, it seems that the Page.interstitialShown event is never fired. I have seen other reports here mentioning inconsistencies in the firing of this event.

In this specific case, it seems that the original page returns a 404, and it is a resource loaded by the 404 Not Found page that requests auth. If I brozzle one of these resources directly, the Page.interstitialShown event is fired.

Nevertheless, shouldn't Brozzler limit the number of times it tries navigate_to_page and then let the page go? Or at least offer that as an option?

nlevitt commented 4 years ago

Thanks for the report. I filed a chromium issue about the Page.interstitialShown bug: https://bugs.chromium.org/p/chromium/issues/detail?id=1029878

As for your idea for working around this problem, I think it's a very good idea. We've had a number of cases where a page kept failing for one reason or another, and it's bad. We can end up with tons of duplicate captures, the crawl is not able to make progress, and the overall performance of the cluster is impacted in cases like yours, where a browser is sitting there doing nothing for five minutes. You've probably experienced these impacts too.

I think the right way to implement the improvement is broader than what you have here, because we want to handle any kind of failure, not just ones that manifest in navigate_to_page(). I think what we should do is add a field, Page.failed_attempts or something like that, to the rethinkdb model; increment it when we catch an "unexpected" exception in BrozzlerWorker.brozzle_site(), up to some limit (I'm thinking 3); and when we reach that limit, consider the page completed, even though it was never brozzled to completion.
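A rough sketch of that idea, for illustration only. This is not brozzler's actual code: the Page class here is a plain stand-in for the rethinkdb model, and brozzle_with_limit, MAX_FAILED_ATTEMPTS, and the completed flag are hypothetical names. Only failed_attempts and the limit of 3 come from the proposal above.

```python
from dataclasses import dataclass

# Limit proposed in the comment above; the constant name is an assumption.
MAX_FAILED_ATTEMPTS = 3


@dataclass
class Page:
    """Stand-in for the page record; not brozzler's real rethinkdb model."""
    url: str
    failed_attempts: int = 0
    completed: bool = False  # hypothetical "done" marker


def brozzle_with_limit(page, brozzle_page):
    """Make one attempt at `page`, counting unexpected failures.

    Returns True if the page should now be treated as completed (it either
    succeeded or reached the failure limit), False if it should be retried.
    """
    try:
        brozzle_page(page)
    except Exception:
        # Any "unexpected" failure counts, not only navigation problems.
        page.failed_attempts += 1
        if page.failed_attempts < MAX_FAILED_ATTEMPTS:
            return False  # leave the page queued for another attempt
        # Limit reached: fall through and give up so the crawl can move on.
    page.completed = True
    return True


def always_hangs(page):
    """Simulates a page that fails on every attempt."""
    raise TimeoutError("timed out navigating to " + page.url)


page = Page("http://example.com/")
results = [brozzle_with_limit(page, always_hangs) for _ in range(MAX_FAILED_ATTEMPTS)]
print(results, page.failed_attempts, page.completed)
```

With a page that always fails, the first two attempts ask for a retry and the third marks the page completed, so the worker stops coming back to it.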

nlevitt commented 4 years ago

@danielbicho could you check if https://github.com/internetarchive/brozzler/pull/184 solves this for you?

danielbicho commented 4 years ago

@nlevitt I have just tried it and it solves the problem! And yeah, the broader approach makes way more sense! Thank you!

I'll close this, then!