danielbicho closed this issue 4 years ago
Thanks for the report. I filed a Chromium issue about the Page.interstitialShown bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=1029878
As for your idea for working around this problem, I think it's a very good idea. We've had a number of cases where a page kept failing for one reason or another, and it's bad. We can end up with tons of duplicate captures, the crawl is not able to make progress, and the overall performance of the cluster is impacted in cases like yours, where a browser is sitting there doing nothing for five minutes. You've probably experienced these impacts too.
I think the right way to implement the improvement is broader than what you have here, because we want to handle any kind of failure, not just ones that manifest in navigate_to_page(). I think what we should do is add a field Page.failed_attempts or something like that to the rethinkdb model; increment that when we catch an "unexpected" exception in BrozzlerWorker.brozzle_site(), up to some limit (I'm thinking 3); and when we reach that limit, consider the page completed, even though it was never brozzled to completion.
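The idea proposed above could be sketched roughly as follows. This is only an illustration of the counting logic, not brozzler's actual code: `PageRecord`, `brozzle_with_limit`, and `MAX_FAILED_ATTEMPTS` are hypothetical names, and the rethinkdb persistence is left out entirely.

```python
from dataclasses import dataclass

# Limit suggested in the comment above ("I'm thinking 3").
MAX_FAILED_ATTEMPTS = 3


@dataclass
class PageRecord:
    """Hypothetical stand-in for the page model; not brozzler's Page class."""
    url: str
    failed_attempts: int = 0
    completed: bool = False


def brozzle_with_limit(page, brozzle_page, max_failed_attempts=MAX_FAILED_ATTEMPTS):
    """Make one attempt at a page and record the outcome.

    `brozzle_page` is any callable that takes the page and raises on an
    unexpected failure. Returns True once the page counts as completed,
    either because the attempt succeeded or because the failure limit
    was reached.
    """
    try:
        brozzle_page(page)
    except Exception:
        # Count the failure; give up once the limit is reached so the
        # crawl can move on instead of retrying the same page forever.
        page.failed_attempts += 1
        if page.failed_attempts >= max_failed_attempts:
            page.completed = True
        return page.completed
    page.completed = True
    return True
```

With a callable that always raises, the first two calls return False and leave the page eligible for another attempt; the third call marks it completed even though it never succeeded.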
@danielbicho could you check if https://github.com/internetarchive/brozzler/pull/184 solves this for you?
@nlevitt I have just tried it and it solves the problem! And yeah, the broader approach makes way more sense! Thank you!
Will close this so!
http://www.esf.org/esf_article.php?language=0&domain=1&activity=1&article=319&page=1059
While trying to brozzle the above page the job never finishes and hangs trying to brozzle that page.
After checking other issues and trying to figure it out it seems that the Page.interstitialShown is never fired. I have seen other reports here mentioning inconsistencies with the firing of this event.
In this specific case, it seems that the original page returns a 404 and it is a resource loaded by the 404 Not Found page that is requesting auth. If I try to directly brozzle one of these resources, the Page.interstitialShown event is fired.
Nevertheless, shouldn't Brozzler limit the number of times it tries navigate_to_page and then let it go? Or maybe offer that as an option?