rebrowser / rebrowser-patches

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on demand.
https://rebrowser.net
361 stars 27 forks source link

Crash on FrameManager.#onFrameNavigated() #65

Open mmatferneu opened 1 week ago

mmatferneu commented 1 week ago

sometimes (probably because of ads doing shenanigans in the page) the frameId is not in the frame tree. When that happens inside FrameManager.#onFrameNavigated(), the call to _frameTree.getById() will fail, the frame becomes undefined and the code will crash when it reaches here

https://github.com/rebrowser/rebrowser-puppeteer-core/blob/d1579ee8f34293503dddbbda837fc5ae786fb2bc/src/cdp/FrameManager.ts#L496

Seems to happen quite often, but I wasn't able to create a 100% fool proof reproduction to post here. I think it is because it depends on what ad the page decides to load. Also, and this might be related, I'm not sure why, but very often it seems that an ad gets loaded into a new tab, as if my code had clicked on it. But my code never clicks on anything. It just does

for(const url of urlList)
{
    await  page.goto(url);
    // extract all links
}

the crash always happens during one of the goto calls. After the first one. It never seems to happen during the first one.

MaxLiebsch commented 3 days ago

Dear, in general I would avoid a for loop for opening pages. Pages can have insanely long loading times and even so everything should be async. You never know what all can go wrong in the process. A node process catch all and error recovery is a must.

For queue handling I would suggest a solution where you use something like PQueue or Puppeteer Cluster or you build a proper queue yourself.

I hope that helps. Happy coding.

mmatferneu commented 1 day ago

Thanks for the suggestions. I agree with all of them. So much, in fact, that I was already doing all of that.

Unfortunately none of that matters for this particular bug. Please re-read the first paragraph and look at the source code. The code assumes the frame variable will always be valid, which is not true and that is why the whole thing crashes.

It is exactly because pages can have an insanely long loading time that I do everything "synchronously", that is, I "await" on every async call. So, even though I'm loading pages inside a for loop, there is never more than one page being loaded at any time. Never. Only after a page is fully processed I go to the next one. All still in the same browser tab.

The problem is that some seems to load ads that can fake they've been clicked and suddenly I start seem new tabs being opened. And it seems that is what causes the crash to occur.

As for error handling, every single page is processed inside one big try-catch block. For some reason, this particular bug does not seems to throw an exception that could be caught in the catch block. Not sure if it is intended or not, but the moment rebrowser code tries to access the undefined variable, the whole process simply crash.

MaxLiebsch commented 1 day ago

You are right you won't catch those with try catch block since it is throw outside the scope of your function. That's why I suggested process catch all and handling. Probably I was not clear enough on that point.

process.on("uncaughtException".... 

It might feel like a hack but with puppeteer you will have these exceptions that bubble up in the event loop. If you find a better way let me know.

As for your initial question., do you have more code to spare and a page where it occurs otherwise might be difficult to spot any issue from your text only.