mattsse / chromiumoxide

Chrome Devtools Protocol rust API
Apache License 2.0

`<iframe>` elements stop page from loading #163

Closed escritorio-gustavo closed 1 year ago

escritorio-gustavo commented 1 year ago

I am trying to visit a page that uses Google's reCAPTCHA v3. The problem is that, for some reason, the request sent by the src attribute of the <iframe> element gets stuck as pending in the content download phase, causing page.goto to fail.

If, for any reason, the Rust code panics and the browser survives as a zombie process, everything immediately goes back to normal, i.e. the request resolves as soon as the Rust process crashes.

Below are screenshots of the problem and a minimal reproduction of the code I'm using:

use std::{path::Path, time::Duration};

use chromiumoxide::{Browser, BrowserConfig, BrowserFetcher, BrowserFetcherOptions};
use futures::StreamExt;

#[tokio::main]
async fn main() {
    let download_path = Path::new("./download");
    let _ = std::fs::create_dir_all(download_path);

    let fetcher_config = BrowserFetcherOptions::builder()
        .with_path(download_path)
        .build()
        .unwrap();

    let fetcher = BrowserFetcher::new(fetcher_config);

    let info = fetcher.fetch().await.unwrap();

    let config = BrowserConfig::builder()
        .chrome_executable(info.executable_path)
        .launch_timeout(std::time::Duration::from_secs(20))
        .with_head()
        .arg("--lang=en_US")
        .no_sandbox()
        .build()
        .unwrap();

    let (mut browser, mut handler) = Browser::launch(config).await.unwrap();

    let handle = tokio::spawn(async move {
        while let Some(h) = handler.next().await {
            if h.is_err() {
                break;
            }
        }
    });

    browser
        .new_page("https://esaj.tjms.jus.br/cpopg5/open.do")
        .await
        .unwrap();

    tokio::time::sleep(Duration::from_secs(50)).await;

    browser.close().await.unwrap();
    browser.wait().await.unwrap();
    handle.await.unwrap();
}

The weird thing is that the request does have a 200 HTTP status code, even though it's still pending.

escritorio-gustavo commented 1 year ago

This issue seems to be caused specifically by iframe elements. I tested it on this CodePen and obtained the same result: the page stops loading, the maps don't show up, and crashing the Rust process gets everything back to normal.

escritorio-gustavo commented 1 year ago

Please, if anyone can help, I really need this to work

escritorio-gustavo commented 1 year ago

Not sure yet, but it seems this issue doesn't happen in headless mode

chirok11 commented 1 year ago

@escritorio-gustavo If I understand your issue correctly, the problem is a bug in Chromium itself that chromiumoxide doesn't handle. You could try forking the repository and, in the file src/handler/target.rs, on line 521, changing "auto_attach" to false. Let me know if that solves the problem.
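For context, the auto_attach field mentioned above maps to the autoAttach parameter of the CDP command Target.setAutoAttach. The following is a minimal sketch of that message with the flag turned off, built by hand with std only. The helper name set_auto_attach_message is hypothetical and not part of chromiumoxide, and the values chosen for waitForDebuggerOnStart and flatten are assumptions for illustration.

```rust
/// Builds the JSON text of a CDP `Target.setAutoAttach` command.
/// Hypothetical helper for illustration only; chromiumoxide builds this
/// message through its generated CDP types rather than by hand.
fn set_auto_attach_message(id: u64, auto_attach: bool) -> String {
    format!(
        concat!(
            r#"{{"id":{},"method":"Target.setAutoAttach","params":{{"#,
            // Assumed values: not taken from the crate's source.
            r#""autoAttach":{},"waitForDebuggerOnStart":false,"flatten":true}}}}"#
        ),
        id, auto_attach
    )
}

fn main() {
    // With auto-attach off, Chrome does not attach the client to new
    // child targets such as iframes or service workers.
    println!("{}", set_auto_attach_message(1, false));
}
```

The sketch only shows which parameter is being toggled; whether disabling it has side effects elsewhere in the library is a separate question.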

escritorio-gustavo commented 1 year ago

@chirok11 This does indeed work, the pages open successfully even in non-headless mode, thank you so much.

Do you have any idea what this chromium bug is and why this auto attach causes it to happen?

Also, does disabling auto attach cause any other side effects (i.e. can I submit a pull request to fix this and close this issue without breaking the whole library)?

chirok11 commented 1 year ago

@escritorio-gustavo

This can be fixed, since other libraries like puppeteer and playwright work correctly with these flags. If you want to fix it, take a look at how other libraries handle target auto attach. I saw links to crbug.com yesterday, but the bug has been unresolved for a while. I am still investigating, and my current guess is that Chrome is not auto attaching to iframes or service workers, which is why they get stuck and why a disconnect resolves it. The client should either attach manually or detach from these targets.

escritorio-gustavo commented 1 year ago

@chirok11

Thank you so much, I've been dealing with this issue for a couple of months now (I've been switching between this crate and headless_chrome, which is FAR slower)

I think I'll submit a pull request that adds this fix as a feature using the cfg macro. Again, thanks a lot
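A minimal sketch of what gating the flag behind a Cargo feature could look like, using the cfg! macro. The feature name disable-auto-attach and the function auto_attach_enabled are hypothetical and are not taken from the actual pull request.

```rust
/// Whether the handler should ask Chrome to auto-attach to child targets.
/// `disable-auto-attach` is a hypothetical feature name used for
/// illustration; it is not the name used in the real pull request.
fn auto_attach_enabled() -> bool {
    // cfg! evaluates to a compile-time boolean, so auto-attach stays on
    // unless the crate is built with the feature enabled.
    !cfg!(feature = "disable-auto-attach")
}

fn main() {
    println!("auto_attach = {}", auto_attach_enabled());
}
```

Keeping the default behavior unchanged and making the workaround opt-in means existing users are not affected unless they enable the feature.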

escritorio-gustavo commented 1 year ago

Hey @chirok11, did you close #173 because of #170? If so, let me know and I'll close #170 instead, as it is just a workaround using the method you mentioned with setting auto_attach to false.

I have absolutely zero knowledge on the subject and I'm willing to bet your PR provides a way better solution

chirok11 commented 1 year ago

I am still trying to fix it the correct way. I don't have much knowledge either, but I'm doing my best. I forked the repository and have a branch, cdp-fx. Could you try to reproduce the issue on my branch? I kept auto attach enabled, but I now detach from service workers and added runIfWaitingForDebugger. Most broken sites load now, but I found a problem with some sites where the goto future always returns a timeout even though the website loaded successfully. Anyway, it looks like iframes should be handled manually. I am not sure this won't break working with iframes (though I really don't know whether this library can currently work with iframes at all).
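The approach described above (keep auto attach, resume paused targets, detach from service workers) can be sketched roughly as a decision made when a Target.attachedToTarget event arrives. This is an assumption about the general shape, not the code on the cdp-fx branch; the Action enum and actions_for function are hypothetical names.

```rust
/// CDP commands to send on a session right after `Target.attachedToTarget`.
#[derive(Debug, PartialEq)]
enum Action {
    /// `Runtime.runIfWaitingForDebugger`: let a paused target continue.
    Resume,
    /// `Target.detachFromTarget`: stop tracking this target.
    Detach,
}

/// Picks the follow-up commands from the `type` field of the target info.
/// Assumed policy: resume everything, then drop service workers.
fn actions_for(target_type: &str) -> Vec<Action> {
    match target_type {
        "service_worker" => vec![Action::Resume, Action::Detach],
        _ => vec![Action::Resume],
    }
}

fn main() {
    for ty in ["page", "iframe", "service_worker"] {
        println!("{ty}: {:?}", actions_for(ty));
    }
}
```

Resuming before detaching matters when targets are created paused: a target that is never resumed stays blocked, which matches the stuck requests reported at the top of the thread.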

escritorio-gustavo commented 1 year ago

"I forked the repository and have a branch, cdp-fx. Could you try to reproduce the issue on my branch?"

Sure thing, I'll start testing right now.

In the meantime I'll leave #170 open. Once you figure out how to fix the problem, you can simply delete the feature flag I added.

chirok11 commented 1 year ago

https://github.com/chirok11/chromiumoxide/tree/cdp-fx This is the tree. Note that I also changed Browser::connect and added a HandlerConfig argument (to provide a custom request timeout).

escritorio-gustavo commented 1 year ago

I've tested a couple of pages that used to give me trouble and things worked fine. Haven't bumped into the goto problem yet.

Do you have an example URL?

escritorio-gustavo commented 1 year ago

Also it seems like not all iframes will trigger the problem, for instance, in this version of the crate this W3Schools page loads fine, but this CodePen doesn't.

Both pages load with your fork though

chirok11 commented 1 year ago

They load, and your future finishes? I mean, goto doesn't fail with a timeout?

escritorio-gustavo commented 1 year ago

Found the goto problem on the MDN docs for iframe

This url causes a timeout with both your version and mine, and the unmodified crate has the original problem with the iframe getting stuck and the page never loading

chirok11 commented 1 year ago

Thanks! Will try to work on it. financialexpress.com also won't load and causes a timeout.

escritorio-gustavo commented 1 year ago

I noticed that the puppeteer references in your fork mention service workers being a problem. The MDN page doesn't seem to have any, but financialexpress does. Do you think this could be related?

chirok11 commented 1 year ago

I think there is something that links them, but it's definitely not Service Worker, as it is indeed not on MDN. There is something common between these two websites, but so far I only see an iframe as a similarity. I have a feeling that with certain website behavior (the presence of an iframe), the logic in CommandFuture breaks, and it simply does not receive a response that the page has loaded (even though it should, as the page is actually loaded correctly).

chirok11 commented 1 year ago

OK, the problem is in the frame lifecycle: some websites do not emit the "load" event (a bug in chromiumoxide's event handling? I don't know yet). If we change expected_lifecycle in src/handler/frame.rs#L617 to "networkIdle" instead of "load", those pages load successfully instead of failing on timeout.

chirok11 commented 1 year ago

Alright, I've figured it out. It turns out that the issue was, first of all, that check_lifecycle was waiting for a load event from the main frame and all child frames. However, on problematic sites, I encountered child frames with url: None, which will never emit a "load" event. Interestingly, if you increase the wait time to several seconds after creating the page and then navigate to the page, there's a chance that everything will go smoothly; so, the problem is intermittent. But if you remove the pause between creating the page and navigating, almost a hundred percent of the time there will be a timeout error. I've slightly modified the check_lifecycle function in src/handler/frame.rs; I've added the following conditions: lifecycle_events contains "load", or if frame.url is None, at least check for the presence of DOMContentLoaded.
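The modified condition described above can be sketched as follows. This is a standalone illustration, not the code in src/handler/frame.rs: the Frame struct, its fields, and the events helper are simplified assumptions.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a tracked frame (assumed shape).
struct Frame {
    url: Option<String>,
    lifecycle_events: HashSet<String>,
    children: Vec<Frame>,
}

/// Helper to build an event set from string literals.
fn events(names: &[&str]) -> HashSet<String> {
    names.iter().map(|n| n.to_string()).collect()
}

/// A frame counts as ready if it saw "load", or if it has no URL and
/// at least saw "DOMContentLoaded".
fn frame_ready(frame: &Frame) -> bool {
    let has = |name: &str| frame.lifecycle_events.contains(name);
    has("load") || (frame.url.is_none() && has("DOMContentLoaded"))
}

/// The main frame and every child frame must be ready.
fn check_lifecycle(frame: &Frame) -> bool {
    frame_ready(frame) && frame.children.iter().all(check_lifecycle)
}

fn main() {
    let child = Frame {
        url: None,
        lifecycle_events: events(&["DOMContentLoaded"]),
        children: vec![],
    };
    let main_frame = Frame {
        url: Some("https://example.com".to_string()),
        lifecycle_events: events(&["load"]),
        children: vec![child],
    };
    println!("navigation complete: {}", check_lifecycle(&main_frame));
}
```

With the stricter rule (every frame must see "load"), the URL-less child in this example would keep the navigation waiting until the timeout.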

I think there should be an option to choose whether to wait for "load," "networkIdle," or "networkAlmostIdle," similar to other libraries working with the Chrome DevTools Protocol.
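One possible shape for such an option, as a rough sketch. The WaitUntil enum and navigation_done function are hypothetical and do not exist in chromiumoxide; only the lifecycle event names come from the Chrome DevTools Protocol.

```rust
/// Which lifecycle event ends a navigation (hypothetical option type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum WaitUntil {
    Load,
    NetworkIdle,
    NetworkAlmostIdle,
}

impl WaitUntil {
    /// Name of the matching `Page.lifecycleEvent` event.
    fn event_name(self) -> &'static str {
        match self {
            WaitUntil::Load => "load",
            WaitUntil::NetworkIdle => "networkIdle",
            WaitUntil::NetworkAlmostIdle => "networkAlmostIdle",
        }
    }
}

/// True once the chosen event is among the events seen for the frame.
fn navigation_done(seen: &[&str], wait_until: WaitUntil) -> bool {
    seen.contains(&wait_until.event_name())
}

fn main() {
    // Events of a frame that never emitted "load".
    let seen = ["init", "DOMContentLoaded", "networkAlmostIdle", "networkIdle"];
    for option in [WaitUntil::Load, WaitUntil::NetworkIdle, WaitUntil::NetworkAlmostIdle] {
        println!("{:?}: {}", option, navigation_done(&seen, option));
    }
}
```

Letting the caller pick the event would make the "load" versus "networkIdle" trade-off explicit instead of hard-coding one of them in the handler.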

chirok11 commented 1 year ago

@escritorio-gustavo You could check the MDN links or others and confirm that they no longer fail with a timeout and that other pages aren't broken.

escritorio-gustavo commented 1 year ago

@chirok11

This worked: the newest version of your fork properly loads the MDN page, financialexpress, and the pages where I originally encountered the problem. I will close #170 in favor of this implementation.

escritorio-gustavo commented 1 year ago

Hey @chirok11, can you add "this will fix #163" to your PR's description to link it to this issue? That way, when it's accepted, the issue should be closed automatically.

I tried to do it as a comment, but GitHub will only create the link if it's in the PR description or in a commit message.

chirok11 commented 1 year ago

@escritorio-gustavo done.

escritorio-gustavo commented 1 year ago

Awesome! You did an awesome job! Thanks again for all the help with this issue

escritorio-gustavo commented 1 year ago

Btw, the issue #171 wasn't linked to the PR, as it requires the closes keyword in all issues to create the links, so you need to write "Closes #163 and closes #171"

chirok11 commented 1 year ago

Thank you, modified comment.