rust-headless-chrome / rust-headless-chrome

A high-level API to control headless Chrome or Chromium over the DevTools Protocol. It is the Rust equivalent of Puppeteer, a Node library maintained by the Chrome DevTools team.
MIT License

Running new tabs in parallel is always executing sequentially #376

Open elpiel opened 1 year ago

elpiel commented 1 year ago

My use case for this library is to generate multiple PDFs and send them via email. The problem is that no matter what I do, the browser always:

  1. creates a new tab
  2. waits for it to finish loading (I also explicitly call wait_until_navigated)
  3. calls print_to_pdf
  4. starts the process for the next PDF generation

Another issue I have right now is also related to the Browser. I keep it in the app state inside axum::State, but after generating some PDFs over a period of time it gives a timeout error, and no new tabs can be created afterwards:

[2023-02-17T20:22:44Z ERROR headless_chrome::browser] Got a timeout while listening for browser events (Chrome #Some(1079118))
[2023-02-17T20:22:44Z INFO  headless_chrome::browser] Finished browser's event handling loop
[2023-02-17T20:22:46Z ERROR headless_chrome::browser::transport] Transport loop got a timeout while listening for messages (Chrome #Some(1079118))
[2023-02-17T20:22:57Z ERROR pdf_results::send] Failed to generate pdf .....

masc-it commented 1 year ago

Hi @elpiel, can you please share your code? It would be useful to understand the flow of it (I see two distinct problems here).

About parallelism: during my own experiments I've noticed that it gets really slow as you handle more tabs in the same browser instance. Maybe we can create ad-hoc examples of parallel scraping and do some performance analysis? We could find potential bottlenecks. @Billy-Sheppard

Billy-Sheppard commented 1 year ago

I haven't had much experience with how the tab code works. Perhaps, in the meantime, it's generally good practice to use a new browser instance rather than many tabs.

I'm definitely open to merging in some examples that show potential bottlenecks.

adrian-pc-code commented 1 year ago

I'd like to chime in here. I ran three tests:

  1. I loaded 5 browsers with 2 tabs each and started navigating without waiting for navigation. Then, starting from the first tab, I waited for navigation and output a PDF for each tab: ~10-11 sec.
  2. I loaded 5 browsers with 2 tabs each, waited for navigation as each tab was created, and created a PDF: ~27-31 sec.
  3. I loaded 1 browser with 10 tabs, waited for navigation as each tab was created, and created a PDF: ~27-31 sec.

Conclusion: wait_until_navigated blocks, which is also evident in the code.
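The gap between the first timing and the other two can be reproduced with a toy model. The sketch below does not use headless_chrome at all: `FakeTab`, `navigate`, and both helper functions are hypothetical stand-ins, and a background thread that sleeps plays the role of a page load. It only illustrates why waiting on each tab as it is created adds the load times together, while starting every navigation first lets them overlap.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a browser tab: the "navigation" runs on a
/// background thread and clears `navigating` once the page has "loaded".
struct FakeTab {
    navigating: Arc<AtomicBool>,
}

impl FakeTab {
    /// Starts a simulated navigation and returns immediately.
    fn navigate(load_time: Duration) -> FakeTab {
        let navigating = Arc::new(AtomicBool::new(true));
        let flag = Arc::clone(&navigating);
        thread::spawn(move || {
            thread::sleep(load_time);
            flag.store(false, Ordering::SeqCst);
        });
        FakeTab { navigating }
    }

    /// Blocks the caller until the simulated navigation has finished.
    fn wait_until_navigated(&self) {
        while self.navigating.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
    }
}

/// Navigate and wait one tab at a time: the load times add up.
fn wait_as_created(tabs: usize, load_time: Duration) -> Duration {
    let start = Instant::now();
    for _ in 0..tabs {
        FakeTab::navigate(load_time).wait_until_navigated();
    }
    start.elapsed()
}

/// Start every navigation first, then wait: the load times overlap.
fn start_all_then_wait(tabs: usize, load_time: Duration) -> Duration {
    let start = Instant::now();
    let all: Vec<FakeTab> = (0..tabs).map(|_| FakeTab::navigate(load_time)).collect();
    for tab in &all {
        tab.wait_until_navigated();
    }
    start.elapsed()
}

fn main() {
    let load = Duration::from_millis(100);
    println!("wait as created:     {:?}", wait_as_created(5, load));
    println!("start all then wait: {:?}", start_all_then_wait(5, load));
}
```

In this model the first function takes roughly five load times and the second roughly one. Real tabs share a browser process, so the overlap will be less perfect than a sleeping thread suggests.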

Solution: expose Tab.navigating through a Tab.is_navigating method so we can loop through the browsers and their tabs to see which ones finish first. That way, we are not blocked while one tab navigates.

Use case: create a Vec of tabs, each with a reference to the browser it belongs to (creating tabs as needed). Each tab loads and renders asynchronously as soon as it is created (high volume). The app loops through all tabs to see which one finishes first, renders it to PDF, and closes the tab. Repeat.

Each browser is a concurrent PDF renderer and each tab is a concurrent navigator, i.e. x concurrent PDFs == x browsers, but the tabs are ready to go, loading asynchronously in the background.
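A minimal sketch of that polling loop, assuming a non-blocking check like the proposed `is_navigating`. Everything here is hypothetical and uses only the standard library: `FakeTab` simulates a tab whose load finishes on a background thread, and `drain_in_completion_order` is an illustrative name, not part of headless_chrome.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a tab that exposes a non-blocking status check.
struct FakeTab {
    id: usize,
    navigating: Arc<AtomicBool>,
}

impl FakeTab {
    /// Starts a simulated navigation in the background and returns immediately.
    fn navigate(id: usize, load_time: Duration) -> FakeTab {
        let navigating = Arc::new(AtomicBool::new(true));
        let flag = Arc::clone(&navigating);
        thread::spawn(move || {
            thread::sleep(load_time);
            flag.store(false, Ordering::SeqCst);
        });
        FakeTab { id, navigating }
    }

    /// Non-blocking check, modelled on the proposed `is_navigating` method.
    fn is_navigating(&self) -> bool {
        self.navigating.load(Ordering::SeqCst)
    }
}

/// Polls all pending tabs and handles each one as soon as it is ready.
/// Returns the tab ids in the order in which they were handled.
fn drain_in_completion_order(mut pending: Vec<FakeTab>) -> Vec<usize> {
    let mut finished = Vec::new();
    while !pending.is_empty() {
        pending.retain(|tab| {
            if tab.is_navigating() {
                true // still loading: keep it in the pending list
            } else {
                // A real program would print the PDF and close the tab here.
                finished.push(tab.id);
                false
            }
        });
        // Short pause so the loop does not spin at full CPU.
        thread::sleep(Duration::from_millis(5));
    }
    finished
}

fn main() {
    let tabs = vec![
        FakeTab::navigate(0, Duration::from_millis(300)),
        FakeTab::navigate(1, Duration::from_millis(100)),
        FakeTab::navigate(2, Duration::from_millis(200)),
    ];
    println!("handled in order: {:?}", drain_in_completion_order(tabs));
}
```

The loop never blocks on a single slow tab, so a fast page is handled as soon as it is ready regardless of where it sits in the list. The cost is a small polling delay and one pass over the pending tabs per tick.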

adrian-pc-code commented 1 year ago

Now I'm seeing that I have access to the same events that set the variable. This should work great without modification.

elpiel commented 1 year ago

It would be nice to improve on this blocking behaviour in the long run.

Thank you @adrian-pc-code for adding more details on the issue!

Also, if you have an example of how to improve this in the current setup (with the events you mentioned), I would love to see it.

inzanez commented 1 year ago

@adrian-pc-code Interesting. When I try your approach to convert many documents in a row, I get this error at some point: "Method call error -32602: No session with given id". I build the browser like this (so it should not be timeout-related):

    let options = LaunchOptionsBuilder::default()
        .sandbox(false)
        .idle_browser_timeout(std::time::Duration::from_secs(6000))
        .build()
        .expect("Failed to build browser options,...");

Any idea why that session id is gone?

mirsella commented 10 months ago

For people looking for a solution, you can use browser.new_context().new_tab() to create a new tab instead of creating a new browser instance (see https://github.com/rust-headless-chrome/rust-headless-chrome/issues/340#issuecomment-1312655186).