Open elpiel opened 1 year ago
Hi @elpiel, can you please share your code? It would be useful to understand the flow of it (I see two distinct problems here).
About parallelism: during my own experiments I've noticed that it gets really slow as you handle more tabs in the same browser instance. Maybe we can create ad-hoc examples of parallel scraping and some performance analysis? We could find potential bottlenecks. @Billy-Sheppard
I haven't had much experience with how the tab code works. Perhaps it's generally good practice, in the meantime, to use a new browser instance rather than many tabs.
I'm definitely open to merging in some examples that show potential bottlenecks.
I'd like to chime in here. I ran three tests:
Test 1: I loaded 5 browsers with 2 tabs each and started navigating without waiting for navigation. Then, starting from the first tab, I waited for navigation and output a PDF for each tab: ~10-11 sec.
Test 2: I loaded 5 browsers with 2 tabs each, waited for navigation as each tab was created, and created a PDF: ~27-31 sec.
Test 3: I loaded 1 browser with 10 tabs, waited for navigation as each tab was created, and created a PDF: ~27-31 sec.
Conclusion: wait_until_navigated blocks, as is evident in the code.
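To make the timing difference concrete, here is a minimal sketch that does not use headless_chrome at all. SleepTab is a hypothetical stand-in whose "page load" is just a sleep on a worker thread, and both function names are made up for illustration. It only shows why waiting right after each navigation adds the load times up, while starting every navigation first lets them overlap.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a tab: "navigation" is a sleep on a worker thread.
struct SleepTab {
    load: JoinHandle<()>,
}

impl SleepTab {
    /// Starts the simulated page load in the background and returns immediately.
    fn navigate(load_time: Duration) -> SleepTab {
        SleepTab {
            load: thread::spawn(move || thread::sleep(load_time)),
        }
    }

    /// Blocks the caller until the simulated page load has finished.
    fn wait_until_navigated(self) {
        self.load.join().expect("load thread panicked");
    }
}

/// Waits for each tab right after creating it, so the loads run one at a time.
fn wait_as_created(tab_count: usize, load_time: Duration) -> Duration {
    let start = Instant::now();
    for _ in 0..tab_count {
        SleepTab::navigate(load_time).wait_until_navigated();
    }
    start.elapsed()
}

/// Starts every navigation first and only then waits, so the loads overlap.
fn navigate_all_then_wait(tab_count: usize, load_time: Duration) -> Duration {
    let start = Instant::now();
    let tabs: Vec<SleepTab> = (0..tab_count)
        .map(|_| SleepTab::navigate(load_time))
        .collect();
    for tab in tabs {
        tab.wait_until_navigated();
    }
    start.elapsed()
}
```

With five simulated tabs, wait_as_created takes at least five load times, while navigate_all_then_wait should take roughly one. Real tabs share a browser process, so the real gap will likely be smaller than in this simulation.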
Solution: expose Tab.navigating through a Tab.is_navigating method so we can loop through browsers and their tabs to see which ones finish first. That way, we are not blocking while one tab navigates.
Use case: create a Vec of tabs, each with a reference to the browser it belongs to (creating tabs as needed). Each tab loads and renders asynchronously as soon as it is created (high volume). The app loops through all tabs to see which one finishes first, renders it to PDF, and closes the tab. Repeat.
Each browser is a concurrent PDF renderer and each tab is a concurrent navigator, i.e. x concurrent PDFs == x browsers, but the tabs are ready to go async in the background.
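A rough sketch of that polling loop, using only the standard library. FakeTab is a hypothetical stand-in, and is_navigating here is the proposed accessor, not an existing method of the crate. The flag is modeled on the Tab.navigating field mentioned above, under the assumption that it is a shared boolean cleared when the load finishes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a tab with a shared "still loading" flag.
struct FakeTab {
    id: usize,
    navigating: Arc<AtomicBool>,
}

impl FakeTab {
    /// Starts a simulated page load and returns immediately.
    fn navigate(id: usize, load_time: Duration) -> FakeTab {
        let navigating = Arc::new(AtomicBool::new(true));
        let flag = Arc::clone(&navigating);
        thread::spawn(move || {
            thread::sleep(load_time);
            flag.store(false, Ordering::SeqCst);
        });
        FakeTab { id, navigating }
    }

    /// Non-blocking check: the accessor proposed in this comment.
    fn is_navigating(&self) -> bool {
        self.navigating.load(Ordering::SeqCst)
    }
}

/// Polls all tabs and returns their ids in the order they finished loading.
/// A real loop would render the PDF and close the tab where the id is pushed.
fn drain_in_completion_order(mut tabs: Vec<FakeTab>, poll_every: Duration) -> Vec<usize> {
    let mut finished = Vec::new();
    while !tabs.is_empty() {
        // Keep the tabs that are still loading; record and drop the rest.
        tabs.retain(|tab| {
            if tab.is_navigating() {
                true
            } else {
                finished.push(tab.id);
                false
            }
        });
        thread::sleep(poll_every);
    }
    finished
}
```

Tabs that finish within the same poll interval come out in vector order, so a shorter poll_every gives a more faithful completion order at the cost of more wake-ups.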
Now I'm seeing that I already have access to the same events that set the variable. This should work great without modification.
It would be nice to improve on this blocking behaviour in the long run.
Thank you @adrian-pc-code for adding more details on the issue!
Also, if you have an example of how to improve this in the current setup (with the events you mentioned), I would love to see it.
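For what it's worth, the general shape of an event-driven version could look like the sketch below. This is not the headless_chrome event API: Loaded, navigate and render_as_loaded are hypothetical names, and a std::sync::mpsc channel stands in for whatever listener mechanism the crate exposes. The point is only that the consumer handles tabs in the order their "finished loading" events arrive, without polling or blocking on any single tab.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical event, loosely modeled on a "page finished loading" notification.
#[derive(Debug)]
struct Loaded {
    tab_id: usize,
}

/// Simulates a tab whose load runs in the background and reports completion as an event.
fn navigate(tab_id: usize, load_time: Duration, events: Sender<Loaded>) {
    thread::spawn(move || {
        thread::sleep(load_time);
        // The receiver may already be gone; ignoring the error keeps the sketch short.
        let _ = events.send(Loaded { tab_id });
    });
}

/// Starts all loads, then handles tabs in the order their events arrive.
fn render_as_loaded(load_times: &[Duration]) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    for (tab_id, load_time) in load_times.iter().enumerate() {
        navigate(tab_id, *load_time, tx.clone());
    }
    // Drop the original sender so the loop below ends once every tab has reported.
    drop(tx);

    let mut rendered = Vec::new();
    for event in rx {
        // A real handler would export the PDF for this tab and close it here.
        rendered.push(event.tab_id);
    }
    rendered
}
```

Compared with a polling loop, the receiver sleeps until an event arrives, so there is no poll interval to tune.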
@adrian-pc-code Interesting. When I try your approach, converting many documents in a row eventually fails with this error: "Method call error -32602: No session with given id". I build the browser like this (so it should not be timeout related):
let options = LaunchOptionsBuilder::default()
    .sandbox(false)
    .idle_browser_timeout(std::time::Duration::from_secs(6000))
    .build()
    .expect("Failed to build browser options,...");
Any idea why that session id is gone?
For people looking for a solution, you can use browser.new_context().new_tab() to create a new tab instead of creating a new browser instance (see https://github.com/rust-headless-chrome/rust-headless-chrome/issues/340#issuecomment-1312655186).
My case of using this library is to generate multiple PDFs and send them via Email. The problem that I have is that no matter what I do, the browser always:
wait_until_navigated)
Another issue that I have right now is also related to the Browser. I keep it in an App state inside axum::State, but after generating some PDFs, after a bit of time it gives a timeout error and no new tabs can be generated afterwards: