mattsse / chromiumoxide

Chrome Devtools Protocol rust API
Apache License 2.0

Does this support waitUntil: networkidle? #36

Open leaty opened 3 years ago

leaty commented 3 years ago

I saw references to NetworkIdle in the source, and I wonder whether waiting until the network has been idle for X amount of time is supported yet. This is a huge benefit of puppeteer over webdriver, especially for JS web crawlers, or just to simplify actions on dynamic content. If it is supported and I'm not blind/stupid, examples or documentation on this particular feature would really help boost this library, I believe! :)

mattsse commented 3 years ago

I'm not exactly sure I understood you correctly. You want to add a custom delay on top of a navigation request? Right now there are Page::wait_for_navigation_response and Page::wait_for_navigation, which both return as soon as the networkidle event is received. If you want to wait even longer, you would have to delay it yourself, with a futures_timer::Delay for example.

But I believe a special Page::waitUntil(event:&str) method that takes in the event identifier would be a good addition. Is that what you were looking for?

leaty commented 3 years ago

Sorry, I should've been more specific. I was asking about something akin to the waitUntil argument in puppeteer's waitForNavigation, see link to docs.

Although I don't particularly appreciate the closed set of events in puppeteer, they're good enough. Take networkidle0 as an example, it would wait until there are 0 network connections for at least 500ms, while networkidle2 would wait the same amount but allow 2 active network connections. If that exists already, I've missed it.

If at all possible however, what I'm really out for is the ability to wait until there are X amount of active requests, for Y amount of time. If there's a way to get how many network connections there are I could implement the wait myself.

As a huge plus it would be really awesome if one could get a list of active network requests instead of just a count, with e.g. their URL, and/or its type of request- if known, like if it's XHR. This way you could also decide whether waiting for e.g. a stylesheet is really necessary; in my particular use case I'd ignore it, because I only need to wait for dynamic content, i.e. ajax requests. I have no idea if this is possible at all, though.

Sidenote: I couldn't find Page::wait_for_navigation_response in the docs.

mattsse commented 3 years ago

Right now wait_for_navigation waits until the frame is loaded, same as load in puppeteer. It would be fairly easy to add the same options as puppeteer's waitForNavigation function (networkidle0, networkidle2), because we only check whether the frame has already received the corresponding lifecycle events.

> Take networkidle0 as an example, it would wait until there are 0 network connections for at least 500ms, while networkidle2 would wait the same amount but allow 2 active network connections. If that exists already, I've missed it.

If I read the puppeteer source code right, these are just aliases for lifecycle events: ['networkidle0', 'networkIdle'] and ['networkidle2', 'networkAlmostIdle'] (see LifecycleWatcher.ts).
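To make the alias idea concrete, here is a minimal sketch of such a lookup in Rust. The function name `lifecycle_event_for` is hypothetical and not part of chromiumoxide. The `load` and `domcontentloaded` rows are not mentioned in this thread; they are included on the assumption that they follow the same table in puppeteer.

```rust
/// Map a puppeteer-style `waitUntil` value to the name of the lifecycle
/// event it stands for. Returns `None` for values that are not known.
/// Hypothetical helper for illustration only.
pub fn lifecycle_event_for(wait_until: &str) -> Option<&'static str> {
    match wait_until {
        // Assumed rows, not discussed in the thread.
        "load" => Some("load"),
        "domcontentloaded" => Some("DOMContentLoaded"),
        // The two aliases quoted above.
        "networkidle0" => Some("networkIdle"),
        "networkidle2" => Some("networkAlmostIdle"),
        _ => None,
    }
}
```

With a table like this, a navigation wait only has to check whether the frame has already seen the mapped lifecycle event.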

> If at all possible however, what I'm really out for is the ability to wait until there are X amount of active requests, for Y amount of time. If there's a way to get how many network connections there are I could implement the wait myself.

While it would be possible to get access to all the current active HTTPRequests in the [Networkmanager](https://github.com/mattsse/chromiumoxide/blob/main/src/handler/network.rs#L24), I'm not convinced that this is really beneficial.

> As a huge plus it would be really awesome if one could get a list of active network requests instead of just a count, with e.g. their URL, and/or its type of request- if known, like if it's XHR.

you could subscribe to the events you're interested in on the Target, that should give you access to that information.
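As a rough illustration of what a subscriber could do with those events, the sketch below keeps a registry of in-flight requests keyed by request id and filters it by resource type. It uses only the standard library; `InFlight`, `ActiveRequest` and `ResourceKind` are hypothetical names, not chromiumoxide types. The comments name the DevTools Protocol events (`Network.requestWillBeSent`, `Network.loadingFinished`, `Network.loadingFailed`) that would drive each call. Wiring them to an actual event listener is left out.

```rust
use std::collections::HashMap;

/// A few resource kinds relevant here; the protocol reports more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKind {
    Document,
    Stylesheet,
    Script,
    Xhr,
    Fetch,
    Other,
}

/// What we remember about a request that has started but not finished.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ActiveRequest {
    pub url: String,
    pub kind: ResourceKind,
}

/// Registry of in-flight requests, keyed by request id.
#[derive(Debug, Default)]
pub struct InFlight {
    requests: HashMap<String, ActiveRequest>,
}

impl InFlight {
    pub fn new() -> Self {
        Self::default()
    }

    /// Call when a request is about to be sent (Network.requestWillBeSent).
    pub fn started(&mut self, request_id: &str, url: &str, kind: ResourceKind) {
        let request = ActiveRequest { url: url.to_owned(), kind };
        self.requests.insert(request_id.to_owned(), request);
    }

    /// Call when a request completes or fails
    /// (Network.loadingFinished / Network.loadingFailed).
    pub fn finished(&mut self, request_id: &str) {
        self.requests.remove(request_id);
    }

    /// In-flight XHR/fetch requests only, sorted by URL; stylesheets,
    /// scripts and everything else are ignored.
    pub fn pending_dynamic(&self) -> Vec<&ActiveRequest> {
        let mut pending: Vec<&ActiveRequest> = self
            .requests
            .values()
            .filter(|r| matches!(r.kind, ResourceKind::Xhr | ResourceKind::Fetch))
            .collect();
        pending.sort_by(|a, b| a.url.cmp(&b.url));
        pending
    }
}
```

Because the protocol reports both the start and the end of a request, the registry can be updated as soon as a request fires rather than only when it completes.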

> Sidenote: I couldn't find Page::wait_for_navigation_response in the docs.

Page::wait_for_navigation_response returns once the page is loaded and also returns the HttpResponse.

So if I understood you correctly, you would like similar options in the Page::wait_for_navigation methods? I agree that this would be a beneficial enhancement. But I'm not sure whether monitoring all the different kinds of network requests for navigation timeouts makes sense. I think I understood most of your use case but not entirely sure, so I try to repeat it:

You're visiting a page that uses a lot of dynamic content, so the page would be considered loaded before the data you're interested in is loaded dynamically? And you want to monitor the active requests of the page to be in the loop when the page loads dynamic content? I think via a combination of event listeners and timeout this should be solvable. Maybe you could try some workarounds for your problem and if you think you found a good solution we could discuss how to get this feature into the codebase. I'm happy to help and support you on this.
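One possible shape for the "event listeners plus timeout" combination is a small tracker that is told when requests start and finish and answers whether at most X requests have been active for at least Y time. This is a standard-library-only sketch under that assumption; `IdleTracker` and its methods are hypothetical names and not part of chromiumoxide. Timestamps are passed in explicitly so the logic can be exercised without real waiting.

```rust
use std::time::{Duration, Instant};

/// Tracks whether at most `max_active` requests have been in flight
/// for at least `quiet_for`. Hypothetical helper for illustration.
pub struct IdleTracker {
    max_active: usize,
    quiet_for: Duration,
    active: usize,
    /// Start of the current quiet window; `None` while too many
    /// requests are in flight.
    quiet_since: Option<Instant>,
}

impl IdleTracker {
    /// Starts with no active requests, so the quiet window opens at `now`.
    pub fn new(max_active: usize, quiet_for: Duration, now: Instant) -> Self {
        Self { max_active, quiet_for, active: 0, quiet_since: Some(now) }
    }

    /// Call when a request starts.
    pub fn request_started(&mut self, now: Instant) {
        self.active += 1;
        self.update(now);
    }

    /// Call when a request finishes or fails.
    pub fn request_finished(&mut self, now: Instant) {
        self.active = self.active.saturating_sub(1);
        self.update(now);
    }

    /// Close the quiet window when over the limit; open it when the
    /// count drops back to the limit or below.
    fn update(&mut self, now: Instant) {
        if self.active > self.max_active {
            self.quiet_since = None;
        } else if self.quiet_since.is_none() {
            self.quiet_since = Some(now);
        }
    }

    /// True once the quiet window has lasted at least `quiet_for`.
    pub fn is_idle(&self, now: Instant) -> bool {
        match self.quiet_since {
            Some(since) => now.saturating_duration_since(since) >= self.quiet_for,
            None => false,
        }
    }
}
```

With `max_active = 0` and `quiet_for = 500ms` this approximates the networkidle0 behaviour described earlier in the thread, and `max_active = 2` approximates networkidle2.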

leaty commented 3 years ago

Informative response, thank you!

> you could subscribe to the events you're interested in on the Target, that should give you access to that information.

If I understood this right, I could subscribe to a "network" event and catch all network requests as they complete? Or is there also a way to catch them once they fire, rather than when they finish? If so, then that already solves my use case. I'd only suggest adding a hint in the docs about this, because I do believe it's a great feature.

> I think I understood most of your use case but not entirely sure, so I try to repeat it:

Yes, pretty close, but I'll try to explain why exactly.

Basically, to build a fully functional web crawler today, for some sites you need to be able to run javascript, otherwise you might miss out on certain dynamic content, or the ability to crawl at all because the website runs purely through dynamic content.

One way would be to implement a javascript engine, but then you also need the DOM aspect to make such javascript function, like google has done internally. I've decided against this because it's quite a bit of effort.

Another way to do this is through controlling a webdriver, but here's the problem: you have an unknown variable of time to wait for dynamic content. You can't just tell the webdriver to visit the page and start grabbing content immediately, because you'll lose out on a lot of content. To solve this, you need to force a wait for X amount of time, to hopefully catch all dynamic content. Some pages take over 3-5 seconds to load everything, some more; most, of course, take way less. But because it's an unknown variable of time, you need to wait the same amount for every page. As you can imagine, this makes crawling javascript-based websites extremely slow. There is a mediocre solution to this, inserting a MutationObserver script, but it has its own flaws.

The alternative would then be to use Chrome DevTools Protocol, as it has more access in terms of deciding when exactly the page, including dynamic content, has finished loading. So, to speed up crawling to the utmost possibility here, you'd want to be able to tell when all important dynamic content has fully been loaded in. This way, some pages may take 3 seconds to crawl, some take only 100ms etc. It speeds it up a goooood amount.
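The difference between a fixed wait and an adaptive one can be sketched as a polling loop with an upper bound: return as soon as a "done" check passes, and only fall back to the full timeout for pages that never settle. This is a blocking, standard-library-only illustration; `wait_until` is a hypothetical helper, and in async code the `sleep` call would be replaced by an async timer.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `done` every `poll` interval until it returns true or `timeout`
/// has elapsed. Returns the elapsed time on success, `None` on timeout.
/// Hypothetical helper for illustration.
pub fn wait_until<F>(mut done: F, poll: Duration, timeout: Duration) -> Option<Duration>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        // Fast pages return here almost immediately.
        if done() {
            return Some(start.elapsed());
        }
        // Slow or never-idle pages are cut off at the upper bound.
        if start.elapsed() >= timeout {
            return None;
        }
        sleep(poll);
    }
}
```

Here `done` would be whatever "all important dynamic content has loaded" means for the crawler, for example a check that no XHR requests have been in flight for some quiet period.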

I hope that makes it clearer, and I'd definitely be interested in hacking together a solution if it's possible, which could be implemented into this library in the end!

Sytten commented 1 year ago

I kinda have the same problem. I am grabbing a screenshot after doing a set_content, but it looks like the external resources (css) aren't loaded yet. I am unsure why that is, since the load event should be emitted once the page is rendered.

j-mendez commented 4 months ago

@Sytten I am using the following in another app to replicate the event https://github.com/mattsse/chromiumoxide/pull/201