Closed alucab closed 10 months ago
Thanks for bringing this up. You're right, when setting renderJs: true, the scraper is pretty slow. This part of the code is not very optimized yet and there are a bunch of hardcoded timers (e.g. here: https://github.com/jakopako/goskyr/blob/main/fetch/fetcher.go#L73) to make sure the site has actually loaded. I'll look into this in more detail as soon as I have time. Feel free to make some improvements yourself and open a pull request if you have some ideas :)
I will certainly take a look, but I have veeeeery limited experience with golang, so I take it more as a learning opportunity than as a chance to concretely contribute in the short term.
I had a look at the function and where it is invoked, and I don't see any huge timers (1 or 5 secs max). As I said, I am no expert, but might the slowness come from instantiating a chromedp headless browser for every call?
Thanks for being so responsive!
Sure, no worries!
Yeah, you might be totally right, it could very well be that the reinstantiation is the thing that takes so much time.
I released a new version, 0.5.8, that should be a little better speed-wise. It reuses the same browser instance for multiple requests, and you can change the default page load wait time (which is now 2 seconds) with the key page_load_wait_sec. So your config could look something like:
scrapers:
  - name: discontinued
    renderJs: true
    page_load_wait_sec: 1
    url: https://www.hikvision.com/en/products/discontinued-products
    ...
It's still not very fast but it should already be better than before. Hope this helps!
I actually changed the parameter's name to page_load_wait and the unit to milliseconds (instead of seconds) in version 0.5.9.
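For illustration, here is a sketch of what the same config might look like from 0.5.9 on. It assumes the renamed key sits in the same place in the scraper config as page_load_wait_sec did in the 0.5.8 example; the value 1000 is simply the earlier 1 second expressed in milliseconds.

```yaml
scrapers:
  - name: discontinued
    renderJs: true
    # Renamed from page_load_wait_sec in 0.5.9; the unit is now milliseconds.
    # 1000 ms equals the 1 second used in the 0.5.8 example.
    # The key's position here is an assumption carried over from that example.
    page_load_wait: 1000
    url: https://www.hikvision.com/en/products/discontinued-products
    ...
```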
Beautiful!
I'll study your commit to learn more about the tool.
Closing this issue now. There are still ways to improve the scraping speed for dynamic pages but quite some improvement has already been achieved since this issue was opened. Issue #253 describes one potential further improvement.
@alucab feel free to open another issue if there's anything else that can be improved.
I have this simple scraper
It works correctly but it is dramatically slow. It takes at least 40 secs for every page and I don't understand why, as the pagination as well as the data is preloaded in the page and a refresh in the browser is extremely quick.