jakopako / goskyr

A configurable command-line web scraper written in Go with auto-configuration capability
GNU General Public License v3.0

Slow Pagination #245

Closed (alucab closed this issue 10 months ago)

alucab commented 11 months ago

I have this simple scraper

scrapers:
  - name: discontinued
    renderJs: true
    url: https://www.hikvision.com/en/products/discontinued-products
    item: "li.layout4-content-wrapper"
    fields:
      - name: "title"
        location:
          selector: "h3.h3-seo"
      - name: status
        can_be_empty: true
        type: text
        location:
        - selector: .tag-eol
      - name: desc
        type: text
        location:
          - selector: h4.h4-seo
      - name: link
        type: url
        location:
          - selector: ".btn-details-link"
    paginator:
      location:
        selector: "#layout-pagination-wrapper > ul > li:nth-child(9) > span"
      max_pages: 3

It works correctly but it is dramatically slow. It takes at least 40 seconds for every page and I don't understand why, since the paginated data is preloaded in the page and paging through it in the browser is extremely quick.

jakopako commented 11 months ago

Thanks for bringing this up. You're right, when setting renderJs: true the scraper is pretty slow. This part of the code is not very optimized yet and there are a bunch of hardcoded timers (e.g. here: https://github.com/jakopako/goskyr/blob/main/fetch/fetcher.go#L73) to make sure the site has actually loaded. I'll look into this in more detail as soon as I have time. Feel free to make some improvements yourself and open a pull request if you have ideas :)

alucab commented 11 months ago

I will certainly take a look, but I have very limited experience with Go, so I see it more as a learning opportunity than as a chance to contribute concretely in the short term.

I had a look at the function and its invocation and I don't see huge timers (1 or 5 seconds max). As I said, I'm not an expert, but could the slowness come from instantiating a chromedp headless browser for every call?

Thanks for being so responsive!

jakopako commented 11 months ago

Sure, no worries!

Yeah, you might be totally right; it could very well be that reinstantiating the browser for every request is what takes so much time.

jakopako commented 11 months ago

I released a new version, 0.5.8, that should be a little better speed-wise. It reuses the same browser instance for multiple requests, and you can change the default page load wait time (which is now 2 seconds) with the key page_load_wait_sec. So your config could look something like:

scrapers:
  - name: discontinued
    renderJs: true
    page_load_wait_sec: 1
    url: https://www.hikvision.com/en/products/discontinued-products
...

It's still not very fast but it should already be better than before. Hope this helps!
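
In case it helps to see what "reusing the same browser instance" means in practice, here is a rough sketch with chromedp. This is not the actual goskyr fetcher code; the fetchAll helper, the example URL and the fixed sleep are purely illustrative:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
)

// fetchAll renders each URL in the same headless browser and returns the
// resulting HTML, waiting a fixed duration after navigation so the page's
// JavaScript has a chance to finish.
func fetchAll(urls []string, wait time.Duration) (map[string]string, error) {
	// Start the headless browser once...
	allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(),
		chromedp.DefaultExecAllocatorOptions[:]...)
	defer cancelAlloc()

	// ...and create a single browser context that is reused for every
	// request instead of being spun up per page.
	browserCtx, cancelBrowser := chromedp.NewContext(allocCtx)
	defer cancelBrowser()

	results := make(map[string]string)
	for _, u := range urls {
		var html string
		err := chromedp.Run(browserCtx,
			chromedp.Navigate(u),
			chromedp.Sleep(wait), // crude page-load wait, analogous to the page load wait setting
			chromedp.OuterHTML("html", &html),
		)
		if err != nil {
			return nil, fmt.Errorf("fetching %s: %w", u, err)
		}
		results[u] = html
	}
	return results, nil
}

func main() {
	pages, err := fetchAll(
		[]string{"https://www.hikvision.com/en/products/discontinued-products"},
		500*time.Millisecond,
	)
	if err != nil {
		panic(err)
	}
	for u, html := range pages {
		fmt.Printf("%s: %d bytes\n", u, len(html))
	}
}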

jakopako commented 11 months ago

Actually, I changed the parameter's name to page_load_wait and the unit to milliseconds (instead of seconds) in version 0.5.9.
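
With 0.5.9, the example above would then look something like this (same one-second wait as before, assuming a plain integer value in milliseconds):

scrapers:
  - name: discontinued
    renderJs: true
    page_load_wait: 1000
    url: https://www.hikvision.com/en/products/discontinued-products
...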

alucab commented 11 months ago

Beautiful!

I'll study your commit to learn more about the tool.

jakopako commented 10 months ago

Closing this issue now. There are still ways to improve the scraping speed for dynamic pages, but quite a bit of improvement has already been achieved since this issue was opened. Issue #253 describes one potential further improvement.

@alucab feel free to open another issue if there's anything else that can be improved.