gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Async and queue #676

Open billiboy opened 2 years ago

billiboy commented 2 years ago

Can queue and async be used together in colly? I don't quite understand what queues are for.

Another question: I need to extract several categories from one site and I want to increase the speed. Should I run more scrapers, or raise the parallelism of a single one? I'm currently doing it like this:

    var (
        urls = []string{
            "https://url/annunci-italia/vendita/telefonia/?ps=150",
            "https://url/annunci-italia/vendita/informatica/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/fotografia/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/audio-video/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/videogiochi/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/arredamento-casalinghi/?order=priceasc&ps=50",
            "https://url/annunci-italia/vendita/elettrodomestici/?order=priceasc&ps=50",
            "https://url/annunci-italia/vendita/giardino-fai-da-te/?order=priceasc&ps=100",
        }
    )

    func main() {
        var wg sync.WaitGroup
        // Launch one Crawler goroutine per category URL.
        for _, u := range urls {
            wg.Add(1)
            go Scraper.Crawler(true, u, &wg)
        }
        wg.Wait()
    }

"Scraper Function":

`c := colly.NewCollector(

    colly.MaxDepth(30),
    colly.Async(true),
)
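c.Limit(&colly.LimitRule{
    // Limit rejects a rule that has neither DomainGlob nor DomainRegexp,
    // so without a pattern this rule would never take effect.
    DomainGlob:  "*",
    Parallelism: 100,
    RandomDelay: 6 * time.Second,
})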
c.SetRequestTimeout(120 * time.Second)
c.WithTransport(&http.Transport{
    DisableKeepAlives: true,
})

// Print the link of every listing card on the page.
c.OnHTML("a.SmallCard-module_link__9Ey4a.link", func(e *colly.HTMLElement) {
    l := e.Attr("href")
    if l != "" {
        fmt.Println("Url", l)
    }
})

// Follow the pagination arrow to the next page.
c.OnHTML("a.index-module_link__PZ2VK.index-module_outline__2EfuB.index-module_medium__2lAkR.pagination_arrow-button__Y0iWq", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})

c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})

c.Visit(url)
c.Wait()
wg.Done()`

I can't tell if I'm doing it right, and I'm not happy with the speed. What do you advise me to do? I would like to make it fast and stable.

jonesrussell commented 11 months ago

Hi @billiboy, have you figured it out? Care to share?

Happy New Year!

gtors commented 3 months ago

Don't use async mode with a queue, as it may cause your queue.Run to exit too early.

https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/queue/queue.go#L150-L156

https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/queue/queue.go#L179-L184

https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/queue/queue.go#L190-L193

https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/colly.go#L573-L577
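To make the early exit easier to picture, here is a small stdlib-only toy model. It is not colly's code: `visitFunc`, `runQueue`, and `countFinished` are hypothetical names invented for this sketch, and the sleep stands in for network latency. The assumption it illustrates is that a queue runner treats a request as done when the visit call returns, while an async visit returns before its fetch has finished.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// visitFunc is a hypothetical stand-in for a collector's visit call.
type visitFunc func(url string)

// runQueue is a toy queue runner: `threads` consumers drain the URL list,
// and runQueue returns once every visit call has returned.
func runQueue(urls []string, threads int, visit visitFunc) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				visit(u)
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}

// countFinished reports how many simulated fetches had completed
// at the moment runQueue returned.
func countFinished(urls []string, async bool) int64 {
	var finished int64
	var inflight sync.WaitGroup

	fetch := func(string) {
		time.Sleep(50 * time.Millisecond) // simulated network latency
		atomic.AddInt64(&finished, 1)
	}
	visit := func(u string) {
		if async {
			// Async mode: start the fetch in the background and return at once.
			inflight.Add(1)
			go func() {
				defer inflight.Done()
				fetch(u)
			}()
			return
		}
		// Sync mode: block until the fetch is done.
		fetch(u)
	}

	runQueue(urls, 2, visit)
	n := atomic.LoadInt64(&finished)
	inflight.Wait() // let background fetches finish before returning
	return n
}

func main() {
	urls := []string{"a", "b", "c", "d"}
	fmt.Println("sync, finished when queue returned: ", countFinished(urls, false))
	fmt.Println("async, finished when queue returned:", countFinished(urls, true))
}
```

In the sync case all four fetches are complete when the runner returns. In the async case the runner normally returns while the fetches are still in flight, so any work they would have scheduled arrives too late. If this model matches what the linked lines do, keeping the collector synchronous and letting the queue's consumer threads supply the concurrency should avoid the problem.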

Gotchas:

https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/colly.go#L368-L372