Open billiboy opened 2 years ago
Hi @billiboy, have you figured it out? Care to share?
Happy New Year!
Don't use async mode with a queue, as it may cause your queue.Run to exit too early.
https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/queue/queue.go#L150-L156
https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/queue/queue.go#L179-L184
https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/queue/queue.go#L190-L193
https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/colly.go#L573-L577
Gotchas:
colly.Async(true)
- No matter whether you pass true or false, colly.Async(...) always enables async mode. This has been fixed on the master branch, but version 2.1.0 is affected.
  https://github.com/gocolly/colly/blob/bbf3f10c37205136e9d4f46fe8118205cc505a67/colly.go#L368-L372
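To make the gotcha concrete, here is a minimal sketch of the behaviour described above. The `Collector`, `Option`, `asyncIgnoringArg`, `asyncHonoringArg` and `newCollector` names are simplified stand-ins invented for illustration; this is not colly's source, only a model of an option that ignores its argument next to one that respects it.

```go
package main

import "fmt"

// Collector is a hypothetical, stripped-down stand-in for a collector type.
type Collector struct {
	Async bool
}

// Option follows the functional-option pattern.
type Option func(*Collector)

// asyncIgnoringArg models the behaviour reported for v2.1.0:
// the boolean is ignored and async mode is always switched on.
func asyncIgnoringArg(_ bool) Option {
	return func(c *Collector) { c.Async = true }
}

// asyncHonoringArg models the corrected behaviour: the argument is respected.
func asyncHonoringArg(enabled bool) Option {
	return func(c *Collector) { c.Async = enabled }
}

// newCollector applies the options to a collector that is synchronous by default.
func newCollector(opts ...Option) *Collector {
	c := &Collector{}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	fmt.Println(newCollector(asyncIgnoringArg(false)).Async) // true: argument ignored
	fmt.Println(newCollector(asyncHonoringArg(false)).Async) // false
	fmt.Println(newCollector().Async)                        // false: option omitted
}
```

If the model matches what you see on 2.1.0, the safe way to stay synchronous there is to leave the option out entirely rather than passing `false`.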
Can queue and async be used together in colly? I don't quite understand what queues are for.
Another question: if I need to extract several categories from one site and I want to increase the speed, should I use more scrapers or change the number of parallel requests? I'm currently doing it like this:

    var (
        urls = []string{
            "https://url/annunci-italia/vendita/telefonia/?ps=150",
            "https://url/annunci-italia/vendita/informatica/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/fotografia/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/audio-video/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/videogiochi/?order=priceasc&ps=30",
            "https://url/annunci-italia/vendita/arredamento-casalinghi/?order=priceasc&ps=50",
            "https://url/annunci-italia/vendita/elettrodomestici/?order=priceasc&ps=50",
            "https://url/annunci-italia/vendita/giardino-fai-da-te/?order=priceasc&ps=100",
        }
    )

    func main() {
        var wg sync.WaitGroup
        for _, u := range urls {
            wg.Add(1)
            go Scraper.Crawler(true, u, &wg)
        }
        wg.Wait()
    }
"Scraper Function":
`c := colly.NewCollector(
I can't tell if I'm doing it right, and I'm not happy with the speed. What do you advise me to do? I would like to make it performant and stable.
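One option worth trying, sketched here with the standard library only: instead of starting one unbounded goroutine per category URL, cap how many run at once so the load on the site stays predictable. `crawlAll` and `fetchFunc` are hypothetical names, and `fetch` stands in for whatever your scraper does with a single URL; this is a sketch under those assumptions, not a drop-in replacement for your `Scraper.Crawler`.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for the code that scrapes one URL.
type fetchFunc func(url string) error

// crawlAll calls fetch once per URL, with at most `parallelism` calls
// in flight at any time, and returns the error (or nil) for each URL.
func crawlAll(urls []string, parallelism int, fetch fetchFunc) map[string]error {
	if parallelism < 1 {
		parallelism = 1
	}
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results = make(map[string]error, len(urls))
		sem     = make(chan struct{}, parallelism) // counting semaphore
	)
	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while `parallelism` fetches are running
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this fetch ends
			err := fetch(u)
			mu.Lock()
			results[u] = err
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}
	results := crawlAll(urls, 2, func(u string) error {
		// A real scraper would visit u here; this placeholder does nothing.
		return nil
	})
	fmt.Println(len(results)) // 3
}
```

With a cap like this you can raise or lower `parallelism` and measure, rather than guessing. Whether this beats a single collector with its own parallelism limit depends on the site and its rate limits, so treat the number as something to tune.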