gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.24k stars 1.76k forks source link

Go coroutines are created but blocked by parallelism logic #696

Open williamjulianvicary opened 2 years ago

williamjulianvicary commented 2 years ago

I was debugging some logic I was adding to an OnRequest (stopping a crawl at a timeout) and as I walked through the code I noticed something that felt a little off:

This method is invoked by a go coroutine everytime a URL is queued for crawling: https://github.com/gocolly/colly/blob/master/http_backend.go#L170

The wait channel blocks on an addition if the channel is full, and then releases after a given timeout after the request - that all makes sense... however this means if you have a long wait between requests you accumulate memory usage and gocoroutines that are all sat around and waiting for an open channel slot.

What do you think to adjusting the logic here, so the limiting happens outside of the http_backend, perhaps in collector.fetch instead? That would mean the onRequest() callbacks can happen immediately prior to crawling.

Alternatively, ignoring the overhead these coroutines have, instead could a callback be added that is immediately prior to the request being invoked? Rather than prior to an item entering the queue - where the delay could be quite extensive from that point?

WGH- commented 2 years ago

What do you think to adjusting the logic here, so the limiting happens outside of the http_backend, perhaps in collector.fetch instead? That would mean the onRequest() callbacks can happen immediately prior to crawling.

I believe OnRequest callbacks already happen before rate limiting takes place. In fetch, c.handleOnRequest(request) happens first, and c.backend.Cache(...), which calls Do, happens later.

And even then moving order of calls around wouldn't change anything wrt goroutines being blocked by rate limiter, no?

Alternatively, ignoring the overhead these coroutines have, instead could a callback be added that is immediately prior to the request being invoked? Rather than prior to an item entering the queue - where the delay could be quite extensive from that point?

What queue are you referring to? github.com/gocolly/colly/v2/queue, or implicit "queue" of goroutines blocked by the rate limiter?

WGH- commented 2 years ago

If you want to avoid lots of goroutines being blocked by rate limiter, you should just use github.com/gocolly/colly/v2/queue. I believe it tries to solve exactly this problem.