Closed cameronbraid closed 6 years ago
You can measure request duration by overwriting the default transport of the collector: https://godoc.org/github.com/gocolly/colly#Collector.WithTransport . I'd keep the sleep after the OnRequest, because its length can be useful when you debug parallel scrapers. Perhaps the best would be some kind of integration with https://godoc.org/net/http/httptrace
Thanks, I'll look into Collector.WithTransport and httptrace
I've identified an issue with using httptrace, let me provide some context.
I made a patch to colly.Request to add a ClientTrace *httptrace.ClientTrace
member to the struct, therfore allowing an OnRequest callback to provide the trace hooks. I then modified
func (c *Collector) scrape
after c.handleOnRequest(request)
to update the request with the trace hooks if one is provided :
if request.ClientTrace != nil {
req = req.WithContext(httptrace.WithClientTrace(req.Context(), request.ClientTrace))
}
Now this works great for writing a hook to store the start time of the http request :
c.OnRequest(func(r *colly.Request) {
r.ClientTrace = &httptrace.ClientTrace{
GotConn: func(info httptrace.GotConnInfo) {
r.Ctx.Put("start", time.Now())
},
}
})
Unfortunately httptrace doesn't provide a hook for 'the response has ended' . Therefore I had to use the OnResponse callback to record the duration :
c.OnResponse(func(r *colly.Response) {
start := r.Ctx.GetAny("start")
if start != nil {
duration := time.Now().Sub(start.(time.Time))
fmt.Println(duration, r.URL.String())
}
})
The issue with this approach is that the time.Sleep is done in a defer function which executes before func (h *httpBackend) Do(...)
returns - causing the OnResponse hander to be called after the sleep
So from this point I saw some alternatives
a) patch httptrace to provide a RequestDone hook
b) introduce a new callback in colly to be called before the sleep is called
c) change the colly sleep function to run in a goroutine instead of defer - this breaks your ability to debug parallel scrapers
d) change the colly sleep function to run before h.Client.Do(request)
is called
One observation in your docs in limit rule :
// Delay is the duration to wait before creating a new request to the matching domains
Delay time.Duration
This leads me to believe that implementing (d) would bring the implementation closed to the docs ? Or am I missing something ?
This leads me to believe that implementing (d) would bring the implementation closed to the docs ? Or am I missing something ?
Well, the network communication is currently started after the sleep, so I think the documentation is accurate, because the sleep in a request's defer affects only the following requests and they wait because r.WaitChan <- true
(http_backend.go:L152) blocks while the previous request's defer sleeps.
Isn't the custom round tripper solve your problem?
You're right, a custom RoundTripper is the solution
Thank you.
func (t *transport) RoundTrip(req *http.Request) (res *http.Response, err error) {
start := time.Now()
defer func() {
fmt.Println(res.StatusCode, time.Now().Sub(start), req.URL.String())
}()
return http.DefaultTransport.RoundTrip(req)
}
@cameronbraid BTW, you can replace time.Now().Sub(start)
with time.Since(start)
I'm doing some timing during a crawl recording the start time in Ctx in OnRequest and calculating a duration in OnResponse. It all works very well until I try to throttle the crawler with a Limit with Delay since the sleep is called by the backend after the OnRequest callbacks
Is there another callback that can be used, or would you consider moving the sleep to before the OnRequest is called ?