gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.2k stars 1.76k forks source link

trying to measure request timing #95

Closed cameronbraid closed 6 years ago

cameronbraid commented 6 years ago

I'm doing some timing during a crawl recording the start time in Ctx in OnRequest and calculating a duration in OnResponse. It all works very well until I try to throttle the crawler with a Limit with Delay since the sleep is called by the backend after the OnRequest callbacks

Is there another callback that can be used, or would you consider moving the sleep to before the OnRequest is called ?

asciimoo commented 6 years ago

You can measure request duration by overwriting the default transport of the collector: https://godoc.org/github.com/gocolly/colly#Collector.WithTransport . I'd keep the sleep after the OnRequest, because its length can be useful when you debug parallel scrapers. Perhaps the best would be some kind of integration with https://godoc.org/net/http/httptrace

cameronbraid commented 6 years ago

Thanks, I'll look into Collector.WithTransport and httptrace

cameronbraid commented 6 years ago

I've identified an issue with using httptrace, let me provide some context.

I made a patch to colly.Request to add a ClientTrace *httptrace.ClientTrace member to the struct, therfore allowing an OnRequest callback to provide the trace hooks. I then modified func (c *Collector) scrape after c.handleOnRequest(request) to update the request with the trace hooks if one is provided :

if request.ClientTrace != nil {
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), request.ClientTrace))
}

Now this works great for writing a hook to store the start time of the http request :

c.OnRequest(func(r *colly.Request) {
  r.ClientTrace = &httptrace.ClientTrace{
    GotConn: func(info httptrace.GotConnInfo) {
      r.Ctx.Put("start", time.Now())
    },
  }
})

Unfortunately httptrace doesn't provide a hook for 'the response has ended' . Therefore I had to use the OnResponse callback to record the duration :

c.OnResponse(func(r *colly.Response) {
    start := r.Ctx.GetAny("start")
    if start != nil {
        duration := time.Now().Sub(start.(time.Time))
        fmt.Println(duration, r.URL.String())
    }
})

The issue with this approach is that the time.Sleep is done in a defer function which executes before func (h *httpBackend) Do(...) returns - causing the OnResponse hander to be called after the sleep

So from this point I saw some alternatives

a) patch httptrace to provide a RequestDone hook b) introduce a new callback in colly to be called before the sleep is called c) change the colly sleep function to run in a goroutine instead of defer - this breaks your ability to debug parallel scrapers d) change the colly sleep function to run before h.Client.Do(request) is called

One observation in your docs in limit rule :

// Delay is the duration to wait before creating a new request to the matching domains
Delay time.Duration

This leads me to believe that implementing (d) would bring the implementation closed to the docs ? Or am I missing something ?

asciimoo commented 6 years ago

This leads me to believe that implementing (d) would bring the implementation closed to the docs ? Or am I missing something ?

Well, the network communication is currently started after the sleep, so I think the documentation is accurate, because the sleep in a request's defer affects only the following requests and they wait because r.WaitChan <- true (http_backend.go:L152) blocks while the previous request's defer sleeps.

Isn't the custom round tripper solve your problem?

cameronbraid commented 6 years ago

You're right, a custom RoundTripper is the solution

Thank you.

func (t *transport) RoundTrip(req *http.Request) (res *http.Response, err error) {
    start := time.Now()
    defer func() {
        fmt.Println(res.StatusCode, time.Now().Sub(start), req.URL.String())
    }()
    return http.DefaultTransport.RoundTrip(req)
}
jimsmart commented 6 years ago

@cameronbraid BTW, you can replace time.Now().Sub(start) with time.Since(start)