gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.26k stars 1.76k forks source link

`c.SetRequestTimeout` Doesn't forcefully close request after time limit is exceed #644

Open JustinTimperio opened 3 years ago

JustinTimperio commented 3 years ago

Hi all, I am part of a research project called "Project-Lernaean". We are working on mapping and surveying the internet for both developers and security researchers. We have been using Colly quite successfully in our scans but recently I have run into a fairly obscure issue. My current guess is that honeypots create endless pages that are deadlocking a few jobs infinitely. I have set very aggressive timeouts which you can see below, but even still typically in a job of 10k IP's 4-15 will fully deadlock and c.Wait() never completes.

 // Create Colly Collector
        c := colly.NewCollector(
                colly.MaxBodySize(10e9),
                colly.DetectCharset(),
                colly.Async(true),
        )
        c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 500})
        c.SetRequestTimeout(5 * time.Second)
        c.WithTransport(&http.Transport{
                Proxy: http.ProxyFromEnvironment,
                DialContext: (&net.Dialer{
                        Timeout:   5 * time.Second,
                        KeepAlive: 5 * time.Second,
                        DualStack: true,
                }).DialContext,
                MaxIdleConns:          500,
                IdleConnTimeout:       5 * time.Second,
                TLSHandshakeTimeout:   5 * time.Second,
                ExpectContinueTimeout: 1 * time.Second,
        })

I would appreciate any or guidance that the community has on dealing with this issue. I am also happy to create a test repository that can duplicate this issue. If possible adding a c.ForceClose() function would be extremely helpful in the case that workers for whatever reason, are unable to terminate.

WGH- commented 3 years ago

Try to send SIGQUIT (Ctrl+). Go will dump goroutine stacks traces and exit. Stack traces might help figure out where they got stuck.

JustinTimperio commented 3 years ago
^\SIGQUIT: quit
PC=0x46d301 m=0 sigcode=128

goroutine 0 [idle]:
runtime.futex(0xec6450, 0x80, 0x0, 0x0, 0x7ffd00000000, 0x0, 0xadaf50, 0x0, 0x7ffd8e7e4268, 0x40c17f, ...)
        /usr/lib/go/src/runtime/sys_linux_amd64.s:579 +0x21
runtime.futexsleep(0xec6450, 0x0, 0xffffffffffffffff)
        /usr/lib/go/src/runtime/os_linux.go:44 +0x46
runtime.notesleep(0xec6450)
        /usr/lib/go/src/runtime/lock_futex.go:159 +0x9f
runtime.mPark()
        /usr/lib/go/src/runtime/proc.go:1340 +0x39
runtime.stopm()
        /usr/lib/go/src/runtime/proc.go:2301 +0x92
runtime.findrunnable(0xc000033000, 0x0)
        /usr/lib/go/src/runtime/proc.go:2960 +0x72e
runtime.schedule()
        /usr/lib/go/src/runtime/proc.go:3169 +0x2d7
runtime.goexit0(0xc004011980)
        /usr/lib/go/src/runtime/proc.go:3478 +0x1de
runtime.mcall(0x7f67cd18cce7)
        /usr/lib/go/src/runtime/asm_amd64.s:327 +0x5b

goroutine 1 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00035c108)
        /usr/lib/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00035c100)
        /usr/lib/go/src/sync/waitgroup.go:130 +0x65
github.com/gocolly/colly/v2.(*Collector).Wait(...)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:812
plHttpCrawler/crawler.RunJob(0xc00035ced0, 0xc, 0x1f48, 0x58, 0xfa, 0x2710, 0xc00035cf18, 0x4, 0x1, 0x61280f41, ...)
        /home/conductor/plHttpCrawler/crawler/crawl.go:201 +0x56a
plHttpCrawler/crawler.Crawl(0xc00035ced0, 0xc, 0x1f48, 0x58, 0xfa, 0x2710, 0xc00035cf18, 0x4, 0x1, 0x61280f41, ...)
        /home/conductor/plHttpCrawler/crawler/crawl.go:37 +0x5d8
main.main()
        /home/conductor/plHttpCrawler/.BUILD_SOURCE/plHttpCrawler.go:40 +0x878

goroutine 849 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f67a647a5a0, 0x72, 0xffffffffffffffff)
        /usr/lib/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc0035e2698, 0x72, 0x200, 0x200, 0xffffffffffffffff)
        /usr/lib/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0035e2680, 0xc00451c000, 0x200, 0x200, 0x0, 0x0, 0x0)
        /usr/lib/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc0035e2680, 0xc00451c000, 0x200, 0x200, 0x7f67a6149ef0, 0xc000918e00, 0x0)
        /usr/lib/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00021e910, 0xc00451c000, 0x200, 0x200, 0x0, 0x0, 0x0)
        /usr/lib/go/src/net/net.go:183 +0x91
net/http.(*readWriteCloserBody).Read(0xc000535890, 0xc00451c000, 0x200, 0x200, 0xc00451a035, 0xc00451c000, 0x7f67a6149ef0)
        /usr/lib/go/src/net/http/transport.go:2371 +0x102
net/http.(*cancelTimerBody).Read(0xc001384580, 0xc00451c000, 0x200, 0x200, 0x972700, 0x9c8101, 0xc00451c000)
        /usr/lib/go/src/net/http/client.go:939 +0x62
io.(*LimitedReader).Read(0xc0005358a8, 0xc00451c000, 0x200, 0x200, 0xc0005358a8, 0x0, 0xc000040400)
        /usr/lib/go/src/io/io.go:473 +0x63
io.ReadAll(0xae8480, 0xc0005358a8, 0x0, 0x0, 0xffffffffffffffff, 0xc000e7b3e0, 0x0)
        /usr/lib/go/src/io/io.go:633 +0xdf
io/ioutil.ReadAll(...)
        /usr/lib/go/src/io/ioutil/ioutil.go:27
github.com/gocolly/colly/v2.(*httpBackend).Do(0xc000071440, 0xc000918e00, 0x2540be400, 0xc0013fff38, 0x0, 0x0, 0x0)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/http_backend.go:209 +0x349
github.com/gocolly/colly/v2.(*httpBackend).Cache(0xc000071440, 0xc000918e00, 0x2540be400, 0xc001138f38, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/http_backend.go:134 +0x77
github.com/gocolly/colly/v2.(*Collector).fetch(0xc0000f6700, 0xc0000292e0, 0x19, 0xa40e5d, 0x3, 0x1, 0x0, 0x0, 0xc000ad2a50, 0xc000924210, ...)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:653 +0x2b3
created by github.com/gocolly/colly/v2.(*Collector).scrape
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:574 +0x3e5

rax    0xca
rbx    0xec6300
rcx    0x46d303
rdx    0x0
rdi    0xec6450
rsi    0x80
rbp    0x7ffd8e7e4230
rsp    0x7ffd8e7e41e8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0x1c65
r14    0xadaf50
r15    0x0
rip    0x46d301
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
JustinTimperio commented 3 years ago

I've created the following repository to show the bug in action: https://github.com/Project-Lernaean/CollyTimeoutError

To reproduce use go run main.go and it will iterate through a list of 100 IP port combos. It will always endlessly hang on at least one IP.

dherbst commented 3 years ago

My best guess is this is related to https://github.com/golang/go/issues/27752#issuecomment-492677602 The Dialer.Deadline should be setting the connection deadline, but sometimes doesn't work. If you reduce the number of urls sometimes they all work, sometimes they do not. Maybe reducing the testcase to remove Colly and demonstrate the problem with http.Transport will help get to the underlying issue.

JustinTimperio commented 3 years ago

@dherbst Thanks for the advice. I had a feeling this was lower level than colly but I was hoping that it wasn't an issue with the HTTP package. I can modify my bug demonstration to not use colly and submit it to the Golang repo but I'll do some more testing before then.

dherbst commented 3 years ago

@JustinTimperio I rewrote your test without colly and I can reproduce the hanging issue with a subset of your url list here: https://github.com/dherbst/deadlinetest

JustinTimperio commented 3 years ago

Awesome work @dherbst! I can sumbit this to the golang source repo with a full explanation of what's going on with the bug later this afternoon

k4lizen commented 1 year ago

Then c.SetRequestTimeout() should probably do what this comment says right? I assume it doesn't currently, as then @JustinTimperio wouldn't have had the issue.

net.Dialer.Timeout and net.Dialer.Deadline are indeed for the controlling the deadline of the dial operation only. Setting a request deadline with http.NewRequestWithContext should work in all circumstances, including when reads or writes to the underlying connection are slow. If that doesn't work, we have a bug, but that does not seem to be the case here.

WGH- commented 1 year ago

But Collector.SetRequestTimeout sets net/http.Client's Timeout, which looks correct.

And thus repro in github.com/dherbst/deadlinetest is wrong, as it sets the dialer's timeout.

dherbst commented 1 year ago

I see that now, thanks. Fixed the test.

k4lizen commented 1 year ago

For me https://github.com/dherbst/deadlinetest doesn't hang. Also https://github.com/Project-Lernaean/CollyTimeoutError doesn't hang (even when using urls.txt.bak). Also, https://github.com/Project-Lernaean/CollyTimeoutError doesn't actually use SetRequestTimeout anywhere in the code.

dherbst commented 1 year ago

For me https://github.com/dherbst/deadlinetest doesn't hang.

It will not hang any more, I fixed the test.

It was originally hanging because I based it off the CollyTimeoutError repo, which omitted the timeout. I haven't looked at this code in a couple of years, but if I remember correctly the issue originally reported is fixed by setting the request timeout as I have in my test. This issue should probably be closed.

WGH- commented 1 year ago

And yet the original report had c.SetRequestTimeout, which has always set net/http.Client's Timeout.

Something isn't right, but I'm inclining to close this until some concrete reproduction arrives.

jrschumacher commented 1 week ago

I can confirm that setting the Deadline in the @dherbst's example solves the issue. So in the op example there should be no hanging requests, as seen below. My suspicion is that its hanging because it cannot dial out and the lack of a deadline leaves it open.

       c := colly.NewCollector(
               colly.MaxBodySize(10e9),
               colly.DetectCharset(),
               colly.Async(true),
       )
       c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 500})
       c.SetRequestTimeout(5 * time.Second)
       c.WithTransport(&http.Transport{
               Proxy: http.ProxyFromEnvironment,
               DialContext: (&net.Dialer{
                       Timeout:   5 * time.Second,
                       KeepAlive: 5 * time.Second,
                       Deadline:  5 * time.Second, // <---- add deadline
                       DualStack: true,
               }).DialContext,
               MaxIdleConns:          500,
               IdleConnTimeout:       5 * time.Second,
               TLSHandshakeTimeout:   5 * time.Second,
               ExpectContinueTimeout: 1 * time.Second,
       })

Edit: in my test environment I was able to process a significant quantity of links, but eventually ran into a hung process so maybe this is solve some situations, but not all.