ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License
761 stars 142 forks source link

netx: blocked waiting for start receiving or ending receiving #1609

Closed xhdix closed 3 years ago

xhdix commented 4 years ago

Describe the bug In urlgetter (aladdin), after finishing handshake, the program gets stuck, sometimes before the start of the receive and sometimes to complete the receive

logs terminated by ctrl+\

[      3.539466] <info> Report ID: 20200601T235337Z_AS197207_42VHY0LKq4YHD6ueh6ePQXiIvPG82ARZbTZfZtN5uXRVW88qKm
[      3.539692] <info> [1/1] running with input: https://www.waze.com/
[      3.539904] <debug> no tunnel has been requested
[      3.540161] <debug> > GET https://www.waze.com/
[      3.540448] <debug> > Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
[      3.540612] <debug> > Accept-Language: en-US;q=0.8,en;q=0.5
[      3.540891] <debug> > User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
[      3.541147] <debug> > Host: www.waze.com
[      3.541433] <debug> >
[      3.541883] <debug> resolve www.waze.com...
[      3.542224] <debug> resolve www.waze.com... ([130.211.9.172], <nil>) in 1.992µs
[      3.542417] <debug> dial 130.211.9.172:443/tcp...
[      3.639942] <debug> dial 130.211.9.172:443/tcp... <nil> in 97.245881ms
[      3.640226] <debug> tls {sni=www.waze.com next=[h2 http/1.1]}...
[      3.757899] <debug> tls {sni=www.waze.com next=[h2 http/1.1]}... <nil> in 117.471011ms {next=h2 cipher=TLS_AES_128_GCM_SHA256 v=TLSv1.3}
[      5.132172] <debug> < 200
[      5.132342] <debug> < Server: nginx
[      5.132487] <debug> < Date: Mon, 01 Jun 2020 23:53:38 GMT
[      5.132640] <debug> < Content-Security-Policy-Report-Only: style-src 'self' web-assets.waze.com fonts.googleapis.com; object-src 'none'; base-uri 'none'; script-src 'self' 'unsafe-inline' web-assets.waze.com www.youtube.com www.googleadservices.com www.googletagmanager.com www.google-analytics.com googleads.g.doubleclick.net s.ytimg.com connect.facebook.net apis.google.com www.google.com www.gstatic.com ; report-uri https://csp.withgoogle.com/csp/waze/20200116_experiment;
[      5.132906] <debug> < Via: 1.1 google
[      5.133067] <debug> < Alt-Svc: h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
[      5.133338] <debug> < X-Content-Type-Options: nosniff
[      5.133581] <debug> < X-Request-Id: cf6669a7-772a-4748-bb2b-596f4e666561
[      5.133856] <debug> < X-Runtime: 0.034747
[      5.134058] <debug> < Cache-Control: max-age=600, public
[      5.134335] <debug> < X-Cloud-Trace-Context: 0dcadff3bf4128d9f3d0024b4a345b52/923450412863900458;o=0
[      5.134608] <debug> < X-Cloud-Trace-Context: 0dcadff3bf4128d9f3d0024b4a345b52
[      5.134909] <debug> < Content-Type: text/html; charset=utf-8
[      5.135137] <debug> < Vary: Accept-Encoding
[      5.135371] <debug> < Vary: Accept-Encoding
[      5.135601] <debug> < X-Frame-Options: SAMEORIGIN
[      5.135822] <debug> < Content-Length: 26104
[      5.136031] <debug> < X-Xss-Protection: 1; mode=block
[      5.136217] <debug> < Etag: W/"a7e6792417bd6564b5698d84bb90f9f6"
[      5.136441] <debug> <
SIGQUIT: quit
PC=0x466eb1 m=0 sigcode=128

goroutine 0 [idle]:
runtime.futex(0x189d2a8, 0x80, 0x0, 0x0, 0x7ffd00000000, 0x466ad5, 0x14b831, 0x10155d9, 0x7ffd966eea18, 0x40cecf, ...)
    /usr/lib/go-1.14/src/runtime/sys_linux_amd64.s:567 +0x21
runtime.futexsleep(0x189d2a8, 0x0, 0xffffffffffffffff)
    /usr/lib/go-1.14/src/runtime/os_linux.go:44 +0x46
runtime.notesleep(0x189d2a8)
    /usr/lib/go-1.14/src/runtime/lock_futex.go:151 +0x9f
runtime.stopm()
    /usr/lib/go-1.14/src/runtime/proc.go:1828 +0xc0
runtime.findrunnable(0xc000044000, 0x0)
    /usr/lib/go-1.14/src/runtime/proc.go:2360 +0xa0d
runtime.schedule()
    /usr/lib/go-1.14/src/runtime/proc.go:2520 +0x2fc
runtime.park_m(0xc000001080)
    /usr/lib/go-1.14/src/runtime/proc.go:2690 +0x9d
runtime.mcall(0xda0d60)
    /usr/lib/go-1.14/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [sync.Cond.Wait, 3 minutes]:
runtime.goparkunlock(...)
    /usr/lib/go-1.14/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc0000dc9e0, 0x0)
    /usr/lib/go-1.14/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc0000dc9d0)
    /usr/lib/go-1.14/src/sync/cond.go:56 +0x9d
net/http.(*http2pipe).Read(0xc0000dc9c8, 0xc0000f5beb, 0x215, 0x215, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/net/http/h2_bundle.go:3515 +0x8f
net/http.http2transportResponseBody.Read(0xc0000dc9a0, 0xc0000f5beb, 0x215, 0x215, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/net/http/h2_bundle.go:8519 +0xaf
io.(*LimitedReader).Read(0xc000222ca0, 0xc0000f5beb, 0x215, 0x215, 0x5eb, 0x0, 0x0)
    /usr/lib/go-1.14/src/io/io.go:451 +0x63
bytes.(*Buffer).ReadFrom(0xc00040f990, 0x11635c0, 0xc000222ca0, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/bytes/buffer.go:204 +0xb1
io/ioutil.readAll(0x11635c0, 0xc000222ca0, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/io/ioutil/ioutil.go:36 +0xe3
io/ioutil.ReadAll(...)
    /usr/lib/go-1.14/src/io/ioutil/ioutil.go:45
github.com/ooni/probe-engine/netx/httptransport.saverSnapRead(0x11727a0, 0xc0000dc9a0, 0x20000, 0x0, 0x0, 0x0, 0x0, 0x3)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/httptransport/saver.go:141 +0xaf
github.com/ooni/probe-engine/netx/httptransport.SaverBodyHTTPTransport.RoundTrip(0x11721a0, 0xc00000f640, 0xc00000f3a0, 0x0, 0xc0004a5400, 0xf6f3c0, 0xc0003fc901, 0xc00010ec80)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/httptransport/saver.go:125 +0xc9
github.com/ooni/probe-engine/netx/httptransport.SaverPerformanceHTTPTransport.RoundTrip(0x1172160, 0xc00000f660, 0xc00000f3a0, 0xc0004a5300, 0x5ed59500, 0xc00000f3b8, 0x1042508)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/httptransport/saver.go:39 +0x9f
github.com/ooni/probe-engine/netx/httptransport.SaverTransactionHTTPTransport.RoundTrip(0x11721e0, 0xc00000f680, 0xc00000f3a0, 0xc0004a5300, 0xfbfc8a, 0xa, 0xc00042d6b8)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/httptransport/saver.go:84 +0x107
github.com/ooni/probe-engine/netx/httptransport.UserAgentTransport.RoundTrip(0x1172220, 0xc00000f6a0, 0xc0004a5300, 0xc0003fc930, 0xc00018e688, 0x40ecd8)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/httptransport/useragent.go:16 +0x6e
net/http.send(0xc0004a5300, 0x7f88c684afb0, 0xc00028ebf0, 0x0, 0x0, 0x0, 0xc0000100e8, 0xc00028ebf0, 0x1, 0x0)
    /usr/lib/go-1.14/src/net/http/client.go:252 +0x43e
net/http.(*Client).send(0xc0003fc900, 0xc0004a5300, 0x0, 0x0, 0x0, 0xc0000100e8, 0x0, 0x1, 0xee8520)
    /usr/lib/go-1.14/src/net/http/client.go:176 +0xfa
net/http.(*Client).do(0xc0003fc900, 0xc0004a5300, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/net/http/client.go:699 +0x44a
net/http.(*Client).Do(...)
    /usr/lib/go-1.14/src/net/http/client.go:567
github.com/ooni/probe-engine/experiment/urlgetter.Runner.httpGet(0x7ffd966ef514, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment/urlgetter/runner.go:61 +0x545
github.com/ooni/probe-engine/experiment/urlgetter.Runner.Run(0x7ffd966ef514, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment/urlgetter/runner.go:32 +0x18f
github.com/ooni/probe-engine/experiment/urlgetter.Getter.get(0x7ffd966ef514, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment/urlgetter/getter.go:84 +0x576
github.com/ooni/probe-engine/experiment/urlgetter.Getter.Get(0x7ffd966ef514, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment/urlgetter/getter.go:28 +0xfe
github.com/ooni/probe-engine/experiment/urlgetter.measurer.Run(0x7ffd966ef514, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment/urlgetter/urlgetter.go:75 +0x160
github.com/ooni/probe-engine.(*Experiment).MeasureWithContext(0xc000110f50, 0x117b4e0, 0xc0000a4000, 0x7ffd966ef531, 0x15, 0x3, 0x0, 0x0)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment.go:290 +0x23d
github.com/ooni/probe-engine.(*Experiment).Measure(...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/experiment.go:275
github.com/ooni/probe-engine/libminiooni.MainWithConfiguration(0x7ffd966ef547, 0x9, 0xc0003f87c0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc0003eebd0, ...)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/libminiooni/libminiooni.go:372 +0x1094
github.com/ooni/probe-engine/libminiooni.Main()
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/libminiooni/libminiooni.go:139 +0xab
main.main()
    /media/sf_mk/work/src/aladdin/cmd/aladdin/main.go:6 +0x20

goroutine 55 [IO wait, 3 minutes]:
internal/poll.runtime_pollWait(0x7f88c684aa48, 0x72, 0xffffffffffffffff)
    /usr/lib/go-1.14/src/runtime/netpoll.go:203 +0x55
internal/poll.(*pollDesc).wait(0xc00010ed98, 0x72, 0x1000, 0x104f, 0xffffffffffffffff)
    /usr/lib/go-1.14/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
    /usr/lib/go-1.14/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc00010ed80, 0xc0001bc000, 0x104f, 0x104f, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/internal/poll/fd_unix.go:169 +0x19b
net.(*netFD).Read(0xc00010ed80, 0xc0001bc000, 0x104f, 0x104f, 0x19, 0x0, 0x0)
    /usr/lib/go-1.14/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc0000b44e0, 0xc0001bc000, 0x104f, 0x104f, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/net/net.go:184 +0x8e
github.com/ooni/probe-engine/netx/dialer.ErrorWrapperConn.Read(0x11861a0, 0xc0000b44e0, 0xdb70, 0x5, 0xc0001bc000, 0x104f, 0x104f, 0x5ed59502, 0xc026c15b6c, 0x4d2d7ccc70fed)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/dialer/errorwrapper.go:43 +0x70
github.com/ooni/probe-engine/netx/dialer.saverConn.Read(0x1185de0, 0xc00021a780, 0xc00000f3a0, 0xc0001bc000, 0x104f, 0x104f, 0xc0001ba298, 0xc00014d930, 0x6353a4)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/dialer/saver.go:96 +0x9e
github.com/ooni/probe-engine/netx/dialer.byteCounterConnWrapper.Read(0x11863e0, 0xc00021a7a0, 0xc000263f80, 0xc0003eec60, 0xc0001bc000, 0x104f, 0x104f, 0xc0001bc005, 0x0, 0x104a)
    /media/sf_mk/work/src/aladdin/vendor/github.com/ooni/probe-engine/netx/dialer/bytecounter.go:74 +0x5c
crypto/tls.(*atLeastReader).Read(0xc000222d00, 0xc0001bc000, 0x104f, 0x104f, 0x573, 0x104a, 0xc00014da30)
    /usr/lib/go-1.14/src/crypto/tls/conn.go:760 +0x60
bytes.(*Buffer).ReadFrom(0xc0001173d8, 0x1162640, 0xc000222d00, 0x40c3b5, 0xe93320, 0xf62440)
    /usr/lib/go-1.14/src/bytes/buffer.go:204 +0xb1
crypto/tls.(*Conn).readFromUntil(0xc000117180, 0x7f88c684ce58, 0xc00021a7c0, 0x5, 0xc00021a7c0, 0x562)
    /usr/lib/go-1.14/src/crypto/tls/conn.go:782 +0xec
crypto/tls.(*Conn).readRecordOrCCS(0xc000117180, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/crypto/tls/conn.go:589 +0x115
crypto/tls.(*Conn).readRecord(...)
    /usr/lib/go-1.14/src/crypto/tls/conn.go:557
crypto/tls.(*Conn).Read(0xc000117180, 0xc0000f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/crypto/tls/conn.go:1233 +0x15b
bufio.(*Reader).Read(0xc000395ec0, 0xc000019ae1, 0x51f, 0x51f, 0x562, 0x0, 0x0)
    /usr/lib/go-1.14/src/bufio/bufio.go:226 +0x24f
io.ReadAtLeast(0x1162460, 0xc000395ec0, 0xc000018000, 0x2000, 0x2000, 0x2000, 0x1, 0x0, 0x0)
    /usr/lib/go-1.14/src/io/io.go:310 +0x87
io.ReadFull(...)
    /usr/lib/go-1.14/src/io/io.go:329
net/http.(*http2Framer).ReadFrame(0xc0003609a0, 0xc000162150, 0x0, 0x0, 0x0)
    /usr/lib/go-1.14/src/net/http/h2_bundle.go:1745 +0x13f
net/http.(*http2clientConnReadLoop).run(0xc00014dfa8, 0xc0001172b8, 0x1042508)
    /usr/lib/go-1.14/src/net/http/h2_bundle.go:8246 +0x8d
net/http.(*http2ClientConn).readLoop(0xc000198900)
    /usr/lib/go-1.14/src/net/http/h2_bundle.go:8174 +0x6f
created by net/http.(*http2Transport).newClientConn
    /usr/lib/go-1.14/src/net/http/h2_bundle.go:7174 +0x64a

rax    0xca
rbx    0x189d160
rcx    0x466eb3
rdx    0x0
rdi    0x189d2a8
rsi    0x80
rbp    0x7ffd966ee9e0
rsp    0x7ffd966ee998
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x3
r13    0x189c680
r14    0x7f88c8088000
r15    0x0
rip    0x466eb1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
bassosimone commented 4 years ago

I believe the reason why we're seeing this failure here is that there is no timeout. My recollection is also that the originally proposed fix was to add a timeout. A possible place where to put a timeout would be urlgetter itself. We should implement this as a patch and then check again whether the issue is fixed. It may also be interesting to write a Jafar check for that.

bassosimone commented 4 years ago

This should be fixed by the work in https://github.com/ooni/probe-engine/pull/872. I have adopted a very minimal approach where the objective was to merely solve the problem at hand. A more complex approach could be implemented if needed.

bassosimone commented 3 years ago

Reopening because we've seen a relapse.

bassosimone commented 3 years ago

The code is blocked in saverSnapRead, where it's not bounded by a context. It means we probably need to enforce using the context on a more deep level. It's not 100% clear what is the best way and most robust way of doing that. Setting a timeout for connections has the side effect that we are going to loose long living connections. We also need to ensure we apply this pattern everywhere where it could possibly happen that the injection of packets de-syncs the underlying network connection.

bassosimone commented 3 years ago

This is another instance in which the code failed:

[      4.394232] <info> [1/1] running with input: dnslookup://example.com
[      4.394916] <debug> no tunnel has been requested
[      4.395333] <debug> resolve example.com...
[      4.397165] <debug> > POST https://dns-family.adguard.com/dns-query
[      4.397311] <debug> > Content-Type: application/dns-message
[      4.397478] <debug> > Host: dns-family.adguard.com
[      4.397619] <debug> > User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36
[      4.397793] <debug> >
[      4.398026] <debug> resolve dns-family.adguard.com...
[      4.452800] <debug> resolve dns-family.adguard.com... ([176.103.130.134 176.103.130.132 2a00:5a60::bad1:ff 2a00:5a60::bad2:ff], <nil>) in 54.546417ms
[      4.453265] <debug> dial 176.103.130.134:443/tcp...
[      4.681474] <debug> dial 176.103.130.134:443/tcp... <nil> in 227.868657ms
[      4.682093] <debug> tls {sni=dns-family.adguard.com next=[h2 http/1.1]}...
2020/07/10 14:01:45 line 40 handshake start
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter  end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter  end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter  end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter  end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter  end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter  end
2020/07/10 14:01:45 line 40 handshake finished
[      4.933811] <debug> tls {sni=dns-family.adguard.com next=[h2 http/1.1]}... <nil> in 251.452512ms {next=h2 cipher=TLS_AES_256_GCM_SHA384 v=TLSv1.3}
2020/07/10 14:01:45 byteconter read start

This second example is more problematic because we don't know for sure the state of the goroutines. But it seems we're after the TLS handshake, while in the other case above we're clearly reading the body.

bassosimone commented 3 years ago

The work required to fix this issue is non-trivial. It seems we need to do more work than just ensuring we have deadlines for every context we use. I suppose it's smarted to convert this issue into an epic and subdivide the whole amount of work in smaller tasks.

bassosimone commented 3 years ago

I have terminated my initial investigation (https://github.com/ooni/probe/issues/1406) and created issues for the next steps regarding ensuring that the codebase is more robust to network interference (https://github.com/ooni/probe/issues/1416 and https://github.com/ooni/probe/issues/1417).

bassosimone commented 3 years ago

I fixed this issue for code using the new transport. We will eventually migrate all code to use the new transport. So, we can call this issue ~done.