Closed xhdix closed 3 years ago
I believe the reason why we're seeing this failure here is that there is no timeout. My recollection is also that the originally proposed fix was to add a timeout. A possible place where to put a timeout would be urlgetter
itself. We should implement this as a patch and then check again whether the issue is fixed. It may also be interesting to write a Jafar check for that.
This should be fixed by the work in https://github.com/ooni/probe-engine/pull/872. I have adopted a very minimal approach where the objective was to merely solve the problem at hand. A more complex approach could be implemented if needed.
Reopening because we've seen a relapse.
The code is blocked in saverSnapRead
, where it's not bounded by a context. It means we probably need to enforce using the context on a more deep level. It's not 100% clear what is the best way and most robust way of doing that. Setting a timeout for connections has the side effect that we are going to loose long living connections. We also need to ensure we apply this pattern everywhere where it could possibly happen that the injection of packets de-syncs the underlying network connection.
This is another instance in which the code failed:
[ 4.394232] <info> [1/1] running with input: dnslookup://example.com
[ 4.394916] <debug> no tunnel has been requested
[ 4.395333] <debug> resolve example.com...
[ 4.397165] <debug> > POST https://dns-family.adguard.com/dns-query
[ 4.397311] <debug> > Content-Type: application/dns-message
[ 4.397478] <debug> > Host: dns-family.adguard.com
[ 4.397619] <debug> > User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36
[ 4.397793] <debug> >
[ 4.398026] <debug> resolve dns-family.adguard.com...
[ 4.452800] <debug> resolve dns-family.adguard.com... ([176.103.130.134 176.103.130.132 2a00:5a60::bad1:ff 2a00:5a60::bad2:ff], <nil>) in 54.546417ms
[ 4.453265] <debug> dial 176.103.130.134:443/tcp...
[ 4.681474] <debug> dial 176.103.130.134:443/tcp... <nil> in 227.868657ms
[ 4.682093] <debug> tls {sni=dns-family.adguard.com next=[h2 http/1.1]}...
2020/07/10 14:01:45 line 40 handshake start
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter end
2020/07/10 14:01:45 byteconter read start
2020/07/10 14:01:45 byteconter end
2020/07/10 14:01:45 line 40 handshake finished
[ 4.933811] <debug> tls {sni=dns-family.adguard.com next=[h2 http/1.1]}... <nil> in 251.452512ms {next=h2 cipher=TLS_AES_256_GCM_SHA384 v=TLSv1.3}
2020/07/10 14:01:45 byteconter read start
This second example is more problematic because we don't know for sure the state of the goroutines. But it seems we're after the TLS handshake, while in the other case above we're clearly reading the body.
The work required to fix this issue is non-trivial. It seems we need to do more work than just ensuring we have deadlines for every context we use. I suppose it's smarted to convert this issue into an epic and subdivide the whole amount of work in smaller tasks.
I have terminated my initial investigation (https://github.com/ooni/probe/issues/1406) and created issues for the next steps regarding ensuring that the codebase is more robust to network interference (https://github.com/ooni/probe/issues/1416 and https://github.com/ooni/probe/issues/1417).
I fixed this issue for code using the new transport. We will eventually migrate all code to use the new transport. So, we can call this issue ~done.
Describe the bug In
urlgetter
(aladdin
), after finishing handshake, the program gets stuck, sometimes before the start of the receive and sometimes to complete the receivelogs terminated by
ctrl+\