## TL;DR

This issue is about a bug where the `engineresolver` performs a lookup even if the context has already been canceled. This leads the component to believe that a given resolver is broken when, in fact, a parent-enforced timeout has just occurred. This is the root cause preventing ooniprobe from bootstrapping in certain scenarios, including the one described below. However, there is also a secondary issue in the code organization (discussed at the end).
## Scenario

Consider the case where OONI is performing a cold bootstrap. In such a case, the `$HOME/.ooniprobe/engine/` directory is empty, so we have no information about which resolvers are working as intended (WAI). Let us also assume that all the DNS-over-HTTPS resolvers we may want to use are censored. In this scenario, the bootstrap fails during the `iplookup`.
In this fictional scenario, you would see the following logs. I will guide you (as well as myself) through them.

We start by attempting a lookup using the `cloudflare` method. Below, I have reordered the logs (which are out of order because different goroutines may be scheduled at different times) to group related lookups together, and I have removed irrelevant logs that would hamper readability and the ability to understand what happens:
```
[ 0.279993] <info> iplookup: using cloudflare... started
[ 0.279918] <info> sessionresolver: lookup#1 www.cloudflare.com using https://dns.quad9.net/dns-query... started
[ 0.303753] <info> sessionresolver: lookup#1 www.cloudflare.com using https://dns.quad9.net/dns-query... connection_refused
[ 0.304111] <info> sessionresolver: lookup#2 www.cloudflare.com using https://dns.google/dns-query... started
[ 0.325122] <info> sessionresolver: lookup#2 www.cloudflare.com using https://dns.google/dns-query... connection_refused
[ 0.325293] <info> sessionresolver: lookup#3 www.cloudflare.com using https://mozilla.cloudflare-dns.com/dns-query... started
[ 0.352242] <info> sessionresolver: lookup#3 www.cloudflare.com using https://mozilla.cloudflare-dns.com/dns-query... connection_refused
[ 0.352597] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... started
[ 0.853615] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... in progress
[ 4.353289] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... generic_timeout_error
[ 4.353522] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... started
[ 4.854381] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... in progress
[ 7.280025] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... generic_timeout_error
[ 7.280284] <info> sessionresolver: lookup#6 www.cloudflare.com using https://cloudflare-dns.com/dns-query... started
[ 7.280443] <info> sessionresolver: lookup#6 www.cloudflare.com using https://cloudflare-dns.com/dns-query... generic_timeout_error
[ 7.280683] <info> sessionresolver: lookup#8 www.cloudflare.com using http3://mozilla.cloudflare-dns.com/dns-query... started
[ 7.281053] <info> sessionresolver: lookup#8 www.cloudflare.com using http3://mozilla.cloudflare-dns.com/dns-query... generic_timeout_error
[ 7.281274] <info> sessionresolver: lookup#9 www.cloudflare.com using system:///... started
[ 7.281305] <info> sessionresolver: lookup#9 www.cloudflare.com using system:///... generic_timeout_error
[ 7.279903] <info> iplookup: using cloudflare... Get "https://www.cloudflare.com/cdn-cgi/trace": generic_timeout_error
```
So, overall, the lookup lasts seven seconds. This is because we have a seven-second timeout which, according to a comment, was included as a reliability fix; it was introduced by https://github.com/ooni/probe-cli/pull/678, addressing https://github.com/ooni/probe/issues/1988 and https://github.com/ooni/probe/issues/2013.

You see that `lookup#1`, `lookup#2`, and `lookup#3` all fail with `connection_refused`. Instead, `lookup#4` fails with a `generic_timeout_error` caused by a four-second per-lookup timeout.
But `lookup#5` and `lookup#6` fail before we hit the four-second mark, because of the overall seven-second timeout that bounds the total runtime of the whole `iplookup` operation. Here we are clearly witnessing a conflict between different timeout policies. (See the end for more thoughts about the timeout conflict.)
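To make the conflict concrete, here is a minimal, runnable Go sketch (not the actual probe-cli code) of how nested `context.WithTimeout` calls interact: the child context always inherits the earlier of the two deadlines, so a per-lookup four-second timeout created five seconds into a seven-second operation only gets about two seconds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Overall budget bounding the whole iplookup-like operation.
	overall, cancel := context.WithTimeout(context.Background(), 7*time.Second)
	defer cancel()

	// Pretend earlier lookups already consumed five seconds.
	time.Sleep(5 * time.Second)

	// Per-lookup budget: nominally four seconds...
	perLookup, cancelLookup := context.WithTimeout(overall, 4*time.Second)
	defer cancelLookup()

	// ...but the effective deadline is the parent's, which is closer,
	// so this prints roughly two seconds, not four.
	deadline, _ := perLookup.Deadline()
	fmt.Println("time left for this lookup:", time.Until(deadline))
}
```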
Also, `lookup#8` and `lookup#9` have timestamps after the log line in which the `iplookup` component tells us it has given up; this is an instance of the slight log reordering due to goroutine scheduling that I mentioned above.
However, and this is particularly important, `lookup#9` uses the system resolver, which I manually tested and know to be working in this fictional network. Unfortunately, because the `engineresolver` component schedules lookups even when its context deadline has expired, the `engineresolver` concludes that the system resolver is not working as intended and marks it as broken. Had that not happened, we would at least have been able to proceed with the bootstrap. This reasoning is what led me to label this issue as the root cause of the bootstrap failure in this scenario.
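The kind of guard that would avoid this misclassification looks roughly like the sketch below. To be clear, this is a hypothetical illustration rather than the real `engineresolver` code: the `resolver` interface, `lookupAndScore`, and `markBroken` are made-up names standing in for the actual types and scoring logic.

```go
package sketch

import (
	"context"
	"errors"
)

// resolver is a stand-in for the real resolver interface.
type resolver interface {
	LookupHost(ctx context.Context, domain string) ([]string, error)
}

// lookupAndScore refuses to run (or score) a lookup whose context has
// already expired, so a parent-enforced timeout is not misread as
// evidence that the resolver itself is broken.
func lookupAndScore(ctx context.Context, reso resolver, domain string, markBroken func()) ([]string, error) {
	// If the parent already timed out or was canceled, this lookup would
	// tell us nothing about the resolver: bail out without scoring it.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	addrs, err := reso.LookupHost(ctx, domain)
	if err != nil {
		// Only penalize the resolver when the failure is its own, not a
		// consequence of the surrounding context having expired meanwhile.
		if !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
			markBroken()
		}
		return nil, err
	}
	return addrs, nil
}
```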
## Code organization considerations

We have several blocks of code that are synchronous and either return a result to the caller or time out; `model.Resolver` is a classic example. However, because of this code organization (which, in my defense, looks natural in Go and works fine when the network is not working against you), we need to enforce timeouts inside several code blocks to keep the overall runtime bounded. We should investigate whether a code organization based on pipelines returning information to callers through channels would be more robust and require fewer timeouts (see the sketch below). I am leaving this comment here just to document my thoughts while analyzing this issue, but this is probably a distinct task that may require opening and discussing a separate issue.

I covered this issue in more detail in https://github.com/ooni/probe/issues/2545.
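To sketch what I mean, here is one possible shape for such a pipeline, again using made-up names (`resolver`, `lookupAsync`, `await`) rather than anything from the real codebase: the lookup runs in its own goroutine and streams its result over a channel, so the deadline policy lives in exactly one place, namely the caller's `select`.

```go
package sketch

import "context"

// resolver is a stand-in for the real resolver interface.
type resolver interface {
	LookupHost(ctx context.Context, domain string) ([]string, error)
}

// lookupResult carries a lookup outcome through the pipeline.
type lookupResult struct {
	addrs []string
	err   error
}

// lookupAsync starts the lookup in its own goroutine; the buffered channel
// lets the worker deliver its result and exit even when the caller has
// already given up waiting.
func lookupAsync(ctx context.Context, reso resolver, domain string) <-chan lookupResult {
	out := make(chan lookupResult, 1)
	go func() {
		addrs, err := reso.LookupHost(ctx, domain)
		out <- lookupResult{addrs, err}
	}()
	return out
}

// await bounds the total wait with a single select on the context, rather
// than with a timeout buried inside the lookup itself.
func await(ctx context.Context, ch <-chan lookupResult) ([]string, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case r := <-ch:
		return r.addrs, r.err
	}
}
```

With this shape, a result that arrives after the caller has given up is simply never consumed, so nothing is tempted to score it against the resolver.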