## TL;DR

This issue is about a bug where the `engineresolver` performs a lookup even if the context has already been canceled. This leads the component to believe that a given resolver is broken when, in fact, a parent-enforced timeout has just occurred. This is the root cause preventing ooniprobe from bootstrapping in certain scenarios, including the one described below. However, there is also a secondary issue in the code organization (discussed at the end).
## Scenario

Consider the case where OONI is performing a cold bootstrap. In such a case, the `$HOME/.ooniprobe/engine/` directory is empty, so we have no information about which resolvers are working as intended (WAI). Let us also assume that all the DNS-over-HTTPS resolvers we may want to use are censored. In this scenario, the bootstrap fails during the `iplookup`.
In this fictional scenario, you would see the following logs. I will guide you (as well as myself) through them.

We start by attempting a lookup using the `cloudflare` method. Below, I have reordered the logs (which are out of order because different goroutines may be scheduled at different times) to group related lookups together, and I have removed irrelevant logs that would hamper readability and the ability to understand what happens:
```
[ 0.279993] <info> iplookup: using cloudflare... started
[ 0.279918] <info> sessionresolver: lookup#1 www.cloudflare.com using https://dns.quad9.net/dns-query... started
[ 0.303753] <info> sessionresolver: lookup#1 www.cloudflare.com using https://dns.quad9.net/dns-query... connection_refused
[ 0.304111] <info> sessionresolver: lookup#2 www.cloudflare.com using https://dns.google/dns-query... started
[ 0.325122] <info> sessionresolver: lookup#2 www.cloudflare.com using https://dns.google/dns-query... connection_refused
[ 0.325293] <info> sessionresolver: lookup#3 www.cloudflare.com using https://mozilla.cloudflare-dns.com/dns-query... started
[ 0.352242] <info> sessionresolver: lookup#3 www.cloudflare.com using https://mozilla.cloudflare-dns.com/dns-query... connection_refused
[ 0.352597] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... started
[ 0.853615] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... in progress
[ 4.353289] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... generic_timeout_error
[ 4.353522] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... started
[ 4.854381] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... in progress
[ 7.280025] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... generic_timeout_error
[ 7.280284] <info> sessionresolver: lookup#6 www.cloudflare.com using https://cloudflare-dns.com/dns-query... started
[ 7.280443] <info> sessionresolver: lookup#6 www.cloudflare.com using https://cloudflare-dns.com/dns-query... generic_timeout_error
[ 7.280683] <info> sessionresolver: lookup#8 www.cloudflare.com using http3://mozilla.cloudflare-dns.com/dns-query... started
[ 7.281053] <info> sessionresolver: lookup#8 www.cloudflare.com using http3://mozilla.cloudflare-dns.com/dns-query... generic_timeout_error
[ 7.281274] <info> sessionresolver: lookup#9 www.cloudflare.com using system:///... started
[ 7.281305] <info> sessionresolver: lookup#9 www.cloudflare.com using system:///... generic_timeout_error
[ 7.279903] <info> iplookup: using cloudflare... Get "https://www.cloudflare.com/cdn-cgi/trace": generic_timeout_error
```
So, overall, the lookup lasts seven seconds. This is because we have a seven-second timeout which, according to a comment, was included as a reliability fix; it was introduced by https://github.com/ooni/probe-cli/pull/678, addressing https://github.com/ooni/probe/issues/1988 and https://github.com/ooni/probe/issues/2013.

You see that `lookup#1`, `lookup#2`, and `lookup#3` all fail with `connection_refused`. Instead, `lookup#4` fails with a `generic_timeout_error` caused by a four-second per-lookup timeout.
But `lookup#5` and `lookup#6` fail before we hit the four-second mark, because of the overall seven-second timeout that bounds the total runtime of the whole `iplookup` operation. Here we are clearly witnessing a conflict between different timeout policies. (See the end for more thoughts about the timeout conflict.)
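To make the conflict concrete, here is a minimal, runnable Go sketch (not the actual probe-cli code) of how nested `context.WithTimeout` calls interact: the child context always inherits the earlier of the two deadlines, so a per-lookup four-second timeout created five seconds into a seven-second operation only gets about two seconds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Overall budget bounding the whole iplookup-like operation.
	overall, cancel := context.WithTimeout(context.Background(), 7*time.Second)
	defer cancel()

	// Pretend earlier lookups already consumed five seconds.
	time.Sleep(5 * time.Second)

	// Per-lookup budget: nominally four seconds...
	perLookup, cancelLookup := context.WithTimeout(overall, 4*time.Second)
	defer cancelLookup()

	// ...but the effective deadline is the parent's, which is closer,
	// so this prints roughly two seconds, not four.
	deadline, _ := perLookup.Deadline()
	fmt.Println("time left for this lookup:", time.Until(deadline))
}
```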
Also, `lookup#8` and `lookup#9` have timestamps after the log line in which the `iplookup` component tells us it has given up; this is an instance of the slight log reordering due to goroutine scheduling that I mentioned above.
However, and this is particularly important, `lookup#9` uses the system resolver, which I manually tested and know to be working in this fictional network. Unfortunately, because the `engineresolver` component schedules lookups even when its context deadline has expired, the `engineresolver` concludes that the system resolver is not working as intended and marks it as broken. Had that not happened, we would at least have been able to proceed with the bootstrap. This reasoning is what led me to label this issue as the root cause of the bootstrap failure in this scenario.
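The kind of guard that would avoid this misclassification looks roughly like the sketch below. To be clear, this is a hypothetical illustration rather than the real `engineresolver` code: the `resolver` interface, `lookupAndScore`, and `markBroken` are made-up names standing in for the actual types and scoring logic.

```go
package sketch

import (
	"context"
	"errors"
)

// resolver is a stand-in for the real resolver interface.
type resolver interface {
	LookupHost(ctx context.Context, domain string) ([]string, error)
}

// lookupAndScore refuses to run (or score) a lookup whose context has
// already expired, so a parent-enforced timeout is not misread as
// evidence that the resolver itself is broken.
func lookupAndScore(ctx context.Context, reso resolver, domain string, markBroken func()) ([]string, error) {
	// If the parent already timed out or was canceled, this lookup would
	// tell us nothing about the resolver: bail out without scoring it.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	addrs, err := reso.LookupHost(ctx, domain)
	if err != nil {
		// Only penalize the resolver when the failure is its own, not a
		// consequence of the surrounding context having expired meanwhile.
		if !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
			markBroken()
		}
		return nil, err
	}
	return addrs, nil
}
```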
## Code organization considerations

We have several blocks of code that are synchronous and either return a result to the caller or time out; `model.Resolver` is a classic example. However, because of this code organization (which, in my defense, looks natural in Go and works fine when the network is not working against you), we need to enforce timeouts inside several code blocks to keep the overall runtime bounded. We should investigate whether a code organization based on pipelines returning information to callers through channels would be more robust and require fewer timeouts (see the sketch below). I am leaving this comment here just to document my thoughts while analyzing this issue, but this is probably a distinct task that may require opening and discussing a separate issue.

I covered this issue in more detail in https://github.com/ooni/probe/issues/2545.
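To sketch what I mean, here is one possible shape for such a pipeline, again using made-up names (`resolver`, `lookupAsync`, `await`) rather than anything from the real codebase: the lookup runs in its own goroutine and streams its result over a channel, so the deadline policy lives in exactly one place, namely the caller's `select`.

```go
package sketch

import "context"

// resolver is a stand-in for the real resolver interface.
type resolver interface {
	LookupHost(ctx context.Context, domain string) ([]string, error)
}

// lookupResult carries a lookup outcome through the pipeline.
type lookupResult struct {
	addrs []string
	err   error
}

// lookupAsync starts the lookup in its own goroutine; the buffered channel
// lets the worker deliver its result and exit even when the caller has
// already given up waiting.
func lookupAsync(ctx context.Context, reso resolver, domain string) <-chan lookupResult {
	out := make(chan lookupResult, 1)
	go func() {
		addrs, err := reso.LookupHost(ctx, domain)
		out <- lookupResult{addrs, err}
	}()
	return out
}

// await bounds the total wait with a single select on the context, rather
// than with a timeout buried inside the lookup itself.
func await(ctx context.Context, ch <-chan lookupResult) ([]string, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case r := <-ch:
		return r.addrs, r.err
	}
}
```

With this shape, a result that arrives after the caller has given up is simply never consumed, so nothing is tempted to score it against the resolver.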