ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License

Timeouts, overlapped operations, and composability #2545

Open bassosimone opened 9 months ago

bassosimone commented 9 months ago

TL;DR

The work on beacons (see https://github.com/ooni/probe/issues/2531) led me to reflect on how a composable channel-based pattern could help us keep the DSL's composability and enable overlapped operations in light of our timeout policy, thus allowing us to use the available time more efficiently.

Background

My initial prototype for beacons (see https://github.com/ooni/probe/issues/2531) had this structure:

stateDiagram
  GenerateTactics --> UseTactics

Where:

In other words, if Resolver has this interface:

type Resolver interface {
  LookupHost(ctx context.Context, domain string) ([]string, error)
}

the GenerateTactics interface was:

type GenerateTactics interface {
  LookupTactics(ctx context.Context, domain string) ([]*Tactic, error)
}

This initial design stemmed from the observation that, by replacing a Resolver with a GenerateTactics, and by adapting the TCP and TLS dial accordingly, we could implement the desired beacons functionality.

In fact, the initial implementation of GenerateTactics was just a wrapper for a Resolver that converted the resolved IP addresses to tactics, and the initial implementation of UseTactics was refactored from a trivial loop that tries each available IP address with TCP connect and TLS handshake until one IP address works or all have failed.

However, quite soon I modified GenerateTactics to become:

type GenerateTactics interface {
  LookupTactics(ctx context.Context, domain string) <-chan *Tactic
}

This issue is here to explain (1) why I applied this change and (2) how we can stretch this design change to achieve beneficial outcomes in terms of efficiency (i.e., how many attempts we can pack in N seconds) and composability.

Efficiency

I applied this change because I wanted UseTactics to start running as soon as possible (i.e., using the already-known beacon addresses) without waiting for the underlying DNS lookup performed by GenerateTactics to either succeed or fail. My reasoning was that the first attempt could start right away while the DNS lookup was still in progress.

After thinking a bit more about this, I realized that, by applying this pattern systematically, we could pack more timeout-bound attempts into a fixed number of seconds, even factoring in happy eyeballs. (In this context, happy eyeballs is the process of staggering the tactics so that they do not all start immediately; crucially, though, we do not wait for attempt N to fail before starting attempt N+1.)

Let us now abstract from the specific use case I was working on, and focus instead on Web Connectivity LTE. There, we roughly have the following structure:

stateDiagram
    state DNSScheduler <<fork>>
    URLToMeasure --> DNSScheduler
    DNSScheduler --> DNSLookupGetaddrinfo
    DNSScheduler --> DNSLookupUDP
    DNSScheduler --> DNSLookupHTTPS
    state DNSBarrier <<join>>
    state EndpointMeasurer <<fork>>
    DNSLookupHTTPS --> DNSBarrier
    DNSLookupGetaddrinfo --> DNSBarrier
    DNSLookupUDP --> DNSBarrier
    DNSBarrier --> ScheduleEndpoints
    ScheduleEndpoints --> EndpointMeasurer
    EndpointMeasurer --> TCPConnect#1
    EndpointMeasurer --> TCPConnect#2
    TCPConnect#1 --> ...#1
    TCPConnect#2 --> ...#2

As you can see, endpoint measurements must wait for all three DNS lookups to complete. In light of timeouts, this reduces measurement efficiency. For example, if DNS over HTTPS times out, we likely wait four seconds, and that delay adds to any timeouts we may hit further down the line (e.g., during TCP connect).

Crucially, in DNSOverUDP we also want to check whether late replies, which are usually caused by censorship (the GFW, for example, works like this), return additional IP addresses. While we currently have support for collecting these late replies and including them as measurements in Web Connectivity v0.5, it is not very practical for the code to wait for them before returning IP addresses to the DNSBarrier state.

Imagine, instead, that there were no DNSBarrier, but just a channel that streams resolved IP addresses. In such a case, we would be able to start testing early. This means we could overlap more operations in the presence of timeouts and start measuring addresses from late replies (if they are not duplicates) as soon as they become available.

Composability

The DSL (./internal/dslx) composes functions; for example:

function := dslx.Compose(dslx.TCPConnect(), dslx.TLSHandshake())

creates a composed function that performs a TCP connect followed by a TLS handshake. Now, channels are also very composable in Go (and composing channels is probably as idiomatic as composing functions, if not more so).

So, this interface:

type Func[A, B any] interface {
  Run(ctx context.Context, input A) *Maybe[B]
}

could become something like:

type Pipeline[A, B any] interface {
  Run(ctx context.Context, input <-chan A) <-chan *Maybe[B]
}

While still being composable, this pattern has the benefit that we can have overlapped operations as mentioned above.

What we should do

The ./internal/dslx package should be refactored to use a channel-based pattern. This package is not heavily used yet, and I am still convinced we should use it to rewrite experiments because it has the functional property of letting us decouple the what from the how. We have also completed the work of writing good QA tests with netem, which means we are now well positioned to start rewriting tests using the DSL.

Refactoring the DSL to be channel-based before starting the rewrite is a good idea because it opens up the possibility, later on, of going down the stack and applying channel-based patterns to other building blocks (e.g., the DNS-over-UDP resolver, so that we can always deliver to a consumer the additional IP addresses discovered by parsing late DNS replies).