ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License
749 stars 142 forks source link

probe-engine: web_connectivity 0.5 DNS queries sometimes contains an empty hostname #2628

Closed hellais closed 5 months ago

hellais commented 7 months ago

In working on adding support for web connectivity 0.5 in ooni/data (https://github.com/ooni/data/issues/39) I noticed that some measurements don't contain the hostname field in the case of failed queries see for example: https://explorer.ooni.org/m/20231101133643.264648_BR_webconnectivity_39f7c5c2cea6cd96

Screenshot 2023-11-14 at 14 53 24

While we should fix it, it would also be good to have some pointers on what you believe should be the right way to handle these kinds of measurements in ooni/data to so we can fix the data to the extent that is possible.

bassosimone commented 7 months ago

The core issue here is that the redirect is for http://, which contains no domain. 🤔

bassosimone commented 7 months ago

We are going to write a test case for that and see what happens when using Web Connectivity v0.4.

bassosimone commented 5 months ago

With https://github.com/ooni/probe-cli/pull/1479, we can now reproduce the issue.

bassosimone commented 5 months ago

BTW, this issue is much more difficult than I expected so I'm bumping its complexity.

Here's a summary of my understanding so far:

  1. Web Connectivity v0.4 fails when redirecting with unknown_failure: http: no Host in request URL;
  2. Web Connectivity v0.5 fails with dns_nxdomain_error because the redirect URL has an empty domain.

We need to somehow find a way for the following to happen:

  1. the unknown failure should be mapped to a know failure so we don't mark the measurement as failed;
  2. we need to figure out the proper way to report error for v0.5, because obviously it's wrong now.
bassosimone commented 5 months ago

So, let's perhaps see what happens when we're using other programs and we get an http:// redirect.

Here's the test program:

package main

import (
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        http.Redirect(w, r, "http://", http.StatusFound)
    })

    http.ListenAndServe(":8080", nil)
}

Here's what happens with cURL:

curl --location -v http://127.0.0.1:8080/
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
> GET / HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.4.0
> Accept: */*
> 
< HTTP/1.1 302 Found
< Content-Type: text/html; charset=utf-8
< Location: http://
< Date: Wed, 31 Jan 2024 14:11:53 GMT
< Content-Length: 30
< 
* Ignoring the response-body
* Connection #0 to host 127.0.0.1 left intact
* The redirect target URL could not be parsed: No host part in the URL
curl: (3) The redirect target URL could not be parsed: No host part in the URL

It looks similar to how net/http fails.

This is instead what happens when using Firefox:

Screenshot 2024-01-31 at 15 29 03

These tests confirm my initial hypothesis that the following text from RFC 9110 Sect 10.2.2:

The "Location" header field is used in some responses to refer to a specific resource in relation to the response. The type of relationship is defined by the combination of request method and status code semantics.

Location = URI-reference

The field value consists of a single URI-reference. When it has the form of a relative reference ([URI], Section 4.2), the final value is computed by resolving it against the target URI ([URI], Section 5).

and the following text from RFC 3986 Sect 4.2:

A relative reference takes advantage of the hierarchical syntax (Section 1.2.3) to express a URI reference relative to the name space of another hierarchical URI.

  relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

  relative-part = "//" authority path-abempty
                / path-absolute
                / path-noscheme
                / path-empty

The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution algorithm of Section 5.

And in turn this other part of RFC 3986 Sect 3.2:

Many URI schemes include a hierarchical element for a naming authority so that governance of the name space defined by the remainder of the URI is delegated to that authority (which may, in turn, delegate it further). The generic syntax provides a common means for distinguishing an authority based on a registered name or server address, along with optional port and user information.

The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

authority = [ userinfo "@" ] host [ ":" port ]

all seem to suggest that it's unexpected for the host to be empty there.

bassosimone commented 5 months ago

Given all of the above, I conclude that Web Connectivity v0.4 behavior of failing is correct. Now, the only remaining task is to make sure that we make Web Connectivity v0.5 fail in such a way that we can correctly attribute the failure.