ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License

wcth: compare old and new test helpers #1707

Closed: bassosimone closed this issue 2 years ago

bassosimone commented 3 years ago

This issue is about fetching all the URLs in all the test lists with both helpers and comparing the results. We want to understand in which cases the new test helper differs from the old test helper and why.

You may want to jump straight to the conclusions: https://github.com/ooni/probe/issues/1707#issuecomment-945631485.

bassosimone commented 2 years ago

Here's a status update. In late July I ran a first-pass comparison of the whole content of the test lists between the old and the new test helper.

To this end, I used the code at https://github.com/ooni/wcthcheck.

The codename of this first comparison is config-0. Running the ./compare tool of the above-mentioned repository with config-0, we get:

{
  "dns/match/exactly": 19518,
  "dns/match/same_asn": 3362,
  "dns/match/same_org": 122,
  "dns/mismatch/cannot_map_to_asn": 6,
  "dns/mismatch/new_data_none_addrs": 47,
  "dns/mismatch/other": 431,
  "dns/total": 23486,

  "http_body_length/diff": 9333,
  "http_body_length/diff/new_th_larger": 6368,
  "http_body_length/diff/new_th_smaller": 2965,
  "http_body_length/total": 23480,

  "http_failure/match": 22363,
  "http_failure/mismatch": 1117,
  "http_failure/mismatch/new_is_none": 45,
  "http_failure/mismatch/old_is_none": 292,
  "http_failure/mismatch/other": 780,
  "http_failure/total": 23480,

  "http_headers/match/same_set": 19516,
  "http_headers/mismatch/new_is_none": 1037,
  "http_headers/mismatch/set_diff": 2927,
  "http_headers/total": 23480,

  "http_status_code/match": 21916,
  "http_status_code/mismatch": 1564,
  "http_status_code/total": 23480,

  "http_title/match/both_empty": 2502,
  "http_title/match/equal": 11180,
  "http_title/mismatch/new_empty": 749,
  "http_title/mismatch/old_empty": 935,
  "http_title/mismatch/old_th_smaller": 7683,
  "http_title/mismatch/other": 431,
  "http_title/mismatch/total": 8114,
  "http_title/total": 23480,

  "processing/match/has_both_measurements": 23486,
  "processing/mismatch/missing_newth_measurement": 7,
  "processing/mismatch/missing_oldth_measurement": 4478,
  "processing/succeeded": 27971,
  "processing/total": 27971,

  "tcp_connect/match": 23251,
  "tcp_connect/mismatch": 235,
  "tcp_connect/mismatch/added_ivp4": 60,
  "tcp_connect/mismatch/different_ivp4": 185,
  "tcp_connect/mismatch/removed_ivp4": 73,
  "tcp_connect/total": 23486
}
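The real classification logic lives in ooni/wcthcheck; as a rough sketch of how per-key tallies like the ones above can be produced (the key names mirror the ./compare output and the dict shapes follow the THs' JSON replies shown later in this thread, but the logic itself is illustrative, not the tool's actual code):

```python
from collections import Counter

def compare_measurement(old, new):
    """Yield compare-style keys for one (old TH, new TH) measurement pair.

    `old` and `new` are dicts shaped like the THs' JSON replies; the
    key names mirror the ./compare output, the logic is illustrative."""
    yield "http_status_code/total"
    if old["http_request"]["status_code"] == new["http_request"]["status_code"]:
        yield "http_status_code/match"
    else:
        yield "http_status_code/mismatch"
    yield "http_failure/total"
    old_f = old["http_request"]["failure"]
    new_f = new["http_request"]["failure"]
    if old_f == new_f:
        yield "http_failure/match"
    else:
        yield "http_failure/mismatch"
        if new_f is None:
            yield "http_failure/mismatch/new_is_none"
        elif old_f is None:
            yield "http_failure/mismatch/old_is_none"
        else:
            yield "http_failure/mismatch/other"

def tally(pairs):
    """Aggregate the keys over all measurement pairs, like ./compare does."""
    counts = Counter()
    for old, new in pairs:
        counts.update(compare_measurement(old, new))
    return dict(counts)
```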

After this initial scan, I started trying to figure out the reasons for these differences. Generally speaking, I looked into a subset of measurements to identify bugs.

The first issue I found is that the internal/cmd/oohelper command in ooni/probe-cli was stopping when the input URL's domain name was NXDOMAIN. This was fixed in https://github.com/ooni/probe-cli/pull/450.

After that, I started looking into the subset of URLs that present tcp_connect mismatches. This subset of URLs has been used for the config-1, config-2, config-3 and config-4 runs.

For this new comparison, I deployed the legacy backend on my local machine inside an ubuntu:16.04 Docker container. To do that smoothly I needed https://github.com/ooni/backend/pull/533 to pin working deps.

While investigating, there was some serendipity. I assumed the legacy backend was using 8.8.8.8:53/udp as a resolver, but instead it was using the system resolver. This led me to https://github.com/ooni/probe/issues/1780.

Subsequent changes to the new test helper landed at: https://github.com/ooni/probe-cli/pull/504.

This brings us to today. Here's the next-two-weeks agenda:

The latter will be done by censoring the network on the computer I am using via Jafar.

bassosimone commented 2 years ago

Written status update. We can now move this issue to Sprint 48.

bassosimone commented 2 years ago

(Effort tracking wise, this issue has now become a monster that I've been dragging along for some time now 😬)

bassosimone commented 2 years ago

Steps for testing the legacy probe

We need to run it inside a suitable Docker container:

docker run --network=host -it -v "$(pwd)":/ooni -w /ooni ubuntu:16.04

Once we're inside the container, we need to update apt:

apt update

Once we've done that step, we install the dependencies:

apt install python-pip libssl-dev

Then, we install the packages we need via pip:

pip install -r requirements.txt

Finally, we can run the backend:

./bin/oonib -c oonibackend.conf

This is the configuration file I am using:

helpers:
  daphn3:
    address: null
    pcap_file: null
    port: 57003
    yaml_file: null
  dns:
    address: null
    resolver_address: 8.8.8.8:53
    tcp_port: null
    udp_port: null
  dns_discovery:
    address: null
    resolver_address: null
    tcp_port: null
    udp_port: null
  http-return-json-headers:
    address: null
    port: null
    server_version: Apache
  ssl:
    address: null
    certificate: null
    private_key: null
    port: null
  tcp-echo:
    address: null
    port: null
  web_connectivity:
    endpoints:
      - {type: tcp, port: 9020 }

main:
  archive_dir: ./x/archive
  bouncer_file: null
  chroot: null
  debug: false
  deck_dir: ./x/decks
  euid: null
  gid: null
  input_dir: ./x/inputs
  logfile: ./x/oonibackend.log
  no_save: true
  nodaemon: true
  originalname: null
  pidfile: ./x/oonib.pid
  policy_file: null
  profile: null
  report_dir: ./x/raw_reports
  rundir: .
  socks_port: 9055
  stale_time: 3600
  tor2webmode: false
  tor_binary: null
  tor_datadir: ./tor
  tor_hidden_service: true
  uid: null
  umask: null

  bouncer_endpoints: []
  collector_endpoints: []

  report_file_template: '{iso8601_timestamp}-{test_name}-{report_id}-{probe_asn}-{probe_cc}-probe-0.2.0.{ext}'
bassosimone commented 2 years ago

Digging into DNS differences

I have experimentally determined that we need to ensure the legacy and the new TH use exactly the same DNS. My initial attempt to solve this problem was to force 8.8.8.8:53 as the legacy TH resolver and also to disable any caching:

diff --git a/oonib/testhelpers/http_helpers.py b/oonib/testhelpers/http_helpers.py
index c012c00..16aee09 100644
--- a/oonib/testhelpers/http_helpers.py
+++ b/oonib/testhelpers/http_helpers.py
@@ -36,6 +36,9 @@ from oonib.common.tcp_utils import TCPConnectFactory
 from oonib.handlers import OONIBHandler

+DNS_CLIENT = dns_client.createResolver(servers=[("8.8.8.8", 53)])
+
+
 class SimpleHTTPChannel(basic.LineReceiver, policies.TimeoutMixin):
     """
     This is a simplified version of twisted.web.http.HTTPChannel to overcome
@@ -220,7 +223,7 @@ def encodeResponses(response):

 class WebConnectivityCache(object):
     expiration_time = 200
-    enable_caching = True
+    enable_caching = False
     http_retries = 2

     def __init__(self):
@@ -420,7 +423,7 @@ class WebConnectivityCache(object):
         }

         try:
-            records = yield dns_client.lookupAddress(hostname)
+            records = yield DNS_CLIENT.lookupAddress(hostname)
             answers = records[0]
             for answer in answers:
                 if answer.type is dns.A:

Yet, even with this setup, a small set of URLs still returned different DNS replies. Therefore, I eventually chose to force the two to use the same DNS by running jafar as follows:

sudo ./jafar -iptables-hijack-dns-to 127.0.0.1:53

This hijacks DNS traffic to jafar's own resolver, which at least behaves consistently.

We're now at config-6.json and the list of input URLs has become:

http://97dounai.top/
http://makemodel.net/
http://www.bjnews.com.cn/
http://www.isa.gov.il/
http://www.isa.gov.il/Pages/default.aspx
http://www.isa.gov.il/Pages/default.aspx/
http://www.isa.gov.il/sites/ISAEng/Pages/default.aspx/
http://www.kamayutmedia.com/
http://zheg.nastie.co.uk/
https://www.sie.gov.hk/tc/
https://www.terredeshommes.nl/

The URLs above are the ones for which we persistently saw DNS differences. Now that I have enforced exactly the same DNS using jafar, this is the result of testing the above list:

{
  "dns/match/exactly": 10,
  "dns/mismatch/other": 1,
  "dns/total": 11,
  "http_body_length/diff": 4,
  "http_body_length/diff/new_th_larger": 4,
  "http_body_length/total": 11,
  "http_failure/match": 10,
  "http_failure/mismatch": 1,
  "http_failure/mismatch/other": 1,
  "http_failure/total": 11,
  "http_headers/match/same_set": 10,
  "http_headers/mismatch/new_is_none": 1,
  "http_headers/total": 11,
  "http_status_code/match": 11,
  "http_status_code/total": 11,
  "http_title/match/both_empty": 5,
  "http_title/match/equal": 1,
  "http_title/mismatch/new_empty": 3,
  "http_title/mismatch/old_th_smaller": 2,
  "http_title/mismatch/total": 2,
  "http_title/total": 11,
  "processing/match/has_both_measurements": 11,
  "processing/succeeded": 11,
  "processing/total": 11,
  "tcp_connect/match": 10,
  "tcp_connect/mismatch": 1,
  "tcp_connect/mismatch/added_ivp4": 8,
  "tcp_connect/total": 11
}

Where the only remaining DNS mismatch is the following:

dns/diffs http://97dounai.top/ 
    only_old_data ['108.160.173.207@AS19679'] 
    only_new_data ['31.13.83.34@AS32934']

We can explain the above mismatch with https://github.com/ooni/probe/issues/1780.
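The `addr@ASN` rendering in the diff above can be sketched as follows (the real tool lives in ooni/wcthcheck; `asn_of` is a hypothetical IP-to-ASN lookup, e.g. backed by an MMDB database):

```python
def annotate(addrs, asn_of):
    """Render addresses in the `addr@ASN` form used by the diff output;
    asn_of is a hypothetical IP-to-ASN lookup helper."""
    return sorted("%s@AS%d" % (addr, asn_of(addr)) for addr in addrs)

def dns_diff(url, old_addrs, new_addrs, asn_of):
    """Return the addresses seen by only one TH, or None on an exact match."""
    only_old = set(old_addrs) - set(new_addrs)
    only_new = set(new_addrs) - set(old_addrs)
    if not only_old and not only_new:
        return None
    return {
        "url": url,
        "only_old_data": annotate(only_old, asn_of),
        "only_new_data": annotate(only_new, asn_of),
    }
```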

Having mostly settled the DNS case, we now need to look into other differences. But it's fair to say we should probably also run the oohelper client on the same host, so as to really enforce the same DNS for everyone. (Sadly this did not occur to me earlier; I guess I was a bit too optimistic about the possibility of enforcing the same DNS without using force.)

bassosimone commented 2 years ago

Running oohelper in the box that is also running jafar (see https://github.com/ooni/probe/issues/1707#issuecomment-942132684) and applying https://github.com/ooni/probe-cli/pull/541 and https://github.com/ooni/probe-cli/pull/542, reduced the differences JSON for config-6.json to:

{
  "dns/match/exactly": 11,
  "dns/total": 11,
  "http_body_length/diff": 4,
  "http_body_length/diff/new_th_smaller": 4,
  "http_body_length/total": 11,
  "http_failure/match": 10,
  "http_failure/mismatch": 1,
  "http_failure/mismatch/other": 1,
  "http_failure/total": 11,
  "http_headers/match/same_set": 11,
  "http_headers/total": 11,
  "http_status_code/match": 11,
  "http_status_code/total": 11,
  "http_title/match/both_empty": 5,
  "http_title/match/equal": 1,
  "http_title/mismatch/old_th_smaller": 5,
  "http_title/mismatch/total": 5,
  "http_title/total": 11,
  "processing/match/has_both_measurements": 11,
  "processing/succeeded": 11,
  "processing/total": 11,
  "tcp_connect/match": 10,
  "tcp_connect/mismatch": 1,
  "tcp_connect/mismatch/added_ivp4": 1,
  "tcp_connect/mismatch/removed_ivp4": 1,
  "tcp_connect/total": 11
}

It seems the differences at this point boil down to:

http failure mismatch for http://97dounai.top/: old=connection_refused_error => new=connection_refused
tcp connect mismatch http://www.isa.gov.il/Pages/default.aspx ***' {'212.68.130.207:80': {'status': True, 'failure': None}} '*** ***' {'194.90.36.207:80': {'status': True, 'failure': None}} '***

Maybe a way to additionally avoid noise from flipping IP addresses is to add caching to Jafar: I did not expect this flip for www.isa.gov.il. OTOH, maybe this is not needed, and we would end up spending too much time on this comparison?
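If we did add such caching, a minimal sketch (hypothetical, not Jafar's actual code) would be to pin DNS answers so that repeated lookups always return the first answer seen:

```python
class CachingResolver:
    """Pin DNS answers: repeated lookups for the same name always return
    the first answer seen (hypothetical sketch, not Jafar's actual code)."""

    def __init__(self, resolve):
        self._resolve = resolve  # callable: name -> list of addresses
        self._cache = {}

    def lookup(self, name):
        if name not in self._cache:
            self._cache[name] = self._resolve(name)
        return self._cache[name]
```

Wrapping Jafar's upstream resolver like this would make every client of the hijacked DNS see the same IP for www.isa.gov.il across the whole run.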

So, the next step is to perform more A/B testing, ensuring the new TH emits the same error strings as the old TH, which is something we've not been testing extensively so far.

bassosimone commented 2 years ago

Making sure the errors are the same

To approach this problem, we'll basically try to map errors from netxlite to the strings emitted by the old TH. To understand which strings the old TH emits, we need to read its source code.

The code that matters is at https://github.com/ooni/backend/blob/6ec4fda5b18/oonib/testhelpers/http_helpers.py. There are three try...except blocks of interest: one for DNS, one for TCP connect, and one for HTTP.

I have implemented a diff (soon to be committed along with tests) that maps netxlite errors to TH error strings. Let's see whether this diff works for us by running some manual tests.
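The actual mapping is Go code inside ooni/probe-cli; as an illustrative sketch of the idea, a lookup table with an "unknown_error" fallback (the specific netxlite names and table entries beyond the strings appearing in this thread are assumptions):

```python
# Hypothetical sketch: the real mapping is Go code in ooni/probe-cli.
# Left side: netxlite failure strings; right side: strings the old TH
# emits (read off oonib/testhelpers/http_helpers.py).
NETXLITE_TO_LEGACY_TH = {
    "dns_nxdomain_error": "dns_name_error",
    "connection_refused": "connection_refused_error",
    "generic_timeout_error": "generic_timeout_error",
}

def legacy_failure(netxlite_failure):
    """Map a netxlite failure to the old-TH string, falling back to
    "unknown_error" like the old TH does for unhandled exceptions."""
    if netxlite_failure is None:
        return None
    return NETXLITE_TO_LEGACY_TH.get(netxlite_failure, "unknown_error")
```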

DNS checks

Result in case of NXDOMAIN

Let us cause NXDOMAIN using jafar like:

./jafar -iptables-hijack-dns-to 127.0.0.1:53 -dns-proxy-block example.com

Let us now call the old and the new THs running locally using http://example.com as input.

Here's the old TH's output:

{
    "tcp_connect": {
        "93.184.216.34:80": {
            "status": true,
            "failure": null
        },
        "[2606:2800:220:1:248:1893:25c8:1946]:80": {
            "status": false,
            "failure": "invalid_socket"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "dns_lookup_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "dns_name_error",
        "addrs": []
    }
}

And here is the new TH result:

{
    "tcp_connect": {
        "93.184.216.34:80": {
            "status": true,
            "failure": null
        },
        "[2606:2800:220:1:248:1893:25c8:1946]:80": {
            "status": false,
            "failure": "connect_error"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "dns_lookup_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "dns_name_error",
        "addrs": []
    }
}

In conclusion: ✔️

Result in case of DNS timeout

./jafar -iptables-hijack-dns-to 127.0.0.1:53 -dns-proxy-ignore example.com

Old TH:

{
    "tcp_connect": {
        "93.184.216.34:80": {
            "status": true,
            "failure": null
        },
        "[2606:2800:220:1:248:1893:25c8:1946]:80": {
            "status": false,
            "failure": "invalid_socket"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "dns_lookup_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

New TH:

{
    "tcp_connect": {
        "93.184.216.34:80": {
            "status": true,
            "failure": null
        },
        "[2606:2800:220:1:248:1893:25c8:1946]:80": {
            "status": false,
            "failure": "connect_error"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "generic_timeout_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

Conclusion: 〰️ (we have the same error for dns but not for http, because the new code returns a generic timeout error when the timeout occurs during the DNS phase; I am going to consider this good enough, given that new probes do not really seem to use the control failure in any way)

TCP

Here we're going to use http://dns.google and its 8.8.8.8:80 and 8.8.4.4:80 endpoints.

Connection refused

./jafar -iptables-reset-ip 8.8.8.8 -iptables-reset-ip 8.8.4.4

Old TH:

{
    "tcp_connect": {
        "8.8.4.4:80": {
            "status": false,
            "failure": "connection_refused_error"
        },
        "8.8.8.8:80": {
            "status": false,
            "failure": "connection_refused_error"
        },
        "[2001:4860:4860::8844]:80": {
            "status": false,
            "failure": "invalid_socket"
        },
        "[2001:4860:4860::8888]:80": {
            "status": false,
            "failure": "invalid_socket"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "connection_refused_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": null,
        "addrs": [
            "8.8.8.8",
            "8.8.4.4"
        ]
    }
}

New TH:

{
    "tcp_connect": {
        "8.8.4.4:80": {
            "status": false,
            "failure": "connection_refused_error"
        },
        "8.8.8.8:80": {
            "status": false,
            "failure": "connection_refused_error"
        },
        "[2001:4860:4860::8844]:80": {
            "status": false,
            "failure": "connect_error"
        },
        "[2001:4860:4860::8888]:80": {
            "status": false,
            "failure": "connect_error"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "connection_refused_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": null,
        "addrs": [
            "8.8.8.8",
            "8.8.4.4",
            "2001:4860:4860::8888",
            "2001:4860:4860::8844"
        ]
    }
}

Conclusion: ✔️

Connect timed out

./jafar -iptables-drop-ip 8.8.8.8 -iptables-drop-ip 8.8.4.4

Old TH:

{
    "tcp_connect": {
        "8.8.4.4:80": {
            "status": false,
            "failure": "generic_timeout_error"
        },
        "8.8.8.8:80": {
            "status": false,
            "failure": "generic_timeout_error"
        },
        "[2001:4860:4860::8844]:80": {
            "status": false,
            "failure": "invalid_socket"
        },
        "[2001:4860:4860::8888]:80": {
            "status": false,
            "failure": "invalid_socket"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "generic_timeout_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

New TH:

{
    "tcp_connect": {
        "8.8.4.4:80": {
            "status": false,
            "failure": "generic_timeout_error"
        },
        "8.8.8.8:80": {
            "status": false,
            "failure": "generic_timeout_error"
        },
        "[2001:4860:4860::8844]:80": {
            "status": false,
            "failure": "connect_error"
        },
        "[2001:4860:4860::8888]:80": {
            "status": false,
            "failure": "connect_error"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "generic_timeout_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

The DNS result is an unfortunate outcome of Jafar being too broad.

Conclusion: ✔️

TLS

We continue to use dns.google but we try to filter the TLS handshake.

Connection reset

./jafar -iptables-reset-keyword google

The old TH resets the connection and we therefore see the following error in the oohelper client:

panic: client.Do failed: Post "http://192.168.1.6:9020/": connection_reset

goroutine 1 [running]:
github.com/ooni/probe-cli/v3/internal/runtimex.PanicOnError(...)
    /Users/sbs/src/github.com/bassosimone/ooniwcthcheck/probe-cli/internal/runtimex/runtimex.go:10
main.wcth()
    /Users/sbs/src/github.com/bassosimone/ooniwcthcheck/probe-cli/internal/cmd/oohelper/oohelper.go:92 +0x1b8
main.main()
    /Users/sbs/src/github.com/bassosimone/ooniwcthcheck/probe-cli/internal/cmd/oohelper/oohelper.go:60 +0x2b0

The new TH returns:

{
    "tcp_connect": {
        "8.8.4.4:443": {
            "status": true,
            "failure": null
        },
        "8.8.8.8:443": {
            "status": true,
            "failure": null
        },
        "[2001:4860:4860::8844]:443": {
            "status": false,
            "failure": "connect_error"
        },
        "[2001:4860:4860::8888]:443": {
            "status": false,
            "failure": "connect_error"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "unknown_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": null,
        "addrs": [
            "8.8.8.8",
            "8.8.4.4",
            "2001:4860:4860::8844",
            "2001:4860:4860::8888"
        ]
    }
}

Conclusion: ✔️ (new TH works better when this happens)

Handshake timeout

It seems we cannot easily implement this test case for Jafar at the moment. (Maybe circle back later?)

Conclusion: 😿

Invalid certificate

Here we're using https://expired.badssl.com as the test case without any Jafar usage.

Old TH:

{
    "tcp_connect": {
        "104.154.89.105:443": {
            "status": true,
            "failure": null
        }
    },
    "http_request": {
        "body_length": 494,
        "failure": null,
        "title": "expired.badssl.com",
        "headers": {
            "Cache-Control": "no-store",
            "Content-Type": "text/html",
            "Date": "Wed, 13 Oct 2021 13:00:58 GMT",
            "ETag": "W/\"6164d5db-1ee\"",
            "Last-Modified": "Tue, 12 Oct 2021 00:24:59 GMT",
            "Server": "nginx/1.10.3 (Ubuntu)"
        },
        "status_code": 200
    },
    "dns": {
        "failure": null,
        "addrs": [
            "104.154.89.105"
        ]
    }
}

New TH:

{
    "tcp_connect": {
        "104.154.89.105:443": {
            "status": true,
            "failure": null
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "unknown_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": null,
        "addrs": [
            "104.154.89.105"
        ]
    }
}

Conclusion: ✔️ (it's not so good that the legacy TH does not verify certificates, see https://github.com/ooni/probe/issues/1828)

HTTP

We have basically already tested all the relevant cases above.

Conclusion

The new TH matches the old TH in most cases, with some residual differences. Most probes nowadays are new probes, and they do not match on old TH strings except for "dns_name_error", which we're now handling.

bassosimone commented 2 years ago

So, what remains now is to submit the code that maps errors, along with some unit tests. Then we also need to continue running the tool on a broader set of inputs and circle back to figure out what other differences remain.

bassosimone commented 2 years ago

Further improvements to error mapping

The main issue identified above is that we do not correctly map an error happening during the DNS lookup performed as part of the HTTP request, and there was a lingering concern that this may have confused legacy probes. Now I have a new diff; let's test it.

./jafar -iptables-hijack-dns-to 127.0.0.1:53 -dns-proxy-ignore example.com

Old TH:

{
    "tcp_connect": {
        "93.184.216.34:80": {
            "status": true,
            "failure": null
        },
        "[2606:2800:220:1:248:1893:25c8:1946]:80": {
            "status": false,
            "failure": "invalid_socket"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "dns_lookup_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

New TH:

{
    "tcp_connect": {
        "93.184.216.34:80": {
            "status": true,
            "failure": null
        },
        "[2606:2800:220:1:248:1893:25c8:1946]:80": {
            "status": false,
            "failure": "connect_error"
        }
    },
    "http_request": {
        "body_length": -1,
        "failure": "dns_lookup_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

So, this extra improvement removes the biggest concern we had in https://github.com/ooni/probe/issues/1707#issuecomment-942283746

bassosimone commented 2 years ago

PR https://github.com/ooni/probe-cli/pull/543 contains the fixes mentioned in https://github.com/ooni/probe/issues/1707#issuecomment-942283746 and https://github.com/ooni/probe/issues/1707#issuecomment-942341255

bassosimone commented 2 years ago

Understanding larger old TH bodies

With these URLs:

http://97dounai.top/
http://makemodel.net/
http://www.bjnews.com.cn/
http://www.isa.gov.il/
http://www.isa.gov.il/Pages/default.aspx
http://www.isa.gov.il/Pages/default.aspx/
http://www.isa.gov.il/sites/ISAEng/Pages/default.aspx/
http://www.kamayutmedia.com/
http://zheg.nastie.co.uk/
https://www.sie.gov.hk/tc/
https://www.terredeshommes.nl/

and the above setting with Jafar enforcing DNS, we're now down to:

{
  "dns/match/exactly": 11,
  "dns/total": 11,
  "http_body_length/mismatch": 4,
  "http_body_length/mismatch/new_th_smaller": 4,
  "http_body_length/total": 10,
  "http_failure/match": 10,
  "http_failure/total": 10,
  "http_headers/match/same_set": 10,
  "http_headers/total": 10,
  "http_status_code/match": 10,
  "http_status_code/total": 10,
  "http_title/match/both_empty": 4,
  "http_title/match/equal": 1,
  "http_title/mismatch/old_th_smaller": 5,
  "http_title/mismatch/total": 5,
  "http_title/total": 10,
  "processing/match/has_both_measurements": 11,
  "processing/succeeded": 11,
  "processing/total": 11,
  "tcp_connect/match": 11,
  "tcp_connect/total": 11
}

We can explain the old TH having smaller titles by the fact that the old TH strips non-Latin chars: ✔️
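As an illustration of this effect (the regexes below are hypothetical, not the legacy TH's actual code), a title extractor whose character class only admits printable ASCII truncates at the first non-Latin character, yielding a shorter title than one that keeps the whole string:

```python
import re

def old_th_style_title(body):
    """Illustrative: a character class limited to printable ASCII stops
    at the first non-Latin character, yielding a shorter title."""
    m = re.search(r"<title>([\x20-\x7e]*)", body)
    return m.group(1) if m else ""

def new_th_style_title(body):
    """Illustrative: the whole title, regardless of script."""
    m = re.search(r"<title>([^<]*)</title>", body)
    return m.group(1) if m else ""
```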

We don't understand the difference in the body length: ❌

These are the bodies that differ:

body length diff for http://makemodel.net/: old=1509 => new=1441
body length diff for http://www.bjnews.com.cn/: old=196611 => new=97113
body length diff for http://www.kamayutmedia.com/: old=1537 => new=1469
body length diff for http://zheg.nastie.co.uk/: old=1525 => new=1457

Let's investigate them!

Let's first patch ooni/probe-cli to see the bodies returned by the legacy TH:

commit 802a36c0383e30c0406c6d08b38af86e0d1b44a5
Author: Simone Basso <bassosimone@gmail.com>
Date:   Thu Oct 14 14:37:35 2021 +0200

    feat(wcth client): also read the response bodies

diff --git a/internal/engine/experiment/webconnectivity/control.go b/internal/engine/experiment/webconnectivity/control.go
index 23c4e06..e887d4f 100644
--- a/internal/engine/experiment/webconnectivity/control.go
+++ b/internal/engine/experiment/webconnectivity/control.go
@@ -27,11 +27,19 @@ type ControlTCPConnectResult struct {
 // ControlHTTPRequestResult is the result of the HTTP request
 // performed by the control vantage point.
 type ControlHTTPRequestResult struct {
+   BodyLength int64                 `json:"body_length"`
+   Responses  []ControlHTTPResponse `json:"responses"`
+   Failure    *string               `json:"failure"`
+   Title      string                `json:"title"`
+   Headers    map[string]string     `json:"headers"`
+   StatusCode int64                 `json:"status_code"`
+}
+
+type ControlHTTPResponse struct {
    BodyLength int64             `json:"body_length"`
-   Failure    *string           `json:"failure"`
-   Title      string            `json:"title"`
    Headers    map[string]string `json:"headers"`
-   StatusCode int64             `json:"status_code"`
+   Code       int64             `json:"code"`
+   Body       string            `json:"body"`
 }

 // ControlDNSResult is the result of the DNS lookup

Now let's patch the legacy backend to return such bodies:

diff --git a/oonib/testhelpers/http_helpers.py b/oonib/testhelpers/http_helpers.py
index c012c00..48bf35e 100644
--- a/oonib/testhelpers/http_helpers.py
+++ b/oonib/testhelpers/http_helpers.py
@@ -36,6 +36,9 @@ from oonib.common.tcp_utils import TCPConnectFactory
 from oonib.handlers import OONIBHandler

+DNS_CLIENT = dns_client.createResolver(servers=[("8.8.8.8", 53)])
+
+
 class SimpleHTTPChannel(basic.LineReceiver, policies.TimeoutMixin):
     """
     This is a simplified version of twisted.web.http.HTTPChannel to overcome
@@ -309,7 +312,7 @@ class WebConnectivityCache(object):
         defer.returnValue(value)

     @defer.inlineCallbacks
-    def http_request(self, url, http_request_headers, include_http_responses=False):
+    def http_request(self, url, http_request_headers, include_http_responses=True):
         key = url + json.dumps(http_request_headers)
         cached_value = yield self.lookup('http_request', key)
         if cached_value is not None:
@@ -373,8 +376,6 @@ class WebConnectivityCache(object):
             log.exception(exc)

         yield self.cache_value('http_request', key, page_info)
-        if include_http_responses is not True:
-            page_info.pop('responses', None)
         defer.returnValue(page_info)

     @defer.inlineCallbacks
@@ -420,7 +421,7 @@ class WebConnectivityCache(object):
         }

         try:
-            records = yield dns_client.lookupAddress(hostname)
+            records = yield DNS_CLIENT.lookupAddress(hostname)
             answers = records[0]
             for answer in answers:
                 if answer.type is dns.A:
@@ -462,7 +463,7 @@ class WebConnectivity(OONIBHandler):

         hostname = urlparse(http_url).netloc
         dl = [
-            web_connectivity_cache.http_request(http_url, http_request_headers, include_http_responses),
+            web_connectivity_cache.http_request(http_url, http_request_headers, True),
             web_connectivity_cache.dns_consistency(hostname)
         ]
         for socket in socket_list:
@@ -526,7 +527,7 @@ class WebConnectivity(OONIBHandler):
             self.validate_request(request)
             include_http_responses = request.get(
                     "include_http_responses",
-                    False
+                    True
             )

             # We convert headers to str so twisted is happy (unicode triggers
@@ -536,7 +537,7 @@ class WebConnectivity(OONIBHandler):
                 http_request_headers[str(k)] = map(str, v)
             self.control_measurement(
                 http_url=str(request['http_request']),
-                include_http_responses=include_http_responses,
+                include_http_responses=True,
                 http_request_headers=http_request_headers,
                 socket_list=request['tcp_connect'],
                 invalid_sockets=invalid_sockets

Now we're equipped to use the oohelper client to see the bodies according to the legacy TH!

makemodel.net

Body returned by the legacy TH:

"\u003c!doctype html\u003e\u003chtml lang=\"en\" data-adblockkey=\"MFwwDQYJKoZIhvcNAQEBBQADSwAwSAJBANDrp2lz7AOmADaN8tA50LsWcjLFyQFcb/P2Txc58oYOeILb3vBw7J6f4pamkAQVSQuqYsKx3YzdUHCvbVZvFUsCAwEAAQ==_G46S+eVhBq7+En4flzoVQE35p5nreOinT5O16OdUvHN/YR249BHQAAowJE1flUrTbNbCZypbKnzTY4OpFAM75g==\"\u003e\u003chead\u003e\u003cmeta charset=\"utf-8\"\u003e\u003cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1\"\u003e\u003clink rel=\"shortcut icon\" href=\"/favicon.ico\" type=\"image/x-icon\"/\u003e\u003clink rel=\"preconnect\" href=\"https://www.google.com\" crossorigin\u003e\u003clink rel=\"dns-prefetch\" href=\"https://parking.bodiscdn.com\" crossorigin\u003e\u003clink rel=\"dns-prefetch\" href=\"https://fonts.googleapis.com\" crossorigin\u003e\u003c/head\u003e\u003cbody\u003e\u003cdiv id=\"target\" style='opacity: 0'\u003e\u003c/div\u003e\u003cscript\u003ewindow.park = \"eyJ1dWlkIjoiMzcxNDM5Y2YtYzZmYy02YTg5LWJhMjctNWYyMGEzNTE4OGYxIiwicGFnZV90aW1lIjoxNjM0MjE1NDUwLCJwYWdlX3VybCI6Imh0dHA6XC9cL21ha2Vtb2RlbC5uZXRcLyIsInBhZ2VfbWV0aG9kIjoiR0VUIiwicGFnZV9yZXF1ZXN0IjpbXSwicGFnZV9oZWFkZXJzIjp7InVzZXItYWdlbnQiOlsiTW96aWxsYVwvNS4wIChXaW5kb3dzIE5UIDEwLjA7IFdpbjY0OyB4NjQpIEFwcGxlV2ViS2l0XC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBDaHJvbWVcLzkwLjAuNDQzMC45MyBTYWZhcmlcLzUzNy4zNiJdLCJhY2NlcHQiOlsidGV4dFwvaHRtbCxhcHBsaWNhdGlvblwveGh0bWwreG1sLGFwcGxpY2F0aW9uXC94bWw7cT0wLjksKlwvKjtxPTAuOCJdLCJhY2NlcHQtZW5jb2RpbmciOlsiZ3ppcCJdLCJhY2NlcHQtbGFuZ3VhZ2UiOlsiZW4tVVM7cT0wLjgsZW47cT0wLjUiXSwiaG9zdCI6WyJtYWtlbW9kZWwubmV0Il0sImNvbm5lY3Rpb24iOlsiY2xvc2UiXX0sImhvc3QiOiJtYWtlbW9kZWwubmV0IiwiaXAiOiIyLjM0LjI1Ljk0In0=\";\u003c/script\u003e\u003cscript src=\"/js/parking.2.72.0.js\"\u003e\u003c/script\u003e\u003c/body\u003e\u003c/html\u003e"

It's indeed 1534 bytes also according to echo $body|wc.

To see the body according to the new TH, we brutally use miniooni -ni $url urlgetter. Here's the body:

"<!doctype html><html lang=\"en\" data-adblockkey=\"MFwwDQYJKoZIhvcNAQEBBQADSwAwSAJBANDrp2lz7AOmADaN8tA50LsWcjLFyQFcb/P2Txc58oYOeILb3vBw7J6f4pamkAQVSQuqYsKx3YzdUHCvbVZvFUsCAwEAAQ==_G46S+eVhBq7+En4flzoVQE35p5nreOinT5O16OdUvHN/YR249BHQAAowJE1flUrTbNbCZypbKnzTY4OpFAM75g==\"><head><meta charset=\"utf-8\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><link rel=\"shortcut icon\" href=\"/favicon.ico\" type=\"image/x-icon\"/><link rel=\"preconnect\" href=\"https://www.google.com\" crossorigin><link rel=\"dns-prefetch\" href=\"https://parking.bodiscdn.com\" crossorigin><link rel=\"dns-prefetch\" href=\"https://fonts.googleapis.com\" crossorigin></head><body><div id=\"target\" style='opacity: 0'></div><script>window.park = \"eyJ1dWlkIjoiOWVhZmE2MDItOGU4OC1lNWYxLTliYzItYThiYzAxYmExY2JkIiwicGFnZV90aW1lIjoxNjM0MjE1NjU2LCJwYWdlX3VybCI6Imh0dHA6XC9cL21ha2Vtb2RlbC5uZXRcLyIsInBhZ2VfbWV0aG9kIjoiR0VUIiwicGFnZV9yZXF1ZXN0IjpbXSwicGFnZV9oZWFkZXJzIjp7ImFjY2VwdC1sYW5ndWFnZSI6WyJlbi1VUztxPTAuOCxlbjtxPTAuNSJdLCJhY2NlcHQiOlsidGV4dFwvaHRtbCxhcHBsaWNhdGlvblwveGh0bWwreG1sLGFwcGxpY2F0aW9uXC94bWw7cT0wLjksKlwvKjtxPTAuOCJdLCJ1c2VyLWFnZW50IjpbIk1vemlsbGFcLzUuMCAoV2luZG93cyBOVCAxMC4wOyBXaW42NDsgeDY0KSBBcHBsZVdlYktpdFwvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lXC85MC4wLjQ0MzAuOTMgU2FmYXJpXC81MzcuMzYiXSwiaG9zdCI6WyJtYWtlbW9kZWwubmV0Il19LCJob3N0IjoibWFrZW1vZGVsLm5ldCIsImlwIjoiMi4zNC4yNS45NCJ9\";</script><script src=\"/js/parking.2.72.0.js\"></script></body></html>"

According to echo $body|wc this is 1476 bytes 🤔🤔🤔. We expected to actually see 1441.

Other URLs

So, apart from the fact that the body size is not always constant (by inspection this seems to depend on pages that are not fully static, which is understandable), we see the same pattern here: the legacy TH uses a JSON encoding that escapes all HTML tags, while the new TH does not. This leads to a larger body according to the legacy TH.

Conclusion

We have a reasonable explanation for some cases where the legacy TH body is larger ✔️.

At present, it's not straightforward to compare the bodies using their size alone as a metric. To really compare them, we would need to go deeper and also modify the new TH to always return the full body.

bassosimone commented 2 years ago

Further investigating DNS differences (1/N)

New issue with new TH code:

sqlite> select * from stats where key like "dns/mismatch/new_data_none_addrs";
dns/mismatch/new_data_none_addrs|http://119.59.99.174/~net2519/
dns/mismatch/new_data_none_addrs|https://9.9.9.9/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|https://8.8.4.4/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|http://180.215.14.121/
dns/mismatch/new_data_none_addrs|https://1.0.0.1/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|http://72.14.203.104/search?hl=zh-CN&q=cache:http://www.sh.xcmc.com.cn/
dns/mismatch/new_data_none_addrs|http://72.14.203.104/
dns/mismatch/new_data_none_addrs|http://119.82.71.95/guyanatimes/
dns/mismatch/new_data_none_addrs|http://124.225.213.74/list-motss-1.shtml/
dns/mismatch/new_data_none_addrs|http://206.189.185.159/
dns/mismatch/new_data_none_addrs|https://1.1.1.3/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|https://149.112.112.9/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|https://1.0.0.3/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|https://1.1.1.1/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|https://8.8.8.8/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
dns/mismatch/new_data_none_addrs|http://212.129.24.11/
dns/mismatch/new_data_none_addrs|http://198.96.92.14/

For this set of URLs, the new TH produces a null DNS response where instead it should produce an empty list.

bassosimone commented 2 years ago

Further investigating DNS differences (2/N)

Here's a new case for which old and new differ:

======= http://www.wougnet.org/ @ http://127.0.0.1:8080/ =======
{
  "tcp_connect": {},
  "http_request": {
    "body_length": -1,
    "failure": "dns_lookup_error",
    "title": "",
    "headers": {},
    "status_code": -1
  },
  "dns": {
    "failure": "unknown_error",
    "addrs": []
  }
}
======= http://www.wougnet.org/ @ http://127.0.0.1:9020/ =======
{
  "tcp_connect": {},
  "http_request": {
    "body_length": -1,
    "failure": "dns_lookup_error",
    "title": "",
    "headers": {},
    "status_code": -1
  },
  "dns": {
    "failure": "dns_name_error",
    "addrs": []
  }
}

Let me investigate further.

Okay, the issue is quite annoyingly there:

./oohelper -server http://127.0.0.1:8080 -target http://www.wougnet.org/
{
    "tcp_connect": {},
    "http_request": {
        "body_length": -1,
        "failure": "dns_lookup_error",
        "title": "",
        "headers": {},
        "status_code": -1
    },
    "dns": {
        "failure": "unknown_error",
        "addrs": []
    }
}

Why?

So, it turns out the issue may be the following: to behave exactly like the legacy TH, we are not mapping DNS timeouts correctly. Let us assume there was a DNS timeout here, which seems reasonable: we would then end up with an unknown_error. I think this explanation is correct, because the HTTP failure is dns_lookup_error, which is another way to express NXDOMAIN by mapping it to what the old TH would do. Hence, it's reasonable to classify this kind of error as "there was a timeout during the DNS lookup, sorry". It still annoys me that this happens. I have seen that Jafar is not very efficient and uses lots of CPU. To be fair, it was never designed for efficiency, and the way I am using it in this investigation is quite beyond the use cases for which it was designed.

Yet, the interesting question is whether I can figure out the reason causing these DNS timeouts by reading Jafar's source code. Otherwise, another approach is to take Jafar out of the equation and re-run the whole list, which however did not always seem to work as intended for smaller lists. This reckoning also makes me wonder whether Jafar is actually needed here at all.

bassosimone commented 2 years ago

Further investigating DNS differences (3/N)

Yet, the interesting question is whether I can figure out the reason causing these DNS timeouts by reading Jafar's source code.

The reason is actually quite simple: Jafar's DNS-processing code is single threaded, so it cannot easily cope with the 16 parallel instances of oohelper I am running. This seems reasonable. So, if I want to keep using Jafar here, I need to modify its resolver code to use a few parallel goroutines.

This reckoning also makes me wonder whether Jafar is actually needed here at all.

So, reading Twisted 13.2.0 source code, I see the following snippet:

def createResolver(servers=None, resolvconf=None, hosts=None):
    # snip
    from twisted.names import resolve, cache, root, hosts as hostsModule
    if platform.getType() == 'posix':
        if resolvconf is None:
            resolvconf = b'/etc/resolv.conf'
        if hosts is None:
            hosts = b'/etc/hosts'
        theResolver = Resolver(resolvconf, servers)
        hostResolver = hostsModule.Resolver(hosts)
    else:
        pass  # snip
    L = [hostResolver, cache.CacheResolver(), theResolver]
    return resolve.ResolverChain(L)

So, I would conclude that, because I am using createResolver, I am actually getting a resolver chain composed of both /etc/resolv.conf-based resolution and the resolver I chose. Hence, my conclusion is that my patch to make oonib always use only 8.8.8.8:53 is not a good patch, because it also brings in systemd-resolved.

Therefore, I can dispose of Jafar here by: (1) fixing the diff for oonib so that it only uses a resolver that always queries 8.8.8.8:53 and (2) taking Jafar out of the equation. Once I've done that, I also need to zap the many URLs for which the DNS result was not the expected one, to ensure I retest all the cases where the combination of Jafar being single threaded and the measurement parallelism caused a bunch of queries to fail.

✔️

(As a side note, it's always instructive to read through the source code of Twisted <3)

bassosimone commented 2 years ago

Further investigating DNS differences (4/N)

Here's another interesting case:

======= https://tv.ennaharonline.com/ @ http://127.0.0.1:9020/ =======
{
  "tcp_connect": {},
  "http_request": {
    "body_length": -1,
    "failure": "dns_lookup_error",
    "title": "",
    "headers": {},
    "status_code": -1
  },
  "dns": {
    "failure": null,
    "addrs": []
  }
}
======= https://tv.ennaharonline.com/ @ http://127.0.0.1:8080/ =======
{
  "tcp_connect": {},
  "http_request": {
    "body_length": -1,
    "failure": "dns_lookup_error",
    "title": "",
    "headers": {},
    "status_code": -1
  },
  "dns": {
    "failure": "dns_server_failure",
    "addrs": []
  }
}

This is what happens if I run dig:

% dig tv.ennaharonline.com @1.1.1.1

; <<>> DiG 9.16.21-RH <<>> tv.ennaharonline.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27268
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;tv.ennaharonline.com.          IN      A

;; AUTHORITY SECTION:
ennaharonline.com.      1800    IN      SOA     dns200.anycast.me. tech.ovh.net. 2021070500 86400 3600 3600000 86400

;; Query time: 28 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 15 15:53:46 CEST 2021
;; MSG SIZE  rcvd: 114

So, the approach in the OONI Go codebase has always been to flag the absence of any answer as an error. On the contrary, in the same condition, the legacy TH returns an empty reply. Which one is correct? There is no clear answer here, but it may be useful for the new TH to do what the legacy TH does anyway (assuming this is actually possible).

Let us take a look at what https://github.com/ooni/probe-cli/tree/2d1666b88bb9346f533f36c1ce903e5c71984c10 does with respect to the DNS. So, we first have this snippet from internal/netxlite/serialresolver.go:

// LookupHost performs an A lookup followed by an AAAA lookup for hostname.
func (r *SerialResolver) LookupHost(ctx context.Context, hostname string) ([]string, error) {
    var addrs []string
    addrsA, errA := r.lookupHostWithRetry(ctx, hostname, dns.TypeA)
    addrsAAAA, errAAAA := r.lookupHostWithRetry(ctx, hostname, dns.TypeAAAA)
    if errA != nil && errAAAA != nil {
        return nil, errA
    }
    addrs = append(addrs, addrsA...)
    addrs = append(addrs, addrsAAAA...)
    return addrs, nil
}

Basically, if both A and AAAA fail, we return the A error. (This logic stems from the assumption that it's more likely to have an A address than to have an AAAA address.) Now, let's look at internal/netxlite/dnsdecoder.go:

func (d *DNSDecoderMiekg) DecodeLookupHost(qtype uint16, data []byte) ([]string, error) {
    reply, err := d.parseReply(data)
    if err != nil {
        return nil, err
    }
    var addrs []string
    for _, answer := range reply.Answer {
        switch qtype {
        case dns.TypeA:
            if rra, ok := answer.(*dns.A); ok {
                ip := rra.A
                addrs = append(addrs, ip.String())
            }
        case dns.TypeAAAA:
            if rra, ok := answer.(*dns.AAAA); ok {
                ip := rra.AAAA
                addrs = append(addrs, ip.String())
            }
        }
    }
    if len(addrs) <= 0 {
        return nil, ErrOODNSNoAnswer
    }
    return addrs, nil
}

So, if the Rcode is okay, we loop through the answers. If we don't find any suitable IP address, we tell the caller there has been no answer. Thus, we should be able to fix the new TH by intercepting this error case and returning an empty list rather than an error (which is currently mapped to dns_lookup_error).

Let's try.

Done! We fixed this issue in PR: https://github.com/ooni/probe-cli/pull/546.

Here's the output of the comparison restricted to the small set of data that was problematic:

{
  "dns/match/exactly": 21,
  "dns/total": 21,
  "http_body_length/mismatch": 3,
  "http_body_length/mismatch/new_th_larger": 1,
  "http_body_length/mismatch/new_th_smaller": 2,
  "http_body_length/total": 7,
  "http_failure/match": 7,
  "http_failure/total": 7,
  "http_headers/match/same_set": 6,
  "http_headers/mismatch/set_diff": 1,
  "http_headers/total": 7,
  "http_status_code/match/equal": 6,
  "http_status_code/mismatch/different": 1,
  "http_status_code/total": 7,
  "http_title/match/equal": 4,
  "http_title/mismatch/old_empty": 1,
  "http_title/mismatch/old_th_smaller": 2,
  "http_title/mismatch/total": 3,
  "http_title/total": 7,
  "processing/match/has_both_measurements": 21,
  "processing/succeeded": 21,
  "processing/total": 21,
  "tcp_connect/match": 21,
  "tcp_connect/total": 21
}

So, now we're good because DNS matches exactly ✔️

Now it's time to investigate other differences.

bassosimone commented 2 years ago

Further investigating DNS differences (5/5)

I solved more DNS related issues and branched off an issue I could not solve easily: https://github.com/ooni/probe/issues/1823.

This is the current DNS situation for most ~80% URLs in the test list:

{
  "dns/match/exactly": 25642,
  "dns/match/same_asn": 158,
  "dns/match/same_failure": 552,
  "dns/mismatch/different_failure": 1,
  "dns/mismatch/old_failure": 19,
  "dns/mismatch/other": 5,
[snip]

The different_failure case is the one for which I branched off the new issue. I still need to chase the old_failure and other cases to figure out their meaning.

So, most of them were transient. We're now looking at this:

  "dns/known_bug/old_cannot_handle_ip": 17,
  "dns/match/exactly": 25824,
  "dns/match/same_asn": 160,
  "dns/match/same_failure": 552,
  "dns/mismatch/different_failure": 3,
  "dns/mismatch/new_failure": 4,
  "dns/mismatch/old_failure": 1,

There are 17 entries where the old TH cannot handle IPs in the URL, which I think is a behaviour that the new TH does not want to mirror. Then there are 8 failures that we may need to investigate further. Overall, though, the DNS part seems to be converging. (I am continuing to scan the test lists, but I am now near cases where the domains are probably censored where I live, because every measurement is taking such a long time.)

bassosimone commented 2 years ago

HTTP headers

I also started looking into HTTP headers. I found more cases in which the sets are disjoint, so I added more headers as exceptions, but then I also measured the cases where both sets ended up empty. This is not looking good:

  "http_headers/failure/new": 883,
  "http_headers/failure/old": 5141,
  "http_headers/match/no_header_left": 9378,
  "http_headers/match/same_set": 9060,
  "http_headers/mismatch/set_diff": 925,
  "http_headers/total": 19363,

There are two issues here: (1) we have many HTTP failures for which we probably need to retry and (2) the more exception headers I add, the more I increase no_header_left. That is: set_diff and no_header_left cannot both be optimised at the same time. So perhaps the approach I am using for headers is wrong. I think we should just eliminate the headers that the probe would not consider, without adding further exceptions. Doing that now.

Okay, so this is what we have:

  "http_headers/failure/new": 885,
  "http_headers/failure/old": 5142,
  "http_headers/match/no_header_left": 4064,
  "http_headers/match/same_set": 4975,
  "http_headers/mismatch/set_diff": 10331,
  "http_headers/total": 19370,

This is, in itself, a data quality issue. In many cases, the two implementations return different headers that matter according to Web Connectivity's heuristics. This confuses new probes a bit (the extent is unclear). If we switch to the new TH, we will confuse new probes less and old probes more (which is less of a concern).

So, okay, I think there is no need to dig into this headers topic further. It's clear we cannot do much about it: the two implementations are too different, and too far apart in time, for further action to be possible.

bassosimone commented 2 years ago

HTTP status code (1/2)

Regarding the status code, after correcting to avoid noise caused by failures, here's what I get:

  "http_status_code/failure/new": 885,
  "http_status_code/failure/old": 5142,
  "http_status_code/match": 19030,
  "http_status_code/mismatch": 343,
  "http_status_code/total": 25400,

It may be worth taking a look into the mismatch cases and see whether there's anything fundamental here.

So, the first thing I noticed is that with the old TH there are URLs that return 403 and this becomes 200 with the new TH. An example is http://www.geocities.ws/marchrecce/, for which we have:

// old TH

  "http_request": {
    "body_length": 11784,
    "failure": null,
    "title": "Please Wait... | Cloudflare",
    "headers": {
        // snip
    },
    "status_code": 403

// new TH

  "http_request": {
    "body_length": 6573,
    "failure": null,
    "title": "Rendezvous en Lower Myanmar - An Inner Odyssey",
    "headers": {
        // snip
    },
    "status_code": 200

I created rules for both the 403 that becomes 200 and for the sub-case of Cloudflare and now I have:

  "http_status_code/403_becomes_200": 213,
  "http_status_code/cf": 6,
  "http_status_code/failure/new": 924,
  "http_status_code/failure/old": 5349,
  "http_status_code/match": 19420,
  "http_status_code/mismatch": 132,
  "http_status_code/total": 26044,

So, clearly the Cloudflare check was not very selective and the 403 => 200 was much more selective.

Okay, this is a data quality issue we're fixing by using the new TH ✔️

Now, what's next? we need to look into the remaining cases for which we have mismatch.

After more refinements, here's what I have:

  "http_status_code/200_becomes_503": 13,
  "http_status_code/308_becomes_200": 46,
  "http_status_code/403_becomes_200/cf": 6,
  "http_status_code/403_becomes_200/other": 217,
  "http_status_code/403_becomes_503/cf": 1,
  "http_status_code/403_becomes_503/other": 16,
  "http_status_code/503_becomes_200": 9,
  "http_status_code/failure/new": 935,
  "http_status_code/failure/old": 5474,
  "http_status_code/match": 19699,
  "http_status_code/mismatch": 49,
  "http_status_code/total": 26465,

Some insights based on the above results:

  1. we should retry all the cases in which 200 becomes 503 or 503 becomes 200, because this looks like a temporary failure that will go away if we retry and maybe our testing was a bit too aggressive

  2. in addition to 403 becoming 200 we also have 403 becoming 503, which are both data quality issues

  3. we still have 49 cases to dig into

One of the remaining cases (http://www.irna.ir/en/) is quite fascinating (edit: is a fundamental issue; see https://github.com/ooni/probe/issues/1824):

// old TH

{
  "tcp_connect": {
    "178.216.249.78:80": {
      "status": true,
      "failure": null
    },
    "217.25.48.64:80": {
      "status": true,
      "failure": null
    }
  },
  "http_request": {
    "body_length": 35805,
    "failure": null,
    "title": "IRNA English",
    "headers": {
      "Cache-Control": "max-age=60",
      "Content-Type": "text/html;charset=UTF-8",
      "Date": "Fri, 15 Oct 2021 03:11:46 GMT",
      "Expires": "Fri, 15 Oct 2021 03:12:06 GMT",
      "Server": "nginx",
      "Vary": "Accept-Encoding",
      "X-Cache-Status": "HIT"
    },
    "status_code": 200
  },
  "dns": {
    "failure": null,
    "addrs": [
      "217.25.48.64",
      "178.216.249.78"
    ]
  }
}

// New TH

{
  "tcp_connect": {
    "178.216.249.78:80": {
      "status": true,
      "failure": null
    },
    "217.25.48.64:80": {
      "status": true,
      "failure": null
    }
  },
  "http_request": {
    "body_length": 1918,
    "failure": null,
    "title": "\u0635\u0641\u062d\u0647\u0654 \u062f\u0631\u062e\u0648\u0627\u0633\u062a\u06cc \u0634\u0645\u0627 \u06cc\u0627\u0641\u062a
 \u0646\u0634\u062f.",
    "headers": {
      "Content-Language": "en",
      "Content-Length": "1918",
      "Content-Type": "text/html;charset=UTF-8",
      "Date": "Fri, 15 Oct 2021 03:11:34 GMT",
      "Server": "nginx"
    },
    "status_code": 404
  },
  "dns": {
    "failure": null,
    "addrs": [
      "217.25.48.64",
      "178.216.249.78"
    ]
  }
}

I'm going to try and see whether there are more cases like this. (I didn't really see this coming!)

Okay, after some refactoring, this is the final analysis for status code. I'll add some comments inline:

  // These are cases where apparently using the new TH causes failures
  "http_status_code/200_becomes_403": 8,
  "http_status_code/200_becomes_404": 2,
  "http_status_code/200_becomes_500": 1,
  "http_status_code/200_becomes_502": 1,
  "http_status_code/200_becomes_503": 13,
  "http_status_code/200_becomes_504": 1,

  // This 203 is interesting
  "http_status_code/203_becomes_504": 1,

  // These are cases where the old TH does not handle 308
  "http_status_code/308_becomes_200": 46,
  "http_status_code/308_becomes_203": 1,
  "http_status_code/308_becomes_404": 2,
  "http_status_code/308_becomes_451": 1,

  // These seem cases where the old TH causes data quality issues to new probes
  // and probably it is somehow classified as "very old client"
  "http_status_code/400_becomes_200": 5,
  "http_status_code/400_becomes_404": 1,
  "http_status_code/403_becomes_200/cf": 6,
  "http_status_code/403_becomes_200/other": 217,
  "http_status_code/403_becomes_402": 1,
  "http_status_code/403_becomes_404": 8,
  "http_status_code/403_becomes_503/cf": 1,
  "http_status_code/403_becomes_503/other": 16,
  "http_status_code/404_becomes_200": 3,
  "http_status_code/404_becomes_503": 2,
  "http_status_code/429_becomes_200": 1,
  "http_status_code/500_becomes_200": 4,
  "http_status_code/502_becomes_200": 4,
  "http_status_code/503_becomes_200": 9,
  "http_status_code/523_becomes_522": 1,

  // This is a case where the old code sets the status code after reading the body
  // while the new code distinguishes between the round trip and reading the body.
  // Maybe I can fix this case. Maybe re-run? This is clearly a case in which there
  // is some error after the round trip...
  "http_status_code/error_reading_body": 1,

  // This tells me I need to re-run for a bunch of URLs
  "http_status_code/failure/new": 935,
  "http_status_code/failure/old": 5474,

  // These are the cases in which we get the same status code
  "http_status_code/match": 19699,
  "http_status_code/total": 26465,

All these cases are interesting and are, additionally, a source of confusion for new probes anyway. That is, looking into what is happening here was mostly out of curiosity and for the sake of data quality.

bassosimone commented 2 years ago

HTTP status code (2/2)

One of the remaining cases (http://www.irna.ir/en/) is quite fascinating:

Well, actually, no, it isn't fascinating. It's just a data quality issue with the way in which webconnectivity works. I've explained why this is a systematic issue of webconnectivity in https://github.com/ooni/probe/issues/1824.

bassosimone commented 2 years ago

Review of the results and conclusions

The objective of this comparison was to figure out whether replacing the old webconnectivity test helper (TH) with the new one would not reduce the quality of the data produced by new OONI Probes (i.e., OONI Probes using the Go engine).

The rest of this report is organised as follows. We analyse the Go client to determine what matters. We explain how we ran the comparison. We document actions taken to reduce the diff in the JSONs produced by the two THs in varying conditions. We document some oddities we observed. We show the results and comment on each aspect. We draw the conclusions.

Analysis of the Go client

In this section, we read through the source code of the Go client. The objective here is to figure out what specific fields emitted by the TH matter the most to determine the blocking status of a measurement.

Because we've already discontinued the Python implementation of OONI Probe, we only focus on the Go codebase.

As of v3.10.1, the webconnectivity implementation:

  1. uses the string result from the control in dnsanalysis.go to determine DNS consistency

  2. performs HTTPAnalysis in httpanalysis.go

  3. aggregates a summary in summary.go

  4. only directly compares with the DNS flag inside the control in dnsanalysis.go and otherwise does not directly string match with the strings returned by the control

  5. incorrectly assumes that a status code equal to zero means failure, whereas the original TH returns -1 (see https://github.com/ooni/probe/issues/1825)

  6. skips the body length check if the response body is truncated

  7. skips the status code check if the response is 500

  8. skips many "common" headers when comparing headers

  9. uses a 128-byte limit for extracting the <title>

  10. skips the <title> check if the body is truncated

  11. still, does not set accessible and blocking if the control HTTP request failed, which somehow protects from the above mentioned issues regarding the truncated body and the incorrect handling of the status code

  12. does not seem to investigate whether a connect failure could also have happened in the control, but this issue is again mitigated by the fact that it does not set accessible and blocking if the control request fails; still, this means there is a bunch of cases in which we could say there's consistency with respect to the control and we don't

  13. when there is no DNS, TCP/IP, or TLS blocking, falls back to complicated heuristics that could maybe be made simpler by not processing the full results ex post but rather by using a model where we determine blocking right after each measurement step (this is also a good design hint for websteps, if we're able to do that)

Because of the above facts, it seems really important for us to make sure we don't have discrepancies between what the old TH and the new TH do DNS-wise. Other aspects are comparatively less important to equalize.

Methodology

I've repeatedly run the old and the new TH against a nearly full version of the test lists. To this end, I used the code at https://github.com/ooni/wcthcheck. Many commits in that repository mention this issue. There are also comments in this issue explaining how I ran the whole experiment. In a nutshell: I ran the two THs side by side on a Linux box and accessed them using oohelper, the TH client included in github.com/ooni/probe-cli.

I've also tried to use Jafar to provoke specific errors and see how the two test helpers reacted to them: https://github.com/ooni/probe/issues/1707#issuecomment-942283746.

Actions to reduce the diff between the new and the old TH

The experiments I've run led to the following PRs, which attempted to reduce the diff between the old and the new TH:

  1. https://github.com/ooni/probe-cli/pull/504

  2. https://github.com/ooni/probe-cli/pull/541

  3. https://github.com/ooni/probe-cli/pull/542

  4. https://github.com/ooni/probe-cli/pull/543

  5. https://github.com/ooni/probe-cli/pull/545

  6. https://github.com/ooni/probe-cli/pull/546

Oddities

Performing this analysis, I came across these interesting oddities that may have a data quality impact:

  1. https://github.com/ooni/probe/issues/1780

  2. https://github.com/ooni/probe/issues/1826

  3. https://github.com/ooni/probe/issues/1823

  4. https://github.com/ooni/wcthcheck/issues/1

  5. https://github.com/ooni/probe/issues/1824

Results

Here we classify by "phase" of the experiment: DNS, TCP/IP, and HTTP.

| category | mismatches | total | percentage |
| --- | --- | --- | --- |
| dns | 49 | 27873 | 0.17% |
| http_body_length | 10961 | 26405 | 41% |
| http_failure | 6608 | 26405 | 25% |
| http_headers | ... | ... | ... |
| http_status | 6766 | 26405 | 25% |
| http_title | 13862 | 26405 | 52% |
| tcp_connect | ... | ... | ... |

The total number of URLs was 27971. We're missing 16 URLs with both THs, 1 URL is missing only with the new TH, and 81 are missing only with the old TH. Because the results encompass DNS, TCP/IP, and HTTP, re-running a measurement could in some cases cause failures in one aspect (e.g., DNS) while fixing others (e.g., HTTP): as we already know, the test lists contain many not-so-reliable websites.

DNS

So, regarding the DNS, we are in a very good position and we only have very few differences. The full classification of the 49 DNS mismatches between the new and the old TH is the following:

  1. in one case the comparison failed because we could not figure out the IP's ASN, making the comparison unfeasible. It turns out this is an issue with the GeoIP database we're using: the IPs in this result actually belong to Amazon's CloudFront.

  2. in 19 cases the new TH failed and in 9 cases the old TH failed, making comparison unfeasible. It's difficult to say whether the new TH is more likely to have DNS failures. However, https://github.com/ooni/probe/issues/1823 suggests that we should probably increase the timeout of the new codebase a bit to give the upstream server we're using time to reply with SERVFAIL.

  3. in 9 cases the failure differs between the old and the new TH and in 11 cases the failure is flagged as other. I have not fully investigated these cases yet; some of them could just be limitations of the classification, where something ends up flagged as other once all the other known possibilities have been excluded.

It's also worth noting that we have 17 cases in which the old TH could not handle a URL containing an IP address. These are not flagged as failures; rather, this is a known bug, so it goes into a specific, distinct bucket.

HTTP body length

The body length is complicated because we need to see the body but the THs generally do not return it. We have seen that the old TH sometimes encodes HTML tags, which may be a data quality issue because the new probe doesn't do it. See https://github.com/ooni/probe/issues/1707#issuecomment-943290365 for more details on how I learned about this encoding issue.

HTTP failure

5574 of the 6608 failure mismatches occur because the new TH failure is null and the old TH failure is not null. The converse happens in 935 cases out of 6608. This seems to suggest that the new TH is more reliable. (Empirically, I also observed that, while the measurements were progressing, there were more failures with the old TH, since the number of missing measurements with the old TH was generally larger than the ones with the new TH.)

(Ah, and I also showed in https://github.com/ooni/probe/issues/1707#issuecomment-942283746 that the old TH does not validate TLS certificates, which is a potential data quality issue and source of differences: the Go probe will fail in this case.)

HTTP headers

I basically gave up comparing headers. We need to exclude the common set of headers that the probes ignore, and the remaining headers depend heavily on the two implementations being very different (the old TH uses Twisted 13.2.0, software released in November 2013). Adding more headers to the ignore list often led me to an empty intersection between the header sets. I took this as a signal that investigating header differences was a waste of my time. That said, the Go probe uses the same codebase as the new TH, so using the new TH seems better here.

HTTP status code

The status code matched in 19699 cases out of 26465. This leaves us with 6766 mismatches. Of these, 6409 are caused by the old TH failing for HTTP (5474 cases) or the new TH failing (935 cases). In the remaining cases the status code genuinely changes depending on the TH. I also showed a case, http://www.irna.ir/en/, in which the result depends purely on the IP address the specific TH chooses to fetch the webpage: not following all endpoints has a data quality cost (https://github.com/ooni/probe/issues/1825).

HTTP title

When discussing the title, it's important to remember that the Go probe and the new TH use a similar codebase, so they should produce more consistent results in this respect. I learned that the old TH strips non-latin characters from the title, which leads to it producing a smaller (6451 cases) or empty (6309 cases) title. Summing these two cases we obtain 12760, which is most of the 13862 mismatching cases. There are also a few cases where the new TH title is smaller or empty, which likely depend on the regexp used to extract the title (which in the Go codebase matches at most a fixed number of characters) or on the page size. Either way, doing something similar to what the probe does will serve us better.
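To illustrate the kind of bounded-regexp title extraction mentioned above, here is a sketch in Python; the 512-character bound is illustrative, not necessarily the bound used by the Go codebase:

```python
import re

# Bounded title regexp: match at most 512 characters inside <title>...</title>.
# The bound is illustrative; the Go codebase uses its own limit.
TITLE_RE = re.compile(r"<title[^>]*>([^<]{1,512})</title>", re.IGNORECASE)


def extract_title(body):
    """Return the page title, or the empty string when none is found."""
    m = TITLE_RE.search(body)
    return m.group(1).strip() if m else ""
```

A bound like this means a sufficiently long title (or one interrupted by markup) simply fails to match, yielding an empty title, which is consistent with the small residue of mismatches attributed to the new TH.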

TCP connect

I didn't spend much time looking into this topic. The probe does not directly string-match values coming from the TH, and I had already gained confidence that we can handle the common cases using Jafar.

Conclusions

I think it's fine to switch to the new TH. We will have to fix some small bugs, chiefly that our DNS timeout seems too small in one specific case, so we cannot properly get a SERVFAIL (https://github.com/ooni/probe/issues/1823). We have spent lots of time reducing the differences in the JSON structure, and the DNS results now look similar. We should fix the new Go clients to always perform the body length comparison (now both bodies should be truncated) as documented in https://github.com/ooni/probe/issues/1827. We should fix the new Go clients to properly handle the status code value as documented in https://github.com/ooni/probe/issues/1825. We should file issues for each possible data quality problem anyway and, on top of this, maybe we need a periodic process for monitoring TH quality (https://github.com/ooni/probe/issues/1829).

bassosimone commented 2 years ago

We concluded with @hellais that it's okay to replace the old TH with the new TH. The most immediate next step was to create a release of oohelperd that was suitable for doing that, which I did in https://github.com/ooni/probe/issues/1841.

It remains to figure out which issues are still open (including addressing the issues discovered as part of this one), but we can safely close this issue.