ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License

webconnectivity: server-side blocking limitations #2661

Open bassosimone opened 5 months ago

bassosimone commented 5 months ago

This issue documents a methodological limitation of Web Connectivity. I am focusing here on v0.5 of Web Connectivity, but the limitation also exists in v0.4. To explain it, we first need to define server-side blocking.

We define server-side blocking (SSB) as the situation where the probe's and the test helper's (TH) HTTP responses differ according to the HTTP diff algorithm (which checks status code, intersection of uncommon headers, distinct long words in the title, and body length) and the difference is not caused by HTTP interference but rather by the server choosing to block either the probe or the TH. A classic example of SSB is Cloudflare's CAPTCHAs, as shown by https://github.com/ooni/probe/issues/1734.
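To make the four checks concrete, here is a minimal Go sketch of an HTTP diff comparison. It is not the probe-cli implementation: the `Response` type, the `commonHeaders` list, the 5-character word threshold, the 0.7 body-length ratio, and the way the checks are combined (status code must match, plus at least one of the other three) are all assumptions made for illustration.

```go
package main

import "strings"

// Response is a hypothetical, simplified summary of an HTTP response.
type Response struct {
	StatusCode  int
	HeaderNames []string
	Title       string
	BodyLength  int
}

// commonHeaders is an assumed list of headers too widespread to be informative.
var commonHeaders = map[string]bool{
	"date": true, "content-type": true, "content-length": true,
	"server": true, "cache-control": true, "connection": true,
}

// uncommonHeaders returns the lowercased header names not in commonHeaders.
func uncommonHeaders(names []string) map[string]bool {
	out := make(map[string]bool)
	for _, n := range names {
		key := strings.ToLower(n)
		if !commonHeaders[key] {
			out[key] = true
		}
	}
	return out
}

// headersMatch is true when the responses share at least one uncommon
// header, or when neither response has any uncommon header.
func headersMatch(probe, th Response) bool {
	p, t := uncommonHeaders(probe.HeaderNames), uncommonHeaders(th.HeaderNames)
	if len(p) == 0 && len(t) == 0 {
		return true
	}
	for name := range p {
		if t[name] {
			return true
		}
	}
	return false
}

// longWords returns the set of lowercased title words of at least minLen bytes.
func longWords(title string, minLen int) map[string]bool {
	out := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(title)) {
		if len(w) >= minLen {
			out[w] = true
		}
	}
	return out
}

// titlesMatch is true when both titles contain the same set of long words
// (the threshold of 5 is an assumption).
func titlesMatch(probe, th Response) bool {
	p, t := longWords(probe.Title, 5), longWords(th.Title, 5)
	if len(p) != len(t) {
		return false
	}
	for w := range p {
		if !t[w] {
			return false
		}
	}
	return true
}

// bodyLengthsMatch is true when the shorter body is more than 70% of the
// longer one (the 0.7 ratio is an assumption).
func bodyLengthsMatch(probe, th Response) bool {
	lo, hi := probe.BodyLength, th.BodyLength
	if lo > hi {
		lo, hi = hi, lo
	}
	if hi == 0 {
		return true
	}
	return float64(lo)/float64(hi) > 0.7
}

// HTTPDiff reports whether the two responses differ under the assumed rule:
// the status codes must match and at least one other check must match.
func HTTPDiff(probe, th Response) bool {
	if probe.StatusCode != th.StatusCode {
		return true
	}
	return !(bodyLengthsMatch(probe, th) || headersMatch(probe, th) || titlesMatch(probe, th))
}
```

Note that a comparison like this cannot tell SSB apart from HTTP interference: in both cases the probe simply sees a response that differs from the TH's.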

Investigating this kind of blocking is a non-goal for Web Connectivity v0.5. It is something we would like to investigate either in a future version (e.g., v0.6) or by using the backend instead of flagging SSB directly in OONI Probe.

The https://github.com/ooni/probe-cli/pull/1476 PR introduces test cases showing what v0.5 can do in case of SSB:

  1. for http:// URLs, v0.5 reports blocking = "http-diff", accessible = false;
  2. for https:// URLs, v0.5 reports blocking = false, accessible = true, but still records that there is an HTTP difference.

This behavior stems from the fact that https:// URLs are automatically flagged as not censored if we get a webpage back, regardless of the content of the page itself. For http:// URLs, we fall back to the HTTP diff algorithm.
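The two outcomes described above can be summarized in a small Go sketch. This is not the v0.5 analysis code: the `Verdict` type and the `Classify` function are hypothetical, the empty string stands in for `blocking = false`, and the sketch assumes a webpage was successfully fetched in both cases.

```go
package main

import "strings"

// Verdict is a hypothetical summary of the fields discussed in this issue.
type Verdict struct {
	Blocking   string // "" stands in for blocking = false
	Accessible bool
	HTTPDiff   bool // whether an HTTP difference was recorded
}

// Classify mirrors the behavior described above, assuming a webpage was
// fetched and httpDiff is the result of the HTTP diff algorithm.
func Classify(url string, httpDiff bool) Verdict {
	if strings.HasPrefix(url, "https://") {
		// For https:// URLs, getting a webpage back is enough to say
		// "not censored"; the difference is only recorded.
		return Verdict{Blocking: "", Accessible: true, HTTPDiff: httpDiff}
	}
	if httpDiff {
		// For http:// URLs, we fall back to the HTTP diff algorithm.
		return Verdict{Blocking: "http-diff", Accessible: false, HTTPDiff: true}
	}
	return Verdict{Blocking: "", Accessible: true, HTTPDiff: false}
}
```

With SSB, `Classify("http://example.com/", true)` yields `blocking = "http-diff"`, which is the false positive this issue documents, while the same difference on an https:// URL leaves the verdict at accessible.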

When testing http:// URLs, Web Connectivity v0.5 also performs TLS handshakes with the same address on port 443/tcp, to check whether that address is valid for the domain. However, AFAICT, we cannot use this information to conclude that the webpage returned in case of HTTP diff is legitimate, because there could be a transparent proxy that intercepts only HTTP traffic. For this reason, I think the proper option for detecting these cases (should we want to do that in the probe) is to have signatures for well-known pages (e.g., Cloudflare).
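As a rough idea of what signature matching could look like, here is a minimal Go sketch. Everything in it is hypothetical: the `Signature` type, the `MatchSignature` function, and the two example substrings, which are illustrative guesses at Cloudflare page titles rather than a vetted signature set.

```go
package main

import "strings"

// Signature is a hypothetical fingerprint for a well-known server-side page.
type Signature struct {
	Name          string
	BodySubstring string
}

// signatures is an illustrative list; real entries would need to be
// collected and validated against actual measurements.
var signatures = []Signature{
	{Name: "cloudflare-challenge", BodySubstring: "<title>Just a moment...</title>"},
	{Name: "cloudflare-block", BodySubstring: "<title>Attention Required! | Cloudflare</title>"},
}

// MatchSignature returns the name of the first signature found in the body,
// and false when no signature matches.
func MatchSignature(body string) (string, bool) {
	for _, sig := range signatures {
		if strings.Contains(body, sig.BodySubstring) {
			return sig.Name, true
		}
	}
	return "", false
}
```

A match would suggest that an HTTP diff is more likely SSB than interference, though a substring check alone can be spoofed by a middlebox serving the same page.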