webconnectivity: handle TLS misconfigured websites

bassosimone commented 2 years ago

Let's start from the obligatory MAT search:

We see that the fraction of anomalies and failures has grown recently. To build some confidence about what is going on, let's inspect three recent measurements:

First measurement

The first measurement of the pack clearly failed inside the control:

However, the algorithm for computing the summary determines that all is good:

Why is that? Is this a bug in the probe? Well, if we look at the TLS handshakes for the probe we see:

So, it seems in this case we used an IP address where we can TLS handshake. Also, if we look at HTTPS:

Which means the HTTPS request did not fail.

This becomes clearer once we notice that x_status is 1, which means StatusSuccessSecure (see i/e/e/w/summary.go). This state is an optimization inside v0.4: we always consider a measurement successful when it uses HTTPS and we can perform a TLS handshake and fetch data, regardless of what happened in the TH.

Second measurement

The second measurement has x_status equal to 16, which means StatusAnalysisControlFailure (see i/e/e/w/summary.go). So, this also seems an acceptable outcome. Because the control failed, we flag the measurement as errored. We could of course do better, for example by recognizing that the website is down when both the control and the probe failed in the same way.

Third measurement

This measurement actually looks good. The control measurement succeeded (WTF?!) while the probe failed because of ssl_invalid_hostname. So, the measurement itself is flagged as anomaly. And, yeah, this feels like an anomaly.

Preliminary conclusion for v0.4

I think this issue is worth more investigation, but time is limited. So, for now, let's spell out a preliminary conclusion.

We think that what is happening here is that, globally, only some addresses are correctly configured for the SNI that lives inside the URL. This means that probes and THs could fail or succeed depending on which addresses the DNS gives to them. I think all the probes here behave reasonably, given the circumstances.

It would be cool to introduce inside v0.5 a state where we recognize that TLS is misconfigured for both the probe and the TH and hence say something along the lines of "website down".

v0.5 measurements

We have fewer v0.5 measurements at the moment. Here's one run by me a few days ago.

This measurement falls into the case where both the TH and the probe fail. So, it's a good candidate for detecting this corner case and declaring that the website is down because of TLS misconfiguration.

First measurement

The first v0.5 measurement I'd like to show you is a case of anomaly. For some reason (maybe my IP address changed at last?) I am now using 1.th.ooni.org rather than 0.th.ooni.org by default.

So, in the control the TLS handshake succeeds:

However, the probe sees this:

And so it seems justified to say:

[      2.712213] <warn> TLS: endpoint 204.79.197.219:443 is blocked (see #4): ssl_invalid_hostname
[      2.712223] <warn> ANOMALY: flags=4 accessible=false, blocking=http-failure

Second measurement

I ran another measurement forcing the backend to be 0.th.

Here's the control part of the TLS handshake:

Here's the probe's part:

So, with the new patch I'm working on (and will commit soon) we obtain:

[      0.977307] <info> website likely down: all TLS handshake attempts failed for both probe and TH
[      0.977323] <info> WEBSITE_DOWN_TLS: flags=0, accessible=false, blocking=false
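The distinction between the two log outputs above can be sketched as follows. This is only an illustration of the rule stated in the log line ("all TLS handshake attempts failed for both probe and TH"), not the actual patch: the `handshake` type, the helper functions, and the "OTHER" label are hypothetical, and the sample inputs in `main` are illustrative values rather than data taken from the measurements.

```go
package main

import "fmt"

// handshake is a hypothetical record of one TLS handshake attempt.
type handshake struct {
	Endpoint string // for example "204.79.197.219:443"
	Failure  string // empty string means the handshake succeeded
}

// allFailed reports whether there is at least one attempt and every
// attempt failed.
func allFailed(hs []handshake) bool {
	if len(hs) == 0 {
		return false
	}
	for _, h := range hs {
		if h.Failure == "" {
			return false
		}
	}
	return true
}

// anySucceeded reports whether at least one attempt succeeded.
func anySucceeded(hs []handshake) bool {
	for _, h := range hs {
		if h.Failure == "" {
			return true
		}
	}
	return false
}

// classify returns WEBSITE_DOWN_TLS when every handshake failed for
// both the probe and the TH, ANOMALY when only the probe failed while
// the TH succeeded, and the placeholder OTHER for everything else.
func classify(probe, th []handshake) string {
	switch {
	case allFailed(probe) && allFailed(th):
		return "WEBSITE_DOWN_TLS"
	case allFailed(probe) && anySucceeded(th):
		return "ANOMALY"
	default:
		return "OTHER"
	}
}

func main() {
	bad := handshake{Endpoint: "204.79.197.219:443", Failure: "ssl_invalid_hostname"}
	good := handshake{Endpoint: "204.79.197.219:443"}

	// Illustrative inputs: TH succeeds while the probe fails, then both fail.
	fmt.Println(classify([]handshake{bad}, []handshake{good})) // prints ANOMALY
	fmt.Println(classify([]handshake{bad}, []handshake{bad}))  // prints WEBSITE_DOWN_TLS
}
```

Requiring at least one TH handshake before deciding either way keeps a measurement with no control data out of both the "website down" and the "anomaly" buckets.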

Conclusion

We can surely mitigate the issue on the probe side by saying "website down" rather than "measurement failed" in the cases where it makes sense to do so. I am still a bit uneasy about the same TCP endpoint providing different results depending on which TH is measuring, which probably hints at anycast or region-specific configurations. What I am trying to say here is that this bing-amp.com URL inside the test list will increase the noise we see, and more noise leaves more room for errors when analyzing the data. It would be interesting to discuss with @hellais whether and how this could change if we adopt the https://github.com/ooni/data approach as a foundation for measurements.

bassosimone commented 1 year ago

We already have an issue where we describe hopping between failure and success depending on the TH we use: https://github.com/ooni/probe/issues/2298.

We also have fixed the issue for webconnectivity v0.5.

So, I am going to flag this issue as fixed for webconnectivity v0.5 and we'll close it once all users are using this version of webconnectivity.

bassosimone commented 8 months ago

Okay, we have fixed this in Web Connectivity LTE, hence we can close this issue. We have a specific test case for this that runs on every commit and ensures we keep detecting this common case.
