Closed bassosimone closed 8 months ago
We already have an issue where we describe hopping between failure and success depending on the TH we use: https://github.com/ooni/probe/issues/2298.
We also have fixed the issue for webconnectivity v0.5.
So, I am going to flag this issue as fixed for webconnectivity v0.5, and we'll close it once all users are running this version of webconnectivity.
Okay, we have fixed this in Web Connectivity LTE, hence we can close this issue. We have a specific test case that runs for every commit and ensures we keep detecting this common case.
Let's start from the obligatory MAT search:
We see a recent growth in the fraction of anomalies and failures. To start getting some confidence about what is going on there, let's inspect three recent measurements:
a successful measurement using v0.4.0
a failed measurement using v0.4.1
an anomalous measurement using v0.4.0
First measurement
The first measurement of the pack clearly failed inside the control:
However, the algorithm for computing the summary determines that all is good:
Why is that? Is this a bug in the probe? Well, if we look at the TLS handshakes for the probe we see:
So, it seems that in this case we used an IP address with which the TLS handshake succeeds. Also, if we look at HTTPS:
Which means the HTTPS request did not fail.
So, it should now be clearer what we see once we notice that x_status is 1, which means StatusSuccessSecure (see i/e/e/w/summary.go). This state is an optimization inside v0.4 where we always consider successful a measurement using HTTPS where we can perform a TLS handshake and fetch data, regardless of what happened in the TH.
Second measurement
The second measurement has x_status equal to 16, which means StatusAnalysisControlFailure (see i/e/e/w/summary.go). So, this also seems an acceptable outcome. Because the control failed, we flag the measurement as errored. We could of course do better, like recognizing that the website is down if both the control and the probe failed in the same way.
Third measurement
This measurement actually looks good. The control measurement succeeded (WTF?!) while the probe failed because of ssl_invalid_hostname. So, the measurement itself is flagged as anomaly. And, yeah, this feels like an anomaly.
Preliminary conclusion for v0.4
I think this issue is worth more investigation, but time is limited. So, for now, let's spell out a preliminary conclusion.
We think that, globally, only some addresses are configured to correctly handle the SNI derived from the URL. This means that probes and THs may fail or succeed depending on which addresses the DNS gives them. I think all the probes here behave reasonably, given the circumstances.
It would be cool to introduce inside v0.5 a state where we recognize the TLS is misconfigured for both the probe and the TH and hence say something along the lines of "website down".
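As a rough illustration of that idea, the following Go sketch classifies a measurement from the probe's and the TH's failure strings. It is not the actual summary.go logic: the measurement type, the outcome names and classifyMeasurement are hypothetical, and treating "same failure on both sides" as "website down" is the assumption under discussion here.

```go
package main

import "fmt"

// outcome is a hypothetical, simplified result label.
type outcome string

const (
	outcomeSuccess     outcome = "success"
	outcomeAnomaly     outcome = "anomaly"
	outcomeWebsiteDown outcome = "website down"
	outcomeFailure     outcome = "measurement failed"
)

// measurement holds the failure seen on each side; an empty string
// means that side completed the TLS handshake and fetched the page.
type measurement struct {
	probeFailure   string
	controlFailure string
}

// classifyMeasurement sketches the proposed decision order.
func classifyMeasurement(m measurement) outcome {
	switch {
	case m.probeFailure == "":
		// Mirrors the v0.4 optimization: HTTPS worked for the probe,
		// so the TH result is not consulted.
		return outcomeSuccess
	case m.controlFailure == "":
		// The TH succeeded while the probe failed.
		return outcomeAnomaly
	case m.probeFailure == m.controlFailure:
		// Proposed state: both sides failed in the same way.
		return outcomeWebsiteDown
	default:
		// Both failed, but differently: we cannot compare.
		return outcomeFailure
	}
}

func main() {
	cases := []measurement{
		{probeFailure: "", controlFailure: "ssl_invalid_hostname"},
		{probeFailure: "ssl_invalid_hostname", controlFailure: ""},
		{probeFailure: "ssl_invalid_hostname", controlFailure: "ssl_invalid_hostname"},
	}
	for _, c := range cases {
		fmt.Printf("probe=%q control=%q -> %s\n",
			c.probeFailure, c.controlFailure, classifyMeasurement(c))
	}
}
```

Comparing failure strings for equality is the simplest possible notion of "failed in the same way"; a real implementation would likely need a more careful comparison.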
v0.5 measurements
We have fewer v0.5 measurements at the moment. Here's one run by me a few days ago.
This measurement falls into the case where both the TH and the probe fail. So, it's a good candidate for detecting this corner case and declaring that the website is down because of TLS misconfiguration.
First measurement
The first v0.5 measurement I'd like to show you is a case of anomaly. For some reason (maybe my IP address changed at last?) I am now using 1.th.ooni.org rather than 0.th.ooni.org by default.
So, in the control the TLS handshake succeeds:
However, the probe sees this:
And so it seems justified to say:
Second measurement
I ran another measurement forcing the backend to be 0.th.
Here's the control part of the TLS handshake:
Here's the probe's part:
So, with the new patch I'm working on (and will commit soon) we obtain:
Conclusion
We can surely mitigate the issue on the probe side by saying "website down" rather than "measurement failed" in the cases in which it makes sense to do so. I am still a bit uneasy about the same TCP endpoint providing different results depending on which TH is measuring, which probably hints at anycast or region-specific configurations. What I am trying to say here is that this bing-amp.com URL inside the test list will increase the noise we see. More noise creates more room for errors when analyzing the data. It would be interesting to discuss with @hellais whether and how this could change if we adopt the https://github.com/ooni/data approach as a foundation for measurements.