ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License

torsf: data quality issues #2063

Open hellais opened 2 years ago

hellais commented 2 years ago

I am doing a bit of digging into the torsf experiment results to validate the quality of the data and will document here some of the issues I encounter.

Unless otherwise specified I am using a sample of 5800 measurements from 2021-01-01 until 2021-03-28.

Missing tor_logs key

Several measurements have an empty tor_logs key, yet their bootstrap_time is set to a smallish, non-zero number.

In my sample I encountered this issue in 601 measurements.

Here is a sample measurement showing this problem: https://explorer.ooni.org/measurement/20220315T163942Z_torsf_IL_8551_n1_Ph5BqdODbRTImFF4

The software_name for these is exclusively ooniprobe-desktop and ooniprobe-cli.

bootstrap_time set to 0

I encountered a total of 1190 measurements that have the bootstrap_time key set to 0, even though the tor_logs list makes it clear that the actual bootstrap time was non-zero.

Here is a sample measurement: https://explorer.ooni.org/measurement/20220315T163627Z_torsf_TR_47524_n1_DLTJY7TM2wdcoS4Z

The software names for this are: 'ooniprobe-desktop', 'ooniprobe-android', 'ooniprobe-cli', 'miniooni', 'ooniprobe-android-dev-debug', so it's not isolated to a specific platform.
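The two inconsistencies described so far (empty tor_logs with a non-zero bootstrap_time, and bootstrap_time equal to 0 with a non-empty tor_logs) can be checked mechanically. The following is a minimal sketch, not the script used to produce the counts above. It assumes that bootstrap_time and tor_logs sit under test_keys and that software_name is a top-level field; the type and function names (Measurement, classify, countBySoftware) and the labels they return are hypothetical.

```go
package main

import "fmt"

// TorsfKeys holds the two torsf test keys discussed in this issue.
// Assumption: both live under "test_keys" in the measurement JSON.
type TorsfKeys struct {
	BootstrapTime float64  `json:"bootstrap_time"`
	TorLogs       []string `json:"tor_logs"`
}

// Measurement is a hypothetical, heavily trimmed view of a torsf measurement.
type Measurement struct {
	SoftwareName string    `json:"software_name"`
	TestKeys     TorsfKeys `json:"test_keys"`
}

// classify labels a measurement with the inconsistency it shows, if any.
func classify(m Measurement) string {
	hasLogs := len(m.TestKeys.TorLogs) > 0
	switch {
	case !hasLogs && m.TestKeys.BootstrapTime > 0:
		return "missing_tor_logs"
	case hasLogs && m.TestKeys.BootstrapTime == 0:
		return "zero_bootstrap_time"
	default:
		return "ok"
	}
}

// countBySoftware counts, per software_name, the measurements with the given label.
func countBySoftware(ms []Measurement, label string) map[string]int {
	counts := make(map[string]int)
	for _, m := range ms {
		if classify(m) == label {
			counts[m.SoftwareName]++
		}
	}
	return counts
}

func main() {
	// Made-up sample data, only to exercise the checks.
	sample := []Measurement{
		{SoftwareName: "ooniprobe-cli", TestKeys: TorsfKeys{BootstrapTime: 1.5}},
		{SoftwareName: "miniooni", TestKeys: TorsfKeys{TorLogs: []string{"Bootstrapped 100% (done): Done"}}},
	}
	fmt.Println(countBySoftware(sample, "missing_tor_logs"))
	fmt.Println(countBySoftware(sample, "zero_bootstrap_time"))
}
```

Grouping by software_name, as done here, is what makes it possible to tell whether a problem is isolated to one platform.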

unknown_failures

There are a total of 182 unknown failures in the measurements.

Related to control port communication issues: https://explorer.ooni.org/measurement/20220308T055551Z_torsf_SE_8473_n1_6lloCUgQ1FMmQI1c

36 measurements show this error, all with the software_name ooniprobe-desktop.

Related to not being able to find the tor executable in the path: https://explorer.ooni.org/measurement/20220321T085633Z_torsf_FI_1741_n1_VkJcBmILz5INHAxP

97 measurements show this issue, with software_name either ooniprobe-desktop or ooniprobe-cli.

The platforms for which this happens are: 'linux', 'openbsd', 'unknown', 'windows'

bassosimone commented 2 years ago

I have also seen this:

Apr 29 11:40:44.000 [notice] Bootstrapped 0%!((MISSING)starting): Starting

🤦 😢 😞

(investigation happens)

Ah, okay: this is a problem caused by how we print output (in cmd/internal/output/output.go):

// MeasurementJSON prints the JSON of a measurement
func MeasurementJSON(j map[string]interface{}) {
    log.WithFields(log.Fields{
        "type":             "measurement_json",
        "measurement_json": j,
    }).Info("Measurement JSON")
}

I think this code is wrong. We should just fmt.Printf("%s\n", string(serializedJSON)). What we're doing there is not good because we're basically using the JSON as a format string. It may be worth digging further into whether the problem occurs in apex/log or in our wrappers around apex/log, since that may highlight further underlying issues.
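The garbled log line is what Go's fmt package produces when a string containing a percent sign is used as a format string with no arguments. Here is a minimal sketch that reproduces it, assuming the original tor line read "Bootstrapped 0% (starting): Starting". The two helper functions are hypothetical and only stand in for whichever layer (apex/log or our wrappers) ends up formatting the message; they are not the actual code path.

```go
package main

import "fmt"

// formatLikeLogger mimics a hypothetical logging layer that treats its
// message argument as a printf-style format string.
func formatLikeLogger(msg string, args []any) string {
	return fmt.Sprintf(msg, args...)
}

// formatVerbatim passes the message as data, so percent signs survive.
func formatVerbatim(msg string) string {
	return fmt.Sprintf("%s", msg)
}

func main() {
	torLine := "Bootstrapped 0% (starting): Starting"

	// "% (" is parsed as a verb with a space flag and no matching argument.
	fmt.Println(formatLikeLogger(torLine, nil))
	// prints: Bootstrapped 0%!((MISSING)starting): Starting

	fmt.Println(formatVerbatim(torLine))
	// prints: Bootstrapped 0% (starting): Starting
}
```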

So, in conclusion, this problem is not related to the original issue and it requires a new issue describing it.

(Created issue at https://github.com/ooni/probe/issues/2082)

bassosimone commented 2 years ago

Based on a conversation with @meskio we had today, this issue should depend on https://github.com/ooni/probe/issues/2017

PoleTransformer commented 2 months ago

Current Tor Snowflake measurements are unusable. There is a large number of anomalies even in uncensored countries: https://explorer.ooni.org/chart/mat?since=2024-07-08&until=2024-08-08&time_grain=day&axis_x=measurement_start_day&test_name=torsf

This started on September 20, 2023, when the domain front cdn.sstatic.net switched to Cloudflare: https://forum.torproject.org/t/problems-with-snowflake-since-2023-09-20-broker-failure-unexpected-error-no-answer/9346

And this matches with the MAT data as well, anomalies starting on the 20th: https://explorer.ooni.org/chart/mat?since=2023-09-19&until=2024-08-08&time_grain=day&axis_x=measurement_start_day&test_name=torsf

I see that the domain front is hard coded: https://github.com/ooni/probe-cli/blob/master/internal/ptx/snowflake.go

The last version of probe-cli that works is v3.19.0. The domain front is foursquare.com, which for now is still hosted by Fastly.

To prevent these problems in the future, would it be possible to fetch the domain front from a centralized server such as the OONI API? That would allow the domain front to be updated as hosting providers change.

There are still thousands of false positives being submitted to the MAT to this day. Although the test is "disabled", users can still override the check with an environment variable. Could the server block any submissions where the probe version is not high enough (< v3.19.0)?
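A rough sketch of what such a server-side check could look like follows. It is not existing OONI backend code: the function names are hypothetical, and it assumes the submission carries a test name and a probe-cli style MAJOR.MINOR.PATCH version string. Apps with their own version numbering would need a separate mapping.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "3.19.0" or "v3.19.0" into [3 19 0].
// Pre-release and build suffixes such as "-beta" are dropped.
func parseVersion(s string) ([3]int, error) {
	var out [3]int
	s = strings.TrimPrefix(s, "v")
	if i := strings.IndexAny(s, "-+"); i >= 0 {
		s = s[:i]
	}
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", s, err)
		}
		out[i] = n
	}
	return out, nil
}

// olderThan reports whether version a sorts strictly before version b.
func olderThan(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// rejectTorsf reports whether a submission should be refused: only torsf
// measurements from probes older than v3.19.0 (the threshold proposed above).
// Unparseable versions are rejected too, which is a conservative assumption.
func rejectTorsf(testName, softwareVersion string) bool {
	if testName != "torsf" {
		return false
	}
	got, err := parseVersion(softwareVersion)
	if err != nil {
		return true
	}
	return olderThan(got, [3]int{3, 19, 0})
}

func main() {
	for _, v := range []string{"3.18.1", "3.19.0", "v3.20.1"} {
		fmt.Println(v, rejectTorsf("torsf", v))
	}
}
```

Comparing the numeric components rather than the raw strings matters here, since a plain string comparison would sort "3.9.0" after "3.19.0".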

meskio commented 2 months ago

> I see that the domain front is hard coded: https://github.com/ooni/probe-cli/blob/master/internal/ptx/snowflake.go
>
> The last version of probe-cli that works is: v3.19.0. The domain front is foursquare.com, which for now is still hosted by Fastly.
>
> To prevent these problems in the future, would it be possible to implement something to fetch the domain front from a centralized server like ooni API? This allows updates to the domain front as hosting providers change in the future.

There is an API with the latest snowflake bridge line that the ooni API could use: https://bridges.torproject.org/moat/circumvention/builtin
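As a sketch of how a consumer might pull the domain fronts out of such a response: the code below assumes the endpoint returns a JSON object mapping each transport name to a list of bridge lines, and that a snowflake bridge line carries its fronts in a front= or fronts= field. Both are assumptions that should be checked against the actual API, and snowflakeFronts is a hypothetical helper. The sample payload is made up and no network request is performed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// snowflakeFronts extracts the domain fronts from the "snowflake" entry of a
// builtin-bridges JSON document. The document shape and the front=/fronts=
// field names are assumptions, not a verified description of the API.
func snowflakeFronts(payload []byte) ([]string, error) {
	var builtin map[string][]string
	if err := json.Unmarshal(payload, &builtin); err != nil {
		return nil, err
	}
	var fronts []string
	seen := make(map[string]bool)
	for _, line := range builtin["snowflake"] {
		for _, field := range strings.Fields(line) {
			key, value, ok := strings.Cut(field, "=")
			if !ok || (key != "front" && key != "fronts") {
				continue
			}
			// A fronts= field may list several comma-separated domains.
			for _, f := range strings.Split(value, ",") {
				if f != "" && !seen[f] {
					seen[f] = true
					fronts = append(fronts, f)
				}
			}
		}
	}
	return fronts, nil
}

func main() {
	// Made-up payload, not a real API response.
	sample := []byte(`{"snowflake": ["snowflake 192.0.2.3:80 0000000000000000000000000000000000000000 url=https://broker.example/ fronts=front-a.example,front-b.example"]}`)
	fronts, err := snowflakeFronts(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(fronts)
}
```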