ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

Set up a sync for the NDT and DASH network measurement data #180

Open agrabeli opened 7 years ago

agrabeli commented 7 years ago

So it's exciting that NDT and DASH have been rolled out and integrated into OONI Probe mobile apps, and that we're collecting loads of network measurements from these tests on a daily basis.

But how will this data be analyzed? How will we extract value from it? What is the data analysis methodology?

I'm opening this ticket to encourage brainstorming and discussions around the creation of a methodology to analyze the incoming NDT and DASH network measurement data.

We may be interested in creating a data analysis methodology geared towards (if possible):

If we are able to successfully develop a methodology for examining and detecting the above, we would be in a unique position to support internet freedom community members who are working on these issues.

cc @bassosimone @darkk @hellais

hellais commented 5 years ago

This is something which is also very important for OONI Explorer.

The current status is that data from the MLab tests was ingested into the MetaDB as a one-off, so it will soon become stale.

If you look at the #mlab channel backlog you can see what has been done so far, but I will try to recap some of it here.

The easiest way to ingest MLab measurement data is to use the BigQuery tables and run queries like: https://github.com/m-lab/mlab-vis-pipeline/tree/master/dataflow/data/bigtable/queries.

In particular the queries we are mostly interested in are something along the lines of:

#legacySQL
SELECT client_asn_number,
       client_asn_name,
       client_country_code,
       SUM(rtt_sum) / SUM(rtt_count) AS rtt_avg,
       AVG(packet_retransmit_rate) AS retransmit_avg,
       nth(51, quantiles(download_speed_mbps, 101)) AS download_speed_mbps_median,
       nth(51, quantiles(upload_speed_mbps, 101)) AS upload_speed_mbps_median,
       COUNT(*) AS count
FROM [mlab-oti.data_viz.all_ip_by_hour]
WHERE LENGTH(client_asn_name) > 0
  AND LENGTH(client_asn_number) > 0
  AND local_test_date >= '2018-03-01 00:00:00'
  AND local_test_date < '2018-04-01 00:00:00'
GROUP BY client_asn_number,
         client_asn_name,
         client_country_code
ORDER BY client_country_code;

I did this manually at some point and populated the table of the MetaDB called ooexpl_netinfo.

What needs to happen is to set up some system that runs this sync automatically and periodically, so that the data stays fresh.

I have a Google account which has query permission on the BigQuery table, and I confirmed with the folks at MLab that it is fine if we run the above query on their dataset every 24h.
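For a periodic run, the date bounds in the query above would have to be generated rather than hard-coded. A minimal sketch, assuming the sync keeps the same calendar-month window as the manual run; `month_window` and `build_query` are hypothetical names, and the query is abbreviated to a few columns for brevity:

```python
from datetime import date
from typing import Tuple

# Abbreviated form of the query above. Only the two date placeholders
# change between runs; the column list is shortened for illustration.
QUERY_TEMPLATE = """#legacySQL
SELECT client_asn_number,
       client_asn_name,
       client_country_code,
       COUNT(*) AS count
FROM [mlab-oti.data_viz.all_ip_by_hour]
WHERE local_test_date >= '{start} 00:00:00'
  AND local_test_date < '{end} 00:00:00'
GROUP BY client_asn_number, client_asn_name, client_country_code
"""


def month_window(today: date) -> Tuple[date, date]:
    """Return [start, end) for the last complete calendar month before `today`."""
    end = today.replace(day=1)
    # Step back one month, handling the January -> December wrap.
    if end.month == 1:
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start, end


def build_query(today: date) -> str:
    """Fill the template with the window computed from `today`."""
    start, end = month_window(today)
    return QUERY_TEMPLATE.format(start=start.isoformat(), end=end.isoformat())
```

Running `build_query(date.today())` from a daily job would always target the most recent full month. Submitting the query to BigQuery and writing the result into ooexpl_netinfo is left out here.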

@FedericoCeratto any thoughts on this and on where we could put this logic?

hellais commented 4 years ago

This is also related to: https://github.com/ooni/backend/issues/140

hellais commented 3 years ago

I reached out to M-Lab about how we could go about doing this integration. They now have a new statistics backend which is documented here: https://github.com/m-lab/stats-pipeline/#statistics-pipeline-service.

This looks easier than integrating queries with BigQuery. Here are some questions I asked the M-Lab folks, with their answers:

Could we query the https://statistics.measurementlab.net/ endpoint directly from OONI Explorer when the country page is loaded or are there load issues we should take into account?

OONI Explorer could load data from various endpoints on load, though you will likely want to experiment with load times and responsiveness, and perhaps cache statistics periodically. The same statistics are also available in BigQuery tables, and could be useful if you decide page load responsiveness is too slow.
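The "cache statistics periodically" suggestion could be sketched as a small time-based cache in front of whatever function fetches the statistics. This is only an illustration: `TTLCache` is a hypothetical helper, not part of any OONI or M-Lab code, and the fetch function is injected so that the sketch stays independent of the HTTP client used.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache the result of `fetch(key)` for `ttl` seconds per key."""

    def __init__(self, fetch: Callable[[str], Any], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # entry is still fresh, skip the fetch
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

With a TTL of 24 hours, each statistics URL would be requested from M-Lab at most once a day per process, regardless of how many country pages are loaded.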

Would it be possible to remove the continent_code identifier from the API endpoints and allow querying directly by ISO3166-2 country codes?

Removing continent_code isn't likely since the API is basically presented as a series of files following a well defined URL structure. You could likely get around this by using queries in BigQuery, but the request from OONI Explorer would need to include the continent code / country code / pattern preceding the ISO3166-2 region codes
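Since the continent code has to be part of the request, the caller needs a country-to-continent lookup. A rough sketch, assuming the URL layout of the Italy examples quoted later in this comment; `stats_url` and `CONTINENT_BY_COUNTRY` are hypothetical names, and only the IT entry is taken from this thread (the other two entries are assumed):

```python
from typing import Optional

BASE_URL = "https://statistics.measurementlab.net/v0"

# Deliberately tiny, illustrative table. "IT" -> "EU" matches the example
# URLs in this thread; the other entries are assumptions. A real table
# would need to cover every country.
CONTINENT_BY_COUNTRY = {"IT": "EU", "US": "NA", "BR": "SA"}


def stats_url(country: str, year: int, region: Optional[str] = None) -> str:
    """Build the daily-stats URL for a country, optionally narrowed to a region."""
    continent = CONTINENT_BY_COUNTRY[country]  # raises KeyError if unmapped
    parts = [BASE_URL, continent, country]
    if region is not None:
        parts.append(region)  # ISO3166-2 region code such as "IT-45"
    parts += [str(year), "histogram_daily_stats.json"]
    return "/".join(parts)
```

This keeps the continent code an internal detail, so the rest of OONI Explorer can keep working with country codes only.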

Do you plan to also expose aggregate statistics for a given calendar year (i.e. the median download & upload speed for a given country in a given year)?

The output format consists of a full year (or a partial year for the current year) of per-day aggregations. For each aggregate geography, such as ISO3166-2 country/region codes, there are 8 rows per day. The 8 rows provide the frequencies of upload and download tests that fell within 8 log-scale buckets, to be used to show a histogram of the tests in an aggregation on that day. Additionally, each of the 8 rows contains the descriptive statistics for that day overall. This example visualization for United States Counties might help in showing how we're using this data format now. You can specify the year at each aggregation level. Currently 2019, 2020, and 2021 are available.

For example, for Italy we could load 2019 with https://statistics.measurementlab.net/v0/EU/IT/2019/histogram_daily_stats.json or, if you wanted statistics for one of Italy's ISO3166-2 first-level in-country regions: https://statistics.measurementlab.net/v0/EU/IT/IT-45/2019/histogram_daily_stats.json and for a particular city within a region: https://statistics.measurementlab.net/v0/EU/IT/IT-45/Formingnana/2019/histogram_daily_stats.json
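Because the endpoint returns per-day rows rather than a yearly aggregate, a yearly figure would have to be derived on the OONI side. A rough sketch follows, with loud assumptions: the rows are taken to be a list of JSON objects, and the field names `date` and `download_MED` are guesses that should be checked against the actual response. Note also that this computes the median of the daily medians, which is not the same as the true median over all tests in the year.

```python
from statistics import median
from typing import Any, Dict, Iterable, Mapping


def yearly_median_of_daily(rows: Iterable[Mapping[str, Any]],
                           date_key: str = "date",
                           value_key: str = "download_MED") -> float:
    """Median of the per-day values, counting each day exactly once.

    `date_key` and `value_key` are assumed field names, not confirmed ones.
    """
    per_day: Dict[str, float] = {}
    for row in rows:
        # Every bucket row of a day repeats the same daily statistic, so
        # keeping one value per date removes the 8x duplication.
        per_day[row[date_key]] = float(row[value_key])
    if not per_day:
        raise ValueError("no rows to aggregate")
    return median(per_day.values())
```

An exact yearly median would need the raw measurements, which is where the BigQuery tables mentioned earlier would still be useful.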

Is there some way, given a country, to obtain the statistics broken down by network (ASN)?

Yes! ASN is included at each supported level, and at the global level. For the examples above, getting per-ASN statistics in each geography would look like: https://statistics.measurementlab.net/v0/EU/IT/asn//2019/histogram_daily_stats.json or, if you wanted statistics for one of Italy's ISO3166-2 first-level in-country regions: https://statistics.measurementlab.net/v0/EU/IT/IT-45/asn//2019/histogram_daily_stats.json and for a particular city within a region: https://statistics.measurementlab.net/v0/EU/IT/IT-45/Formingnana/2019/histogram_daily_stats.json. ASN aggregation at the global level is also available: https://statistics.measurementlab.net/v0/asn//2019/histogram_daily_stats.json