ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

Next steps in terms of data analysis #170

Open agrabeli opened 7 years ago

agrabeli commented 7 years ago

It would be great if the pipeline could perform the following (and expose it on OONI Explorer):

1. Separate "normal" from "anomalous" measurements. Currently, on OONI Explorer, users see "normal" and "anomalous" measurements mixed together per country (tagged in green or red). It would be useful to separate them, so that users could choose to look only at "anomalous" measurements. Perhaps we can add a filter to toggle anomalous measurements?

2. Filter web connectivity "anomalous" measurements by type of anomaly (DNS, HTTP-diff, HTTP-failure, TCP/IP). Each web connectivity "anomalous" measurement (on OONI Explorer) includes information on which specific type of anomaly was detected. It would be great if, for example, OONI Explorer included an extra filter to select web connectivity anomalous measurements (per country) by these four anomaly types.

3. Aggregate the sum of anomalies per measurement. When block pages are not detected, it's useful to look at the sum of anomalies per measurement, as the URLs (for example) that present the highest number of anomalies are more likely to be blocked (though this is not always the case). Perhaps each measurement could include the total number of anomalies that it has presented across its testing period? And perhaps web connectivity measurements could be listed in descending order, starting from the ones with the highest number of anomalies?

4. Include the testing frequency per measurement. It's useful to know when sites presented anomalies over time. For example, if a site presented an anomaly only once yesterday, but multiple times two years ago, then it's more likely that it was blocked two years ago than yesterday. Therefore I think it would be good to include the frequency of anomalies per measurement over time.
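
To make the four requests above more concrete, here is a minimal Python sketch of what the filtering and aggregation could look like. It is not the pipeline's actual code: the `Measurement` record, its field names, the anomaly labels in `ANOMALY_TYPES`, and the example URLs are all assumptions made up for illustration, and real OONI measurements carry far more detail.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical labels for the four anomaly types named in item 2.
ANOMALY_TYPES = ("dns", "http-diff", "http-failure", "tcp_ip")


@dataclass(frozen=True)
class Measurement:
    """Simplified, hypothetical record; not the real OONI schema."""
    url: str
    country: str
    day: date
    anomaly: Optional[str]  # None means "normal"


def anomalous(measurements, country=None, anomaly_type=None):
    """Items 1 and 2: keep anomalous measurements, optionally filtered."""
    return [
        m for m in measurements
        if m.anomaly is not None
        and (country is None or m.country == country)
        and (anomaly_type is None or m.anomaly == anomaly_type)
    ]


def anomaly_totals(measurements):
    """Item 3: total anomalies per URL, highest count first."""
    counts = Counter(m.url for m in measurements if m.anomaly is not None)
    return counts.most_common()


def anomalies_per_year(measurements):
    """Item 4: anomalies per URL, bucketed by calendar year."""
    buckets = defaultdict(Counter)
    for m in measurements:
        if m.anomaly is not None:
            buckets[m.url][m.day.year] += 1
    return {url: dict(years) for url, years in buckets.items()}


# Made-up sample data for illustration only.
SAMPLE = [
    Measurement("http://a.example", "IT", date(2015, 3, 1), "dns"),
    Measurement("http://a.example", "IT", date(2015, 3, 2), "dns"),
    Measurement("http://a.example", "IT", date(2017, 5, 1), "http-diff"),
    Measurement("http://b.example", "IT", date(2017, 5, 1), None),
    Measurement("http://b.example", "GR", date(2017, 5, 2), "tcp_ip"),
]

if __name__ == "__main__":
    print(len(anomalous(SAMPLE, country="IT", anomaly_type="dns")))
    print(anomaly_totals(SAMPLE))
    print(anomalies_per_year(SAMPLE))
```

Bucketing by year is only one possible granularity; a per-day or per-week bucket would follow the same pattern, and the caveat from item 3 still applies: a high count suggests blocking but does not confirm it.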

hellais commented 4 years ago

This is to some extent done, but it's useful knowledge for @FedericoCeratto I think.

bassosimone commented 4 years ago

  2. Filter web connectivity "anomalous" measurements based on the types of anomalies (DNS, HTTP-diff, HTTP-failure, TCP/IP). Each web connectivity "anomalous" measurement (on OONI Explorer) includes information on what type of specific anomaly was detected.

I would encourage us to start thinking about the anomalies attached to a measurement, rather than about the anomaly. We are approaching the point where we are able to say that, e.g., blocked.com is blocked using both DNS and TLS. I would contend that reducing this info to say that blocked.com is blocked by DNS, or blocked by TLS, is losing information.
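
As a rough illustration of the point, here is a small Python sketch comparing a set of anomalies per measurement with a single collapsed label. The `PRIORITY` ordering, the `collapse` helper, and the anomaly names are hypothetical and only serve to show where the information is lost.

```python
# Hypothetical precedence used when forcing a single label per measurement.
PRIORITY = ("dns", "tcp_ip", "tls", "http-failure", "http-diff")


def collapse(anomalies):
    """Reduce a set of anomalies to one label (lossy by construction)."""
    for name in PRIORITY:
        if name in anomalies:
            return name
    return None


# Two made-up measurements of the same site.
dns_and_tls = frozenset({"dns", "tls"})
dns_only = frozenset({"dns"})

if __name__ == "__main__":
    # Both collapse to the same single label...
    print(collapse(dns_and_tls), collapse(dns_only))
    # ...while the sets themselves stay distinguishable.
    print(dns_and_tls == dns_only)
```

Keeping the full set still allows filtering by a single type (a membership test such as `"tls" in dns_and_tls`), so the set representation supports the filter requested earlier in the thread without discarding the combined case.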