Add support for outage detection

hellais commented 4 years ago

@carrotcypher brought up on the #ooni channel that we may want some sort of very minimal probe that performs a low-throughput operation over time and sends a signal when it is unable to do so.

This is similar to what we were discussing in "Investigating Internet Blackouts. From the edge of the network", where we said:

Outage detection
As a first step we need to have some form of heuristic that allows us to understand that a
particular device is experiencing some form of network outage. This can be used as an indicator
to then trigger more fine-grained and in-depth measurements.
Since this needs to be done with a fairly high frequency, it’s crucial that what we do to detect
an outage consumes minimal amounts of network bandwidth and that we reserve the most
bandwidth intensive measurements for the follow-up stage.
Each attempt to fetch some minimal document from an HTTPS server is ~6KiB of data sent over
the wire: DNS for A and AAAA, TLS handshake and teardown. That’s ~17MiB a month if the test is
done every ~15 minutes. The value of 15min comes from the minimal inexact interval supported by
Android’s AlarmManager.
Failures should trigger follow-up measurements to ensure that it’s something that looks like
a blackout and not just a temporary OONI service failure or blockage, last-mile failure (heavy
wifi interference, or broken LAN switch, CPE failure), ISP subscription termination (e.g. quota
depletion) or network glitch.
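
The ~17MiB figure quoted above can be checked with a few lines of arithmetic. This is only a sanity check: the 30-day month is an assumption, and `monthly_mib` is a name made up for this example.

```python
KIB_PER_PROBE = 6        # ~6 KiB per HTTPS fetch, as quoted above
INTERVAL_MINUTES = 15    # probe interval, as quoted above
DAYS_PER_MONTH = 30      # assumption: a 30-day month


def monthly_mib(kib_per_probe: float, interval_minutes: int,
                days: int = DAYS_PER_MONTH) -> float:
    """Return the approximate monthly traffic in MiB for periodic probes."""
    probes = days * 24 * 60 // interval_minutes   # 2880 probes at 15 minutes
    return probes * kib_per_probe / 1024          # KiB -> MiB


print(monthly_mib(KIB_PER_PROBE, INTERVAL_MINUTES))  # 16.875
```

That is 2880 probes a month at 6 KiB each, or roughly 17 MiB.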

It is quite challenging to do this on mobile, as there are battery consumption constraints to take into account. However, with the new desktop app, maybe we can do something useful there.
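
To make the idea concrete, here is a minimal sketch of the detection step for a desktop probe: run a cheap check periodically and only escalate to follow-up measurements after several consecutive failures. Everything here is an assumption for illustration and not existing OONI code: the `OutageDetector` name, the threshold of 3, the `example.org` target, and the use of a plain TCP connect in place of the HTTPS fetch described in the quoted text.

```python
import socket
from typing import Callable


def tcp_check(host: str = "example.org", port: int = 443,
              timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical target)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OutageDetector:
    """Flags a suspected outage after `threshold` consecutive failed checks."""

    def __init__(self, check: Callable[[], bool], threshold: int = 3) -> None:
        self.check = check
        self.threshold = threshold   # assumption: 3 failures before escalating
        self.failures = 0

    def tick(self) -> bool:
        """Run one check; return True when follow-up measurements should start."""
        if self.check():
            self.failures = 0        # any success resets the counter
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

A scheduler would call `OutageDetector(tcp_check).tick()` every ~15 minutes. Requiring consecutive failures is one cheap way to filter out a single network glitch before spending bandwidth on the follow-up stage.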

carrotcypher commented 4 years ago

The idea I have is something I'll be referring to as a signal canary. Rather than sending intermittent pings (which act more like an echo beacon), it would mimic a warrant canary, where a change in state is itself considered to be information. This means that rather than making intermittent connections to test connectivity, the signal canary would maintain a constant connection to other signal canaries over a p2p network.

In this model, we can imagine canaries in each country, run by individuals and institutions, all connecting to each other to establish themselves as valid canaries. Once a significant number of established signal canaries in a specific country disconnect in a time frame and pattern consistent with network outages and blocks, the rest of the signal canaries would sound an alarm, optionally then publishing that alarm to OONI somehow for further investigation. It is important to note that this would produce some false positives, as a disconnection does not automatically signify blocks, outages, or censorship at play. However, because the signal is an existing established connection dropping rather than a connection simply failing to be established, the signal-to-noise ratio would be much higher in this model.

As this model relies on solid connections and treats disconnections as data, it is not ideal for a mobile platform implementation. Desktop users (running it either as a daemon or a browser plugin), on the other hand, might be ideal. Additionally, while it is designed with scalability through community support in mind, the bootstrap connections could always be the usual institutional supporters in different countries, providing a strong level of actionable data without any additional community participation.

Comments and criticisms to this idea are welcome.

bassosimone commented 4 years ago

I think it’s a very good idea 👍

carrotcypher commented 4 years ago

I'm looking at Noise [manual] [code] right now for a proof-of-concept. It already has all the p2p functionality a signal canary would need and is basically a p2p chatroom (as seen in the GIF below), so that channel could be used to intermittently broadcast the node's signed proof-of-freshness and, optionally, its identity too.

[GIF: 1_pnGLLKHJnM8ObccwnrkRDg]

As all participants in the channel would effectively see JOIN/QUIT events, you could simply store (and optionally routinely prune) a db of active connections with their geolocation data (either provided, sourced from a geolocation service, or redundantly both), then have code that triggers events based on specific criteria such as:

too many connections disconnecting from the same region in a specific time frame
all users disconnecting from the same region
etc
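
The criteria above could be sketched roughly as follows. This is not Noise code. The `RegionMonitor` class, the 5-minute window, the 50% threshold and the minimum of 4 peers are all assumptions picked for illustration, and the region of each peer is assumed to be supplied by the caller.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Set


class RegionMonitor:
    """Tracks JOIN/QUIT per region and raises alarms on clustered QUITs."""

    def __init__(self, window_seconds: float = 300.0,
                 quit_fraction: float = 0.5, min_peers: int = 4) -> None:
        self.window = window_seconds        # assumption: 5-minute window
        self.quit_fraction = quit_fraction  # assumption: 50% of known peers
        self.min_peers = min_peers          # ignore regions with too few peers
        self.active: Dict[str, Set[str]] = defaultdict(set)
        self.recent_quits: Dict[str, Deque[float]] = defaultdict(deque)

    def join(self, peer_id: str, region: str) -> None:
        self.active[region].add(peer_id)

    def quit(self, peer_id: str, region: str, now: float) -> Optional[str]:
        """Record a QUIT; return an alarm name if a criterion is met, else None."""
        peers = self.active[region]
        if peer_id not in peers:
            return None                     # unknown peer, nothing to record
        peers.discard(peer_id)
        quits = self.recent_quits[region]
        quits.append(now)
        while quits and now - quits[0] > self.window:
            quits.popleft()                 # forget QUITs older than the window
        # Peers still connected plus peers that left within the window.
        baseline = len(peers) + len(quits)
        if baseline < self.min_peers:
            return None
        if not peers:
            return "all-disconnected"
        if len(quits) / baseline >= self.quit_fraction:
            return "mass-disconnect"
        return None
```

Because only QUITs inside the window count, slow churn (one peer leaving every hour) never reaches the threshold, while several peers dropping within minutes does. Pruning and peers that rejoin and quit again are not handled in this sketch.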

I'll try to put together a proof-of-concept and share it here.

carrotcypher commented 4 years ago

Problems encountered going this direction:

Too many connections needed? Initially, I was thinking that all the nodes needed to do was connect to each other. Eventually this would end up as potentially 2000 nodes all connecting to each other (quite heavy and unnecessary). Assuming this is not ideal, we have to take the direction of bootstrapping and maintaining only a minimal number of connections; for this example, let's say 8.
Can't trust the data? In the model where each node connects to only 8 or so nodes at a time, you have the issue of trusting that the JOIN/QUIT data being relayed on behalf of others is in fact true and is not being fabricated or censored in any way. (A malicious node could simply not relay important data, or provide fake JOINs from a country being blocked as counter-intelligence.)
Middle ground? 8+8. The above two issues may be resolved through the principle of trust, but verify. If a new IP JOINs, you connect to it momentarily to verify that it is in fact part of the legitimate signal canary network, disconnecting afterwards. Additionally, you maintain another 8 connections that periodically cycle through recorded IPs. This is the verification step, and it can be used to combat censorship by your first 8, but also to confirm that an entity is in fact unreachable.
Additional considerations for the database? While the 8+8 solution may solve the problem of not getting the real picture, it doesn't solve the problem of making sure everyone has the same picture. This can be done by sending a hash of your current database with proof of freshness; anyone who is lagging then requests updates based on their last hash. This is similar to how blockchains verify state (block height), but there's absolutely no need to involve actual blockchains in this, just simple cryptography to prove data sanity and freshness.
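
As a rough sketch of the database comparison in the last point: hash a canonical encoding of the database, announce the hash with a timestamp, and let a peer decide whether it needs to sync. The function names and the 60-second `max_age` are made up for this example. A bare timestamp is only a stand-in for a real proof of freshness, which would also need a signature.

```python
import hashlib
import json
import time
from typing import Dict, Optional


def db_digest(db: Dict[str, dict]) -> str:
    """Hash a canonical (sorted-key) JSON encoding so equal databases hash equally."""
    encoded = json.dumps(db, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_announcement(db: Dict[str, dict], now: Optional[float] = None) -> dict:
    """Build the message a node broadcasts: digest plus an unsigned timestamp."""
    return {"digest": db_digest(db),
            "timestamp": time.time() if now is None else now}


def needs_sync(local_db: Dict[str, dict], announcement: dict,
               now: float, max_age: float = 60.0) -> bool:
    """True if the announcement is fresh and describes a different database."""
    fresh = 0 <= now - announcement["timestamp"] <= max_age
    return fresh and announcement["digest"] != db_digest(local_db)
```

Sorting the keys before hashing matters here: two nodes holding the same records in a different order must produce the same digest, otherwise every comparison would look like a mismatch.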

Will be taking this middle ground approach of 8 static and 8 cycling for further testing. Comments and criticisms welcome.
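
For reference, a minimal sketch of what the 8 static + 8 cycling bookkeeping could look like, along with the full-mesh link count it is meant to avoid. The `PeerSet` class and `full_mesh_links` helper are hypothetical names, and picking the cycling peers uniformly at random is an assumption. Actual connection handling and verification are left out.

```python
import random
from typing import List, Optional, Sequence


def full_mesh_links(nodes: int) -> int:
    """Number of links if every node connects to every other node."""
    return nodes * (nodes - 1) // 2


class PeerSet:
    """Keeps a fixed set of static peers plus a rotating set of cycling peers."""

    def __init__(self, static: Sequence[str], known: Sequence[str],
                 cycling_slots: int = 8,
                 rng: Optional[random.Random] = None) -> None:
        self.static = list(static)
        # Cycling candidates are recorded IPs we are not already pinned to.
        self.known = [ip for ip in known if ip not in self.static]
        self.cycling_slots = cycling_slots
        self.rng = rng or random.Random()
        self.cycling: List[str] = []
        self.rotate()

    def rotate(self) -> List[str]:
        """Replace the cycling connections with a fresh random sample of known IPs."""
        k = min(self.cycling_slots, len(self.known))
        self.cycling = self.rng.sample(self.known, k)
        return self.cycling

    def connections(self) -> List[str]:
        return self.static + self.cycling


print(full_mesh_links(2000))  # 1999000
```

With 2000 nodes, a full mesh needs 1,999,000 links, while 16 connections initiated per node caps the total at 32,000. Calling `rotate()` on a timer gives the periodic cycling through recorded IPs.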

carrotcypher commented 4 years ago

As of posting this (and unless otherwise updated below), I've put development of the proof-of-concept on hold for the time being as I don't have the time to contribute to it at the moment. If anyone else would like to take a stab at it, I'd be happy to share notes and discuss what I've already come to understand about the limitations, issues, and functionality ideas.

ooni / probe

Add support for outage detection #894