Improve availability of our infrastructure aka "oonification of probe services"

hellais commented 5 years ago

We have been working on doing a bit of research into how we can improve the availability of the OONI probe services (PS) infrastructure (bouncer, collector, registry, orchestrate, test-helpers).

In this issue I will document where we are at and what the next steps to roll this out should be.

The core problem is:

We have some bits of OONI infrastructure that if they go down, we lose OONI data: the collector and the bouncer. How do we solve this problem?

To address this, we have concluded that it's best to drastically simplify our infrastructure and run all services required by probes (bouncer, collector, registry, orchestrate, test-helpers) on a single VM, with one such VM in each of the supported locations: Miami (MIA), Amsterdam (AMS), Hong Kong (HKG).

The probe logic should also be adjusted so that, when attempting to communicate with a probe service, it tries in order: anycast DNS, DNS, IP address, and then, as a last resort, some form of circumvention tool.

This should also lead to a lower RTT, making test runs faster, and to increased reliability.

In practice we need to look at all of the following services and find a way to make them work nicely on a single host:

bouncer
collector
registry
orchestrate
web_connectivity test-helper

An nginx reverse proxy will be placed in front of these, mapping all requests to the required backends. This requires looking at each of these components and ensuring that there are no conflicts between their API endpoints.

For naming, I propose we call the single-host entry points:

hkg-ps.ooni.io
ams-ps.ooni.io
mia-ps.ooni.io

Where PS stands for Probe Services.

I did some back-of-the-envelope calculations to ensure the VMs don't currently require so many resources that it's not possible to run them on a single host:

Host | Average memory usage (bytes) | Average memory usage (MiB) | CPU usage
-- | -- | -- | --
hkgcollectora.ooni.nu | 347299076.5 | 331.2102094 | 2.06440678
miacollector.ooni.nu | 765043157.8 | 729.6020105 | 1.916949153
c.collector.ooni.io | 410075891 | 391.078845 | 3.445762712
hkgbouncer.ooni.nu | 384875954.9 | 367.0463132 | 4.359646238
  |   |   |  
c.web-connectivity.th.ooni.io | 601676700.9 | 573.8036164 | 5.168965517
hkgwebconnectivitya.ooni.nu | 113963637.3 | 108.6841939 | 3.108474576
b.web-connectivity.th.ooni.io | 530451649.9 | 505.8781146 | 1.269491525
  |   |   |  
events.proteus.ooni.io | 198639528.5 | 189.4374166 | 0.9949152544
registry.proteus.ooni.io | 178610018.4 | 170.3357872 | 2.050847458
db-1.proteus.ooni.io | 225076679.8 | 214.6498488 | 1.867463957
proteus.ooni.io | 205543800.7 | 196.0218436 | 1.022033899
notify.proteus.ooni.io | 194867436.1 | 185.8400689 | 0.8169491526
  |   |   |  
Total |   | 3963.588268 | 28.08590622

And it seems like we are in pretty good shape.

The total memory usage of all the hosts listed in the table above is less than 4GB on average in a 24h window, and the CPU usage is pretty low as well.
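For anyone who wants to re-check the totals, the sums can be reproduced from the per-host figures in the table. The snippet assumes the second memory column is in MiB (bytes divided by 1048576, which matches the numbers) and that the 4GB figure means 4 GiB, i.e. 4096 MiB.

```python
# (host, average memory in MiB, CPU usage) copied from the table above.
hosts = [
    ("hkgcollectora.ooni.nu", 331.2102094, 2.06440678),
    ("miacollector.ooni.nu", 729.6020105, 1.916949153),
    ("c.collector.ooni.io", 391.078845, 3.445762712),
    ("hkgbouncer.ooni.nu", 367.0463132, 4.359646238),
    ("c.web-connectivity.th.ooni.io", 573.8036164, 5.168965517),
    ("hkgwebconnectivitya.ooni.nu", 108.6841939, 3.108474576),
    ("b.web-connectivity.th.ooni.io", 505.8781146, 1.269491525),
    ("events.proteus.ooni.io", 189.4374166, 0.9949152544),
    ("registry.proteus.ooni.io", 170.3357872, 2.050847458),
    ("db-1.proteus.ooni.io", 214.6498488, 1.867463957),
    ("proteus.ooni.io", 196.0218436, 1.022033899),
    ("notify.proteus.ooni.io", 185.8400689, 0.8169491526),
]

total_mem_mib = sum(mem for _, mem, _ in hosts)
total_cpu = sum(cpu for _, _, cpu in hosts)

# Assumption: "4GB" is read as 4 GiB = 4096 MiB.
fits_in_4gib = total_mem_mib < 4096

print(f"memory: {total_mem_mib:.2f} MiB, cpu: {total_cpu:.2f}, fits: {fits_in_4gib}")
```

These are averages, so a combined host would still need headroom for peaks.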

@FedericoCeratto @bassosimone what do you think, did I miss something?

SuperQ commented 5 years ago

Something like Consul can be used to register available endpoints and do simple health probes. From there it can dynamically give DNS results based on what's currently healthy.
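The idea, independent of Consul itself, can be sketched as: probe each endpoint, and only hand out the ones that pass. The code below is not Consul's API; `tcp_probe` and `healthy_endpoints` are hypothetical helpers, and the demo uses a simulated probe so that it does not touch the network. A plain TCP connect only shows that the port is open, so a real check would probably want an application-level request as well.

```python
import socket
from typing import Callable, List


def tcp_probe(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_endpoints(
    endpoints: List[str], probe: Callable[[str], bool] = tcp_probe
) -> List[str]:
    """Keep only the endpoints for which the probe succeeds, preserving order."""
    return [host for host in endpoints if probe(host)]


# Simulated run: pretend the Hong Kong host is down.
down = {"hkg-ps.ooni.io"}
answer = healthy_endpoints(
    ["hkg-ps.ooni.io", "ams-ps.ooni.io", "mia-ps.ooni.io"],
    probe=lambda host: host not in down,
)
print(answer)
```

A DNS frontend would then answer only with the addresses in `answer`, re-running the probes on a short interval.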

hellais commented 5 years ago

As part of this I am exploring the various options that are available for moving our DNS some place else.

We are currently using the DNS provided to us by Namecheap, our registrar.

The feature set I am looking for when evaluating the options is:

- It MUST allow us to do some level of location-based routing to backend services (i.e. probes should be given the addresses that are closest to them)
- It MUST have some sort of API that allows us to easily sync our DNS zones and export them
- It MUST be reasonably priced, by some definition of reasonably
- It MUST work for users who are located in countries affected by US trade sanctions (e.g. Iran, Cuba, etc.)
- It SHOULD be a hosted service
- It SHOULD allow us to re-route clients in case some bits of infrastructure go down (similar to what @SuperQ was suggesting with Consul)
- It SHOULD support ANYCAST DNS
- It SHOULD be possible for us to migrate to some other service without too much effort if we desire to do so (no vendor lock-in)
- It SHOULD support the Let's Encrypt DNS-01 challenge type (see: https://letsencrypt.org/docs/challenge-types/ & https://community.letsencrypt.org/t/dns-providers-who-easily-integrate-with-lets-encrypt-dns-validation/86438)

@FedericoCeratto @bassosimone am I missing something?

hellais commented 5 years ago

Summarising what we discussed with @FedericoCeratto:

We should keep the option of anycast on the table and test it out on the eclips.is platform to see how well it works. The two things to keep in mind are:
- How we can improve the upload speed and latency for users
- How we can do service monitoring to ensure graceful failover if:
  - The service goes down or hangs but Nginx is still up
  - The host is unreachable, crashed, or unable to set up TCP connections
  - A DC is unreachable by probes due to network issues or blocking

I shall adopt a couple of different strategies, and we should measure how well they work by adding some logging to nginx.

Options to be evaluated are:

GeoIP based load balancing using route53 (or similar)
anycast IPs on eclips.is

The idea is to roll it out by making changes to the existing DNS configuration or the bouncer to redirect a small portion of clients to an experimental GeoIP/anycast setup and measure how much we are able to improve the performance of them.
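One way to send a small, stable portion of clients to the experimental setup is to hash a client identifier into a bucket and compare it against the desired percentage. This is only a sketch: keying on the client IP, the `salt` value and the function name are assumptions, not something the bouncer does today.

```python
import hashlib


def in_experiment(client_id: str, percent: float, salt: str = "ps-experiment-1") -> bool:
    """Deterministically place roughly `percent`% of clients in the experiment.

    client_id is any stable identifier (assumed here to be the client IP).
    The same client always gets the same answer for a given salt.
    """
    digest = hashlib.sha256(f"{salt}:{client_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < percent / 100.0


# Rough check of the split over synthetic client identifiers.
sample = [f"client-{i}" for i in range(10000)]
selected = sum(in_experiment(c, 5) for c in sample)
print(f"{selected} of {len(sample)} clients selected")
```

Because the assignment is deterministic, a client stays in the same group across requests, which keeps the before/after comparison in the nginx logs clean. Changing the salt reshuffles the groups for a new experiment.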

bassosimone commented 5 years ago

@hellais, here are my thoughts:

> This should also lead to a lowered RTT [...]

I believe this should be a measurable goal (see below). Another performance enhancing aspect is that we could reuse the same connection for all transactions.

> The kind of feature set I am looking for when evaluating some options are [...]

This list looks good to me!

> How we can improve the upload speed and latency for users [...]

The code at github.com/ooni/probe-engine could collect stats (in HAR format possibly). I believe we need to submit some telemetry, so we see what probes see.

> The idea is to roll it out by making changes to the existing DNS configuration or the bouncer to redirect a small portion of clients to an experimental GeoIP/anycast setup and measure how much we are able to improve the performance of them.

I agree with this strategy. We can probably already start saying something from server side logs, but there is a bunch of client-side metrics it would be useful to have.

hellais commented 5 years ago

Yes, I agree that we can get richer information by doing client-side modifications, though if we can do it with server-side-only changes I think that's preferable, as it will allow us to collect metrics from existing clients starting today.

hellais commented 5 years ago

I did some experimentation with an anycast setup on eclips.is.

See: https://gist.github.com/hellais/96a704d3dc5ed34e7814a50f175469b7 & https://github.com/ooni/sysadmin/pull/375/commits/bbe44639c4f01b13684a9b26699c2b72b9bac55d

The result of these experiments is that anycast doesn't work as well as expected.

There are some pretty bad cases in which a host far from the ideal one is picked (actually, the worst one is picked). See:

exit: 199.195.250.77 (US)
anycast domain: hkg-ps.ooni.nu
picked the WRONG host by rtt
picked the WRONG host by hop count
mia-ps.ooni.nu: 13 hops 32.23 ms
ams-ps.ooni.nu: 11 hops 79.54 ms
hkg-ps.ooni.nu: 17 hops 213.78 ms
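The "WRONG host" verdict in the output above boils down to comparing the host anycast picked against the best host by each metric. The sketch below redoes that comparison on the three measurements shown. It is not the script from the gist, and the variable names are illustrative.

```python
# Measurements from the US exit shown above.
measurements = {
    "mia-ps.ooni.nu": {"hops": 13, "rtt_ms": 32.23},
    "ams-ps.ooni.nu": {"hops": 11, "rtt_ms": 79.54},
    "hkg-ps.ooni.nu": {"hops": 17, "rtt_ms": 213.78},
}
picked = "hkg-ps.ooni.nu"  # the host the anycast address resolved to

best_by_rtt = min(measurements, key=lambda h: measurements[h]["rtt_ms"])
best_by_hops = min(measurements, key=lambda h: measurements[h]["hops"])
worst_by_rtt = max(measurements, key=lambda h: measurements[h]["rtt_ms"])

# How much slower the picked host is than the best available one.
rtt_penalty_ms = measurements[picked]["rtt_ms"] - measurements[best_by_rtt]["rtt_ms"]

print("best by rtt:", best_by_rtt, "| best by hops:", best_by_hops)
print("picked:", picked, "| wrong by rtt:", picked != best_by_rtt,
      "| wrong by hops:", picked != best_by_hops)
print(f"extra rtt: {rtt_penalty_ms:.2f} ms")
```

In this sample the picked host is the worst on both metrics, costing roughly 180 ms of extra RTT compared to Miami.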

In the tests linked in the above document we don't see mia-ps.ooni.nu ever getting picked, though @FedericoCeratto was able to get it picked when using https://lg.telia.net/?type=bgp&router=nyk-b2&address=37.218.244.15 & https://lg.telia.net/?type=trace&router=nyk-b2&address=37.218.244.15.

Based on these findings, it's unclear whether we want to go ahead with the anycast configuration, as it may make things worse rather than better. We are probably better off looking into GeoIP-based load balancing at the DNS level with something like route53.

hellais commented 5 years ago

We have decided in the end to proceed with deploying the oonified hosts in 3 different locations, but for the time being only keep ams active.

This should already bring several benefits: we only have to worry about a single host (which is simpler and easier to reason about), and we can make tweaks in the client to gain performance and resilience (e.g. re-using connections with keep-alive).

Before considering more complicated or sophisticated options, we are going to collect and analyse more data to better quantify the eventual performance and availability gains from each option.

We have deployed to all these hosts a Prometheus metric which collects client TCP RTT, see: https://github.com/ooni/sysadmin/blob/master/ansible/roles/node_exporter/files/tcpmetrics.py.

As next steps, we are going to carefully and incrementally roll over all the current production collectors to the new host, see: https://github.com/ooni/sysadmin/issues/386.

Improve availability of our infrastructure aka "oonification of probe services" #359