Something like Consul can be used to register available endpoints and do simple health probes. From there it can dynamically give DNS results based on what's currently healthy.
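For illustration only, here is a minimal sketch (not something we run) of how a PS host could register itself with a local Consul agent and attach a simple HTTP health probe, using the Go Consul API client. The service name, ID and check URL below are made up:

```go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (assumes one runs on each PS host).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this host as a "probe-services" instance with an HTTP health
	// check; name, ID and check URL are illustrative placeholders.
	reg := &consul.AgentServiceRegistration{
		ID:   "probe-services-ams",
		Name: "probe-services",
		Port: 443,
		Check: &consul.AgentServiceCheck{
			HTTP:     "https://ams-ps.ooni.io/bouncer", // hypothetical endpoint
			Interval: "30s",
			Timeout:  "5s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
	// Healthy instances would then be resolvable via Consul DNS
	// (e.g. probe-services.service.consul), which only returns passing nodes.
}
```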
As part of this I am exploring the various options available for moving our DNS somewhere else.
We are currently using the DNS service provided by Namecheap, our registrar.
The kind of feature set I am looking for when evaluating the options is:
@FedericoCeratto @bassosimone am I missing something?
Summarising what we discussed with @FedericoCeratto:
We should keep the option of anycast on the table and test it out on the eclips.is platform to see how well it works. The two things to keep in mind are:
I shall adopt a couple of different strategies, and we should measure how well they work by adding some logging to nginx.
Options to be evaluated are:
The idea is to roll it out by making changes to the existing DNS configuration or the bouncer to redirect a small portion of clients to an experimental GeoIP/anycast setup and measure how much we are able to improve their performance.
@hellais, here are my thoughts:
This should also lead to a lowered RTT [...]
I believe this should be a measurable goal (see below). Another performance-enhancing aspect is that we could reuse the same connection for all transactions.
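To make the connection-reuse point concrete, here is a minimal sketch of a probe-side HTTP client that keeps one connection alive across transactions; the host name and paths are placeholders, not the real endpoints:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// A single shared client: its Transport keeps idle connections alive, so
// successive transactions against the same probe-services host can reuse
// one TCP/TLS connection instead of paying a handshake each time.
var client = &http.Client{
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 2,
		IdleConnTimeout:     90 * time.Second,
	},
	Timeout: 30 * time.Second,
}

func fetch(url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Fully draining the body is what lets the connection return to the
	// idle pool and be reused by the next request.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	// ams-ps.ooni.io is the proposed entry point; the paths are placeholders.
	for _, path := range []string{"/bouncer", "/report"} {
		if err := fetch("https://ams-ps.ooni.io" + path); err != nil {
			log.Println(err)
		}
	}
}
```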
The kind of feature set I am looking for when evaluating the options is [...]
This list looks good to me!
How we can improve the upload speed and latency for users [...]
The code at github.com/ooni/probe-engine could collect stats (possibly in HAR format). I believe we need to submit some telemetry, so that we see what probes see.
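This is not how probe-engine does it today, but as a sketch of the kind of client-side timings (DNS, connect, TLS, time-to-first-byte) we could collect and submit as telemetry, something along these lines using net/http/httptrace would work; the struct and the endpoint are made up:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

// timings holds the kind of per-request client-side metrics that could be
// collected and submitted as telemetry (field names are made up here).
type timings struct {
	DNS, Connect, TLS, TTFB time.Duration
}

func measure(url string) (*timings, error) {
	t := &timings{}
	var dnsStart, connStart, tlsStart, start time.Time

	// Hook into the request lifecycle to record phase durations.
	trace := &httptrace.ClientTrace{
		DNSStart:             func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone:              func(httptrace.DNSDoneInfo) { t.DNS = time.Since(dnsStart) },
		ConnectStart:         func(network, addr string) { connStart = time.Now() },
		ConnectDone:          func(network, addr string, err error) { t.Connect = time.Since(connStart) },
		TLSHandshakeStart:    func() { tlsStart = time.Now() },
		TLSHandshakeDone:     func(tls.ConnectionState, error) { t.TLS = time.Since(tlsStart) },
		GotFirstResponseByte: func() { t.TTFB = time.Since(start) },
	}

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start = time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	resp.Body.Close()
	return t, nil
}

func main() {
	t, err := measure("https://ams-ps.ooni.io/bouncer") // placeholder endpoint
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", t)
}
```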
The idea is to roll it out by making changes to the existing DNS configuration or the bouncer to redirect a small portion of clients to an experimental GeoIP/anycast setup and measure how much we are able to improve their performance.
I agree with this strategy. We can probably already learn something from server-side logs, but there is a bunch of client-side metrics it would be useful to have.
Yes, I agree that we can get richer information by making client-side modifications, though if we can do it with server-side-only changes I think that's preferable, as it will allow us to collect metrics from existing clients starting today.
I did some experimentation with an anycast setup on eclips.is.
See: https://gist.github.com/hellais/96a704d3dc5ed34e7814a50f175469b7 & https://github.com/ooni/sysadmin/pull/375/commits/bbe44639c4f01b13684a9b26699c2b72b9bac55d
The result of these experiments is that it doesn't work as well as expected.
There are some pretty bad cases in which a host far from the ideal one is picked (actually the worst one is picked), see:
exit: 199.195.250.77 (US)
anycast domain: hkg-ps.ooni.nu
picked the WRONG host by rtt
picked the WRONG host by hop count
mia-ps.ooni.nu: 13 hops 32.23 ms
ams-ps.ooni.nu: 11 hops 79.54 ms
hkg-ps.ooni.nu: 17 hops 213.78 ms
In the tests linked in the above document we don't see mia-ps.ooni.nu
ever getting picked, though @FedericoCeratto was able to get it picked when using https://lg.telia.net/?type=bgp&router=nyk-b2&address=37.218.244.15 & https://lg.telia.net/?type=trace&router=nyk-b2&address=37.218.244.15.
Based on these findings it's unclear whether we want to go ahead with the anycast configuration, as it may make things worse rather than better; we are probably better off looking into GeoIP-based load balancing at the DNS level with something like Route 53.
In the end we have decided to proceed with deploying the oonified hosts in 3 different locations, but for the time being to keep only ams active.
This should already bring several benefits, as it means we only have to worry about a single host (which is simpler and easier to reason about), and it also allows us to make tweaks in the client to gain performance and resilience (e.g. re-using connections with keep-alive).
Before considering more complicated or sophisticated options we are going to collect and analyse more data to better quantify the potential performance and availability gains from each option.
We have deployed on all these hosts a Prometheus metrics exporter which collects client TCP RTT, see: https://github.com/ooni/sysadmin/blob/master/ansible/roles/node_exporter/files/tcpmetrics.py.
As next steps we are going to carefully and incrementally roll over all the current production collectors to the new host, see: https://github.com/ooni/sysadmin/issues/386.
We have been doing a bit of research into how we can improve the availability of the OONI probe services (PS) infrastructure (bouncer, collector, registry, orchestrate, test-helpers).
In this issue I will document where we are at and what the next steps to roll this out should be.
The core problem is:
In order to address this we have come to the conclusion that it's best to drastically simplify our infrastructure and run all the services required by probes (bouncer, collector, registry, orchestrate, test-helpers) on a single VM, one for each of the supported locations: Miami (MIA), Amsterdam (AMS), Hong Kong (HKG).
The probe logic should also be adjusted so that, when attempting to communicate with a probe service, it tries in order: ANYCAST DNS, DNS, IP ADDRESS, and then, as a last resort, some form of circumvention tool.
This should also lead to a lowered RTT, making test runs faster, and to increased reliability.
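To make the fallback order described above concrete, here is a rough sketch of the probe-side logic; the anycast name, the hard-coded IP, the health-check path and the circumvention hook are all placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Candidate ways of reaching the probe services, in the order the probe
// should try them: anycast DNS, per-location DNS, a hard-coded IP address.
// All values are placeholders; note that TLS against a bare IP requires a
// certificate that covers that IP.
var candidates = []string{
	"https://ps.ooni.io",     // anycast DNS name (hypothetical)
	"https://ams-ps.ooni.io", // per-location DNS name
	"https://198.51.100.1",   // hard-coded IP address (documentation address)
}

func reachable(baseURL string) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(baseURL + "/bouncer") // placeholder health endpoint
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode < 500
}

func pickEndpoint() (string, error) {
	for _, c := range candidates {
		if reachable(c) {
			return c, nil
		}
	}
	// Last resort: fall back to some circumvention transport (e.g. a tunnel);
	// deliberately left unimplemented in this sketch.
	return "", fmt.Errorf("no probe-services endpoint reachable")
}

func main() {
	endpoint, err := pickEndpoint()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using", endpoint)
}
```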
In practice we need to look at all of the following services and find a way to make them work nicely on a single host:
In front of these we will place an nginx reverse proxy that maps all requests to the required backends. This requires looking at each of these components and ensuring that there are no conflicts between their API endpoints.
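The deployment itself will use nginx, but just to illustrate the path-prefix routing concern, here is a minimal equivalent sketch in Go; the prefixes and backend ports are made up and not the real layout:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyTo returns a handler forwarding requests to a local backend.
func proxyTo(backend string) http.Handler {
	u, err := url.Parse(backend)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	mux := http.NewServeMux()
	// Each probe-facing service gets a distinct, non-conflicting prefix;
	// these mappings are illustrative only.
	mux.Handle("/bouncer", proxyTo("http://127.0.0.1:8001"))
	mux.Handle("/report/", proxyTo("http://127.0.0.1:8002"))           // collector
	mux.Handle("/api/v1/", proxyTo("http://127.0.0.1:8003"))           // registry/orchestrate
	mux.Handle("/web-connectivity/", proxyTo("http://127.0.0.1:8004")) // test helper
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```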
For naming I propose we call the single-host entry points:
hkg-ps.ooni.io
ams-ps.ooni.io
mia-ps.ooni.io
Where PS stands for Probe Services.
I did some back-of-the-envelope calculations to ensure the current VMs aren't using so many resources that it would be impossible to run them all on a single host:
And it seems like we are in pretty good shape.
The total memory usage of all of the following hosts is less than 4 GB on average over a 24h window, and the CPU usage is pretty low as well:
hkgcollectora.ooni.nu
miacollector.ooni.nu
c.collector.ooni.io
hkgbouncer.ooni.nu
c.web-connectivity.th.ooni.io
hkgwebconnectivitya.ooni.nu
b.web-connectivity.th.ooni.io
events.proteus.ooni.io
registry.proteus.ooni.io
db-1.proteus.ooni.io
proteus.ooni.io
notify.proteus.ooni.io
@FedericoCeratto @bassosimone what do you think, did I miss something?