ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers

Excessive load in oohelperd incident - December 21, 2023 #784

Closed: jbonisteel closed this issue 5 months ago

jbonisteel commented 6 months ago

Impact:

Detection:

Timeline:

What is still unclear:

Next steps:

Some notes from @FedericoCeratto:

Some notes from @bassosimone:

Lanius-collaris commented 6 months ago

Many measurements in China failed. Does Web Connectivity v0.5 consume more resources than Web Connectivity v0.4? For example: https://explorer.ooni.org/m/20231223030445.527152_CN_webconnectivity_bd81b5f4f92bba44

bassosimone commented 5 months ago

FTR, the issue disappeared on January 26, 2024. It seems someone was doing preliminary integration of a miniooni fleet deployment with several synchronized probes and --no-collector, and they then figured out that it was better to desynchronize the probes to avoid creating thundering-herd issues.
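For context, desynchronizing a fleet essentially means adding jitter to when each probe starts measuring. A minimal sketch of the idea in Go follows; runMeasurementCycle and the 30-minute window are hypothetical placeholders, not miniooni's actual interface:

```go
package main

import (
	"math/rand"
	"time"
)

// runMeasurementCycle is a hypothetical stand-in for whatever a fleet
// orchestrator does to start one miniooni run with --no-collector.
func runMeasurementCycle() {
	// ... invoke miniooni here ...
}

func main() {
	const window = 30 * time.Minute // hypothetical scheduling window
	for {
		// Sleep a random fraction of the window so probes started at the
		// same wall-clock time do not all hit oohelperd at once.
		jitter := time.Duration(rand.Int63n(int64(window)))
		time.Sleep(jitter)
		runMeasurementCycle()
		// Wait out the remainder of the window before the next cycle.
		time.Sleep(window - jitter)
	}
}
```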

bassosimone commented 5 months ago

What we should consider doing in the future:

  1. deploy the test helpers on hosts with more RAM and CPU capacity (mostly CPU);
  2. figure out whether there's a way to scale out to more hosts more easily, with less manual intervention;
  3. evaluate whether restructuring oohelperd to cache measurements for individual endpoints for longer would help us reduce the load without causing more headaches than benefits (see the sketch after this list);
  4. query ooni/data to get information about recent measurements rather than performing measurements;
  5. schedule measurements periodically rather than performing them on demand (which would possibly reduce the load, especially if we use a single table to keep test-helper information);
  6. use special probes to collect test-helper information rather than having test helpers;
  7. think about whether we could reduce the CPU usage caused by the handshakes by making sure that we're using assembly instructions as much as possible (probably already the case, but it makes sense to double check);
  8. see if it's possible to avoid one handshake by reusing an already-handshaked connection for the first HTTP request;
  9. just deploy two more test helpers (simple solution FTW to avoid too many faulty measurements).
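Regarding point 3, here is a rough sketch of what a per-endpoint cache could look like; the types, the TTL, and getOrMeasure are illustrative assumptions, not oohelperd's actual internals:

```go
package main

import (
	"sync"
	"time"
)

// endpointResult is a placeholder for whatever the test helper computes
// for a single endpoint (TCP connect + TLS handshake outcome, etc.).
type endpointResult struct {
	Failure *string
}

type cacheEntry struct {
	result  endpointResult
	expires time.Time
}

// endpointCache remembers recent per-endpoint results so that many probes
// asking about the same endpoint do not each trigger a new handshake.
type endpointCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]cacheEntry
}

func newEndpointCache(ttl time.Duration) *endpointCache {
	return &endpointCache{ttl: ttl, m: make(map[string]cacheEntry)}
}

// getOrMeasure returns a cached result for an endpoint such as
// "93.184.216.34:443" if it is still fresh, otherwise it calls measure
// and caches the new result.
func (c *endpointCache) getOrMeasure(
	epnt string, measure func(string) endpointResult) endpointResult {
	c.mu.Lock()
	entry, ok := c.m[epnt]
	c.mu.Unlock()
	if ok && time.Now().Before(entry.expires) {
		return entry.result
	}
	res := measure(epnt)
	c.mu.Lock()
	c.m[epnt] = cacheEntry{result: res, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return res
}

func main() {
	cache := newEndpointCache(5 * time.Minute) // TTL is a tunable trade-off
	res := cache.getOrMeasure("93.184.216.34:443", func(epnt string) endpointResult {
		// ... the real code would perform the TCP/TLS measurement here ...
		return endpointResult{Failure: nil}
	})
	_ = res
}
```

The TTL is the trade-off knob: a longer TTL reduces load but increases the chance of serving stale results for endpoints whose reachability changes quickly.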

If other potential future activities come to mind, I will update this list.

bassosimone commented 5 months ago

> Many measurements in China failed. Does Web Connectivity v0.5 consume more resources than Web Connectivity v0.4? For example: https://explorer.ooni.org/m/20231223030445.527152_CN_webconnectivity_bd81b5f4f92bba44

@Lanius-collaris, Web Connectivity v0.5 consumes more resources on the probe side. Regarding the test helper side, I think it may use a bit more resources if we discover more IP addresses with v0.5 than with v0.4, which may happen as a side effect of using three resolvers (the system resolver, DNS-over-UDP, and, roughly every 30 seconds, DNS-over-HTTPS, to passively collect information about DNS-over-HTTPS functionality).

The reason why you see 503 errors is that we added extra code that refuses service when there's already a significant queue of clients in the test helper. The result is that those measurements are marked as failed. However, we're rewriting the data processing pipeline to basically ignore the test helper and recompute whether there was blocking. When this process converges, we're going to have correct results in Explorer also for measurements that are marked as failed. The keyword for this data pipeline rewrite is ooni/data and the repository is https://github.com/ooni/data.
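The load-shedding logic is conceptually along these lines (a simplified sketch, not the actual oohelperd handler; maxInflight and the route are made up):

```go
package main

import (
	"log"
	"net/http"
)

// maxInflight bounds how many measurement requests we serve concurrently;
// the real limit would be a deployment-specific tunable.
const maxInflight = 64

var sem = make(chan struct{}, maxInflight)

// handleMeasure wraps the measurement handler with load shedding: when too
// many requests are already in flight, reply 503 immediately so the probe
// fails fast instead of timing out.
func handleMeasure(w http.ResponseWriter, r *http.Request) {
	select {
	case sem <- struct{}{}:
		defer func() { <-sem }()
	default:
		http.Error(w, "service unavailable", http.StatusServiceUnavailable)
		return
	}
	// ... perform the actual web connectivity checks here ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handleMeasure)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```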

After switching to ooni/data, we would still see measurements marked as failed locally by the probe when the test helper returns 503, therefore results on a mobile phone or in the CLI would be less useful than they could be. To solve this problem, I think the correct backward-compatible approach would be to use ooni/data information to populate the TH response when possible, thus reducing the need to measure several websites in the TH. Another approach would instead be to add more test helpers; the bottleneck there is that our deployment code is not designed for that, so we would need to put in some development work and some backend developer time.

On our end, looking at the test helper metrics (screenshot below), the problem still seems to be present, but it is much less impactful than when we opened the issue:

[screenshot: test helper metrics, 2024-02-08 17:51]

Are you still affected?

Thank you!

bassosimone commented 5 months ago

Upon reflection, I think the next step should be this one: https://github.com/ooni/probe/issues/2672. Because it's relatively simple, it should be possible to do it soon. With more test helpers we would reduce the overall load and have fewer measurements marked as faulty in the backend and in the probe. In the meanwhile, this buys us more time to implement smarter solutions that fix the problem itself.