ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers

Excessive load in oohelperd incident - December 21, 2023 #784

Closed: jbonisteel closed this issue 5 months ago

jbonisteel commented 6 months ago

Impact:

Detection:

Timeline:

What is still unclear:

Next steps:

Some notes from @FedericoCeratto:

Some notes from @bassosimone:

Lanius-collaris commented 6 months ago

Many measurements in China failed. Does Web Connectivity v0.5 consume more resources than Web Connectivity v0.4? For example: https://explorer.ooni.org/m/20231223030445.527152_CN_webconnectivity_bd81b5f4f92bba44

bassosimone commented 5 months ago

FTR, the issue disappeared on January 26, 2024. It seems someone was doing preliminary integration of a miniooni fleet deployment with several synchronized probes and --no-collector, and they then figured out that it was better to desynchronize the probes to avoid creating thundering-herd issues.
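For context, desynchronizing a fleet essentially means adding jitter to when each probe starts measuring. A minimal sketch of the idea in Go follows; runMeasurementCycle and the 30-minute window are hypothetical placeholders, not miniooni's actual interface:

```go
package main

import (
	"math/rand"
	"time"
)

// runMeasurementCycle is a hypothetical stand-in for whatever a fleet
// orchestrator does to start one miniooni run with --no-collector.
func runMeasurementCycle() {
	// ... invoke miniooni here ...
}

func main() {
	const window = 30 * time.Minute // hypothetical scheduling window
	for {
		// Sleep a random fraction of the window so probes started at the
		// same wall-clock time do not all hit oohelperd at once.
		jitter := time.Duration(rand.Int63n(int64(window)))
		time.Sleep(jitter)
		runMeasurementCycle()
		// Wait out the remainder of the window before the next cycle.
		time.Sleep(window - jitter)
	}
}
```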

bassosimone commented 5 months ago

What we should consider doing in the future:

  1. deploy the test helpers on hosts with more RAM and CPU capacity (mostly CPU);
  2. figure out whether there's a way to scale out to more hosts more easily, with less manual intervention;
  3. evaluate whether restructuring oohelperd to cache measurements for individual endpoints for longer would help us reduce the load without causing more headaches than benefits (see the sketch after this list);
  4. query ooni/data to get information about recent measurements rather than performing measurements;
  5. schedule measurements periodically rather than performing them on demand (which would possibly reduce the load, especially if we use a single table to keep test-helper information);
  6. use special probes to collect test-helper information rather than having test helpers;
  7. think about whether we could reduce the CPU usage caused by the handshakes by making sure that we're using assembly instructions as much as possible (probably already the case, but it makes sense to double check);
  8. see if it's possible to avoid one handshake by reusing an already-handshaked connection for the first HTTP request;
  9. just deploy two more test helpers (simple solution FTW to avoid too many faulty measurements).
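Regarding point 3, here is a rough sketch of what a per-endpoint cache could look like; the types, the TTL, and getOrMeasure are illustrative assumptions, not oohelperd's actual internals:

```go
package main

import (
	"sync"
	"time"
)

// endpointResult is a placeholder for whatever the test helper computes
// for a single endpoint (TCP connect + TLS handshake outcome, etc.).
type endpointResult struct {
	Failure *string
}

type cacheEntry struct {
	result  endpointResult
	expires time.Time
}

// endpointCache remembers recent per-endpoint results so that many probes
// asking about the same endpoint do not each trigger a new handshake.
type endpointCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]cacheEntry
}

func newEndpointCache(ttl time.Duration) *endpointCache {
	return &endpointCache{ttl: ttl, m: make(map[string]cacheEntry)}
}

// getOrMeasure returns a cached result for an endpoint such as
// "93.184.216.34:443" if it is still fresh, otherwise it calls measure
// and caches the new result.
func (c *endpointCache) getOrMeasure(
	epnt string, measure func(string) endpointResult) endpointResult {
	c.mu.Lock()
	entry, ok := c.m[epnt]
	c.mu.Unlock()
	if ok && time.Now().Before(entry.expires) {
		return entry.result
	}
	res := measure(epnt)
	c.mu.Lock()
	c.m[epnt] = cacheEntry{result: res, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return res
}

func main() {
	cache := newEndpointCache(5 * time.Minute) // TTL is a tunable trade-off
	res := cache.getOrMeasure("93.184.216.34:443", func(epnt string) endpointResult {
		// ... the real code would perform the TCP/TLS measurement here ...
		return endpointResult{Failure: nil}
	})
	_ = res
}
```

The TTL is the trade-off knob: a longer TTL reduces load but increases the chance of serving stale results for endpoints whose reachability changes quickly.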

If other potential future activities come to mind, I will update this list.

bassosimone commented 5 months ago

> Many measurements in China failed. Does Web Connectivity v0.5 consume more resources than Web Connectivity v0.4? For example: https://explorer.ooni.org/m/20231223030445.527152_CN_webconnectivity_bd81b5f4f92bba44

@Lanius-collaris, Web Connectivity v0.5 consumes more resources on the probe side. Regarding the test helper side, I think it may use a bit more resources if we discover more IP addresses with v0.5 than with v0.4, which may happen as a side effect of using three resolvers (the system resolver, DNS-over-UDP, and, roughly every 30 seconds, DNS-over-HTTPS, to passively collect information about DNS-over-HTTPS functionality).

The reason why you see 503 errors is that we added extra code that refuses service when there's already a significant queue of clients in the test helper. The result is that those measurements are marked as failed. However, we're rewriting the data processing pipeline to basically ignore the test helper and recompute whether there was blocking. When this process converges, we're going to have correct results in Explorer also for measurements that are marked as failed. The keyword for this data pipeline rewrite is ooni/data and the repository is https://github.com/ooni/data.
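The load-shedding logic is conceptually along these lines (a simplified sketch, not the actual oohelperd handler; maxInflight and the route are made up):

```go
package main

import (
	"log"
	"net/http"
)

// maxInflight bounds how many measurement requests we serve concurrently;
// the real limit would be a deployment-specific tunable.
const maxInflight = 64

var sem = make(chan struct{}, maxInflight)

// handleMeasure wraps the measurement handler with load shedding: when too
// many requests are already in flight, reply 503 immediately so the probe
// fails fast instead of timing out.
func handleMeasure(w http.ResponseWriter, r *http.Request) {
	select {
	case sem <- struct{}{}:
		defer func() { <-sem }()
	default:
		http.Error(w, "service unavailable", http.StatusServiceUnavailable)
		return
	}
	// ... perform the actual web connectivity checks here ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handleMeasure)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```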

After switching to ooni/data, we would still see measurements marked as failed locally by the probe when the test helper returns 503, therefore results on a mobile phone or in the CLI would be less useful than they could be. To solve this problem, I think the correct backward-compatible approach would be to use ooni/data information to populate the TH response when possible, thus reducing the need to measure several websites in the TH. Another approach would instead be to add more test helpers; the bottleneck there is that our deployment code is not designed for that, so we would need to put in some development work and some backend developer time.

On our end, looking at the test helper metrics (screenshot below), the problem still seems to be present, but it is much less impactful than when we opened the issue:

[screenshot: test helper metrics, 2024-02-08 17:51]

Are you still affected?

Thank you!

bassosimone commented 5 months ago

Upon reflection, I think the next step should be this one: https://github.com/ooni/probe/issues/2672. Because it's relatively simple, it should be possible to do it soon. With more test helpers we would reduce the overall load and have fewer measurements marked as faulty in the backend and in the probe. In the meanwhile, this buys us more time to implement smarter solutions that fix the problem itself.