jbonisteel closed this issue 5 months ago
Many measurements in China failed, does Web Connectivity v0.5 consume more resource than Web Connectivity v0.4? For example: https://explorer.ooni.org/m/20231223030445.527152_CN_webconnectivity_bd81b5f4f92bba44
FTR, the issue disappeared on January 26, 2024. It seems someone was doing a preliminary integration of a miniooni fleet deployment with several synchronized probes and --no-collector, and then figured out that it was better to desynchronize the probes to avoid creating thundering herd issues.
What we should consider doing in the future:
- check whether having oohelperd cache measurements for individual endpoints for more time would help us reduce the load without causing more headaches than benefits;
- use ooni/data to get information about recent measurements rather than performing measurements.

If other potential future activities come to mind, I will update this list.
@Lanius-collaris, Web Connectivity v0.5 consumes more resources on the probe side. Regarding the test helper side, I think it may be using a bit more resources if we discover more IP addresses with v0.5 than with v0.4, which may happen as a side effect of using three resolvers (system, DNS-over-UDP, and DNS-over-HTTPS every ~30 seconds to passively collect information about DNS-over-HTTPS functionality).
The reason you see 503 errors is that we added extra code that refuses service if there is already a significant queue of clients in the test helper. As a result, those measurements are marked as failed. However, we're rewriting the data processing pipeline to basically ignore the test helper and recompute whether there was blocking. When this process converges, we're going to have correct results in Explorer also for measurements that are marked as failed. The key word for this data pipeline rewrite is ooni/data and the repository is https://github.com/ooni/data.
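For readers unfamiliar with this kind of load shedding, the following sketch shows one common way to refuse requests with a 503 once too many are in progress. It is an assumption-laden illustration, not the test helper's actual implementation: the in-flight limit, the handler names, and the "/" path are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// limitInFlight wraps next and replies 503 when maxInFlight requests
// are already being served (hypothetical limit, for illustration only).
func limitInFlight(maxInFlight int, next http.Handler) http.Handler {
	slots := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // a slot is free: serve the request
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default: // no free slot: shed load instead of queueing
			http.Error(w, "too many requests in flight", http.StatusServiceUnavailable)
		}
	})
}

// demoOverload sends a second request while the first one is still running
// and returns both status codes.
func demoOverload() (int, int) {
	started := make(chan struct{})
	release := make(chan struct{})
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started) // signal that the first request holds the only slot
		<-release
		w.WriteHeader(http.StatusOK)
	})
	h := limitInFlight(1, slow)

	first := httptest.NewRecorder()
	done := make(chan struct{})
	go func() {
		h.ServeHTTP(first, httptest.NewRequest("POST", "/", nil))
		close(done)
	}()
	<-started

	second := httptest.NewRecorder()
	h.ServeHTTP(second, httptest.NewRequest("POST", "/", nil))

	close(release)
	<-done
	return first.Code, second.Code
}

func main() {
	a, b := demoOverload()
	fmt.Println(a, b) // prints "200 503"
}
```

From the probe's point of view, the second request is the case discussed here: the helper is healthy but busy, and the measurement ends up marked as failed.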
After switching to ooni/data, we would still see measurements marked as failed locally by the probe if the test helper returns 503, so results on a mobile phone or in the CLI would be less useful than they could be. To solve this problem, I think the correct backward-compatible approach would be to use ooni/data information to populate the TH response when possible, thus reducing the need to measure several websites in the TH. Another approach would be to add more test helpers; the bottleneck there is that our deployment code is not designed for that, so we would need to put in some development work and some backend developer time.
On our end in terms of test helper metrics, as you can see below, the problem seems to still be present but it seems much less impactful than it was when we opened the issue:
Are you still affected?
Thank you!
Upon reflection, I think the next step should be this one: https://github.com/ooni/probe/issues/2672. Because it's relatively simple, it should be possible to do it soon. With more test helpers we would reduce the overall load and have fewer measurements marked as failed in the backend and the probe. In the meantime, this buys us more time to implement smarter solutions to fix the problem itself.
Impact:
Detection:
Timeline:
What is still unclear:
Next steps:
Some notes from @FedericoCeratto:
Some notes from @bassosimone: