ooni / data

OONI Data CLI and Pipeline v5
https://docs.ooni.org/data

Assess and improve performance of observation generation #37

Closed hellais closed 10 months ago

hellais commented 10 months ago

Summary of changes:

Prior to this refactor we were achieving a rate of 5000 measurements per second, but only when processing multiple bucket_dates at the same time.

This was because the parallelisation happened on a per-day-bucket basis. When processing a single day's worth of data (which is what happens in the daily batch), we were only reaching a speed of about 500 measurements per second.

As part of this PR I have added support for parallelising on a per archived file basis, which allows us to reach a performance of 7-8k measurements per second on a per day basis (which is actually faster than what we were doing before). The splitting into batches happens based on the size of the files so we should be getting a pretty consistent performance irrespective of the size of the daily batch.
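To illustrate the size-based splitting described above, here is a minimal sketch. It is not the code from this PR: `FileEntry`, `batch_by_size` and `max_batch_bytes` are hypothetical names, and the greedy strategy is only one plausible way to group archived files so that each worker gets a similar amount of data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one archived file in a bucket listing."""
    path: str
    size: int  # size in bytes


def batch_by_size(
    files: Iterable[FileEntry], max_batch_bytes: int
) -> List[List[FileEntry]]:
    """Greedily group files into batches of at most max_batch_bytes.

    A single file larger than max_batch_bytes ends up alone in its own
    batch, so no file is ever dropped or split.
    """
    batches: List[List[FileEntry]] = []
    current: List[FileEntry] = []
    current_size = 0
    for entry in files:
        # Close the current batch if adding this file would exceed the cap.
        if current and current_size + entry.size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += entry.size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Made-up file names and sizes, purely for illustration.
    listing = [
        FileEntry("a.tar.gz", 40),
        FileEntry("b.tar.gz", 70),
        FileEntry("c.tar.gz", 30),
        FileEntry("d.tar.gz", 100),
        FileEntry("e.tar.gz", 10),
    ]
    for batch in batch_by_size(listing, max_batch_bytes=100):
        print([f.path for f in batch], sum(f.size for f in batch))
```

Each resulting batch could then be handed to a separate worker process. Because batches are bounded by bytes rather than by day, a single large daily bucket still fans out across many workers.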

This is the first step towards improving the performance of reprocessing of legacy buckets, where each daily bucket is too small to benefit from just splitting based on day (we ideally want to process batches of days together).

As part of this same PR I also implemented some changes to the schema of HIRL to make it more useful for analysis.

hellais commented 10 months ago

I got to the bottom of why the processing time for measurements in 2021 buckets was so slow. The root cause is https://github.com/ooni/backend/issues/763.

In the interim, the refactoring in this branch does achieve marginally better performance, but it is currently capped by the network: I am fully saturating the 1 Gbit link (see screenshot: Screenshot 2023-10-31 at 18 49 32).

There isn't much more I can do except get a machine with a faster network, or reprocess the postcans to compress them from a place that is closer to S3.
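As a rough way to reason about the network cap, the sketch below computes the ceiling a link imposes on measurement throughput. It assumes the link is 10^9 bits per second and that every measurement has to be downloaded in full. The average size per measurement is a made-up placeholder, not a number taken from the pipeline.

```python
def max_measurements_per_second(
    link_bits_per_second: float, avg_bytes_per_measurement: float
) -> float:
    """Upper bound on throughput when the network link is the bottleneck."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / avg_bytes_per_measurement


if __name__ == "__main__":
    ONE_GBIT = 1e9  # assumed link speed in bits per second
    # Placeholder average download size per measurement, for illustration only.
    HYPOTHETICAL_AVG_BYTES = 50_000
    ceiling = max_measurements_per_second(ONE_GBIT, HYPOTHETICAL_AVG_BYTES)
    print(f"network-bound ceiling: {ceiling:.0f} measurements/s")
```

Under this model, halving the bytes transferred per measurement (for example through better compression) doubles the ceiling, as does doubling the link speed.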

codecov-commenter commented 10 months ago

Codecov Report

Attention: 73 lines in your changes are missing coverage. Please review.

Comparison is base (410b5ae) 74.46% compared to head (fe698ec) 73.93%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #37      +/-   ##
==========================================
- Coverage   74.46%   73.93%   -0.54%
==========================================
  Files          70       70
  Lines        5883     5993     +110
==========================================
+ Hits         4381     4431      +50
- Misses       1502     1562      +60
```

| [Files](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni) | Coverage Δ | |
|---|---|---|
| [oonidata/cli/command.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvY2xpL2NvbW1hbmQucHk=) | `78.70% <ø> (-0.54%)` | :arrow_down: |
| [oonidata/models/observations.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvbW9kZWxzL29ic2VydmF0aW9ucy5weQ==) | `96.61% <100.00%> (+0.06%)` | :arrow_up: |
| [tests/test\_cli.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-dGVzdHMvdGVzdF9jbGkucHk=) | `95.12% <100.00%> (+0.67%)` | :arrow_up: |
| [oonidata/db/create\_tables.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvZGIvY3JlYXRlX3RhYmxlcy5weQ==) | `52.00% <75.00%> (+0.63%)` | :arrow_up: |
| [...a/transforms/nettests/http\_invalid\_request\_line.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvdHJhbnNmb3Jtcy9uZXR0ZXN0cy9odHRwX2ludmFsaWRfcmVxdWVzdF9saW5lLnB5) | `22.22% <0.00%> (-1.04%)` | :arrow_down: |
| [oonidata/dataclient.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvZGF0YWNsaWVudC5weQ==) | `86.41% <83.67%> (+0.39%)` | :arrow_up: |
| [oonidata/datautils.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvZGF0YXV0aWxzLnB5) | `71.50% <53.12%> (-3.65%)` | :arrow_down: |
| [oonidata/workers/observations.py](https://app.codecov.io/gh/ooni/data/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni#diff-b29uaWRhdGEvd29ya2Vycy9vYnNlcnZhdGlvbnMucHk=) | `60.00% <48.88%> (-16.72%)` | :arrow_down: |

... and [2 files with indirect coverage changes](https://app.codecov.io/gh/ooni/data/pull/37/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ooni)
