measure-sh / measure

Measure is an open source tool to monitor mobile apps.
https://measure.sh
Apache License 2.0

Make ingestion faster #1226

Open detj opened 1 month ago


Summary

The symbolication implementation as it exists today is the bottleneck for ingestion throughput. Improvements in the following areas can positively impact ingestion throughput. These optimizations apply regardless of platform, making ingestion faster for current and future platforms.

  1. Don't work with large mapping files
  2. Only symbolicate once
  3. Implement ingestion queue

Don't work with large mapping files

Mapping files are large. Use a full mapping file only once and don't store it at all. Instead, extract the useful parts from it, compressing as much information as possible while remaining functional for the purpose of symbolication. Let's call the result a reduced mapping file, and store only reduced mapping files.
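
As a rough illustration of what "reduce" could mean, here is a sketch that keeps only the class and method renames from an R8/ProGuard-style mapping and drops everything else (fields, comments, line-number metadata). The function name and the simplified grammar are assumptions for illustration, not the actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ReduceMapping keeps only the class and method renames needed for
// symbolication, producing a much smaller "reduced mapping" keyed by
// obfuscated name. This handles a simplified subset of the R8/ProGuard
// mapping format, not the full grammar.
func ReduceMapping(mapping string) map[string]string {
	reduced := make(map[string]string)
	var currentObf string
	scanner := bufio.NewScanner(strings.NewReader(mapping))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") {
			continue // comment
		}
		if !strings.HasPrefix(line, " ") {
			// class line: "original.Class -> obf:"
			parts := strings.Split(strings.TrimSuffix(line, ":"), " -> ")
			if len(parts) == 2 {
				currentObf = parts[1]
				reduced[currentObf] = parts[0]
			}
			continue
		}
		// member line: keep only methods (they contain parentheses),
		// drop field mappings entirely
		if strings.Contains(line, "(") {
			parts := strings.Split(strings.TrimSpace(line), " -> ")
			if len(parts) == 2 {
				reduced[currentObf+"."+parts[1]] = parts[0]
			}
		}
	}
	return reduced
}

func main() {
	mapping := `com.example.MainActivity -> a.b:
    int count -> c
    void onCreate(android.os.Bundle) -> d`
	r := ReduceMapping(mapping)
	fmt.Println(r["a.b"])   // com.example.MainActivity
	fmt.Println(r["a.b.d"]) // void onCreate(android.os.Bundle)
}
```

The field mapping (`count -> c`) is discarded because stack frames only ever reference classes and methods.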

Only symbolicate once

Right now, we symbolicate every single crash and ANR. Instead, we should symbolicate only when we have to. This could drastically reduce the load on the symbolicator services and improve ingestion throughput. Here's the high-level algorithm.

  1. Compute a hash for each new crash/ANR
  2. Check if a symbolicated result already exists against the hash
  3. If found, replace the crash/ANR with symbolicated data
  4. If not found, symbolicate and store the hash and computed symbolicated data
  5. Repeat.
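
The steps above amount to a content-addressed cache in front of the symbolicator. A minimal sketch, where the in-memory map stands in for a persistent store and `symbolicate` stands in for the symbolicator call (both names are assumptions):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashFrames computes a stable hash over the raw (unsymbolicated) stack
// frames of a crash/ANR; identical raw stacks produce identical hashes.
func hashFrames(frames []string) string {
	h := sha256.Sum256([]byte(strings.Join(frames, "\n")))
	return hex.EncodeToString(h[:])
}

// symbolicateOnce checks the cache first and only calls the expensive
// symbolicate function on a miss, storing the result against the hash.
func symbolicateOnce(cache map[string][]string, frames []string, symbolicate func([]string) []string) []string {
	key := hashFrames(frames) // step 1: compute hash
	if result, ok := cache[key]; ok {
		return result // step 3: reuse existing symbolicated data
	}
	result := symbolicate(frames) // step 4: symbolicate and store
	cache[key] = result
	return result
}

func main() {
	cache := make(map[string][]string)
	calls := 0
	symbolicate := func(frames []string) []string {
		calls++
		return []string{"com.example.MainActivity.onCreate(MainActivity.kt:42)"}
	}
	raw := []string{"a.b.d(SourceFile:1)"}
	symbolicateOnce(cache, raw, symbolicate)
	symbolicateOnce(cache, raw, symbolicate) // second call is a cache hit
	fmt.Println("symbolicator calls:", calls) // 1
}
```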

What about versions and architectures?

To keep the logic simple and reusable, we don't have to associate hashes with app versions or architectures. But if a need for that arises, we can modify the approach accordingly.

Implement ingestion queue

At present, the whole ingestion pipeline is synchronous, limiting our ability to parallelize ingestion. Instead, let's introduce three separate buckets: the first stores unprocessed events, the second stores symbolicated and grouped crashes/ANRs, and the third stores events that are ready to be flushed to ClickHouse. Here's the high-level algorithm.

  1. Validate the incoming event batch
  2. If validated, put it in the raw queue
  3. Reply with a 202 Accepted to the client
  4. An event worker takes events, symbolicates and groups them, and puts them in the processed queue
  5. An ingestion worker takes these processed events and stores them in ClickHouse in a single batch
```mermaid
flowchart LR
   A[event arrival] -- put in --> B[raw queue] -- symbolicate and group --> C[processed queue] -- store in --> D[ClickHouse]
```
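
A minimal sketch of this staged pipeline, with buffered channels standing in for the raw and processed queues (in production these would be durable queues, and the transform would be the real symbolicate/group step):

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified stand-in for an ingested event.
type Event struct {
	ID      int
	Payload string
}

// runPipeline wires the stages together: raw queue -> event worker ->
// processed queue -> ingestion worker -> storage.
func runPipeline(events []Event) []Event {
	raw := make(chan Event, len(events))       // raw queue
	processed := make(chan Event, len(events)) // processed queue
	var stored []Event
	var wg sync.WaitGroup

	// Event worker: symbolicates/groups (here: a trivial transform).
	// Note it never touches a database, per the additional notes below.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for e := range raw {
			e.Payload = "symbolicated:" + e.Payload
			processed <- e
		}
		close(processed)
	}()

	// Ingestion worker: only drains the processed queue into storage
	// (ClickHouse in the real pipeline), no transformations.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for e := range processed {
			stored = append(stored, e)
		}
	}()

	// Step 2: validated events go onto the raw queue. The HTTP handler
	// would reply 202 Accepted here without waiting for the workers.
	for _, e := range events {
		raw <- e
	}
	close(raw)
	wg.Wait()
	return stored
}

func main() {
	out := runPipeline([]Event{{1, "crash"}, {2, "anr"}})
	fmt.Println(len(out), out[0].Payload) // 2 symbolicated:crash
}
```

Because the handler returns as soon as the event is enqueued, ingestion latency decouples from symbolication latency, and worker counts on each stage can scale independently.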

Additional notes

  1. These queues should be fail-safe. If the system recovers from a crash at any point, it should restore all data in the queues and let the workers resume execution transparently.

  2. Queue progress should be tracked using OpenTelemetry Traces and Metrics. Furthermore, track the complete time it takes from the point of event arrival to event storage.

  3. Event workers should not interface with any databases. Their responsibility is limited to carrying out only event transformations.

  4. Ingestion workers should not carry out transformations. Their responsibility is limited to only storing data.

  5. Ingestion workers should persist data when the queue size exceeds a configurable limit (like 1 million events) or a configurable timeout elapses, whichever happens first.

  6. Current ingestion rate, for future comparison:

    M2 MacBook Pro 32GB - 172 events/s
    GCP `e2-standard-4` Intel Broadwell 16GB - 24 events/s (with 40 concurrent batches)