Regarding the compaction lag: it's entirely possible that L0 compaction is delayed, because we only compact data once we have accumulated enough blocks. However, in the PR you mentioned, we added a parameter that controls how long a block may be staged.
We also introduced an indicator – time to compaction. In our dev environment, L0 compaction lag does not exceed 1m, and the p99 is around 15-20 seconds.
I think that relying on the "current" time might be dangerous – we could explore an option where we use the timestamps of the blocks (the time they were created). Also, I think that out-of-order (OOO) ingestion is almost inevitable: jobs might run concurrently, and their order is not guaranteed (we don't need it for compaction). Usually this is not an issue, but if a job fails and we retry (which we do), we will likely violate the order. Fortunately, both Mimir and Prometheus handle OOO samples (with some caveats).
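As a rough illustration of the timestamping idea, here is a minimal Go sketch of deriving the exported sample timestamp from the block's own time range rather than from the wall clock. The `BlockMeta` fields and the choice of the range end are assumptions for illustration, not the actual Pyroscope types.

```go
package main

import (
	"fmt"
	"time"
)

// BlockMeta is a stand-in for the compacted block's metadata; the real
// Pyroscope type differs, but it carries a time range in some form.
type BlockMeta struct {
	MinTimeMs int64 // earliest sample in the block, unix ms
	MaxTimeMs int64 // latest sample in the block, unix ms
}

// sampleTimestamp derives the timestamp of the exported metric sample from
// the block itself instead of time.Now(). This makes the series independent
// of when compaction happened to run, at the cost of producing out-of-order
// samples, which Prometheus and Mimir can ingest within their OOO window.
func sampleTimestamp(meta BlockMeta) time.Time {
	// Use the end of the block's range so repeated compactions of
	// overlapping data still move the series forward.
	return time.UnixMilli(meta.MaxTimeMs)
}

func main() {
	meta := BlockMeta{
		MinTimeMs: time.Now().Add(-90 * time.Second).UnixMilli(),
		MaxTimeMs: time.Now().Add(-30 * time.Second).UnixMilli(),
	}
	fmt.Println("exported sample timestamp:", sampleTimestamp(meta))
}
```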
Prerequisites
Exporting profile metrics at compaction time
This PoC shows how we could export metrics from profiles at compaction time (in fact, we do it right after compaction, not strictly at compaction time).
Compaction is something that eventually happens to every block in our object storage. This approach offers some benefits over exporting at ingestion time, as described by Tempo's maintainers:
In theory, the first level of compaction (L0 blocks to L1 blocks) happens shortly after data ingestion (~10s). But in practice, I've observed that L0 compaction happens every 30-120s. I don't know the reason for this delay (maybe data ingestion is low and compaction happens less often? I only ingest data for 1 tenant with 2 services, roughly every 15s).
Generated metrics
Now that we have a prototype running, we can get a picture of what the generated metrics look like.
Dimensions
Every profile type or dimension is exported as a metric with this format: pyroscope_exported_metrics_<profile_type>, where the : separators of the profile type are replaced by _.
So, for example, if a service writes profile data for 3 different __profile_type__ values, we will export 3 different metrics:

process_cpu:cpu:nanoseconds:cpu:nanoseconds → pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds
memory:alloc_objects:count:space:bytes → pyroscope_exported_metrics_memory_alloc_objects_count_space_bytes
memory:alloc_space:bytes:space:bytes → pyroscope_exported_metrics_memory_alloc_space_bytes_space_bytes
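For illustration, here is a minimal Go sketch of how such a metric name could be derived from the profile type; the helper name and the exact sanitization rules are assumptions, not the code used in the prototype.

```go
package main

import (
	"fmt"
	"strings"
)

// exportedMetricName maps a Pyroscope profile type such as
// "memory:alloc_objects:count:space:bytes" to a Prometheus-compatible
// metric name by prefixing it and replacing characters that are not
// allowed in metric names.
func exportedMetricName(profileType string) string {
	sanitized := strings.NewReplacer(":", "_", "-", "_", ".", "_").Replace(profileType)
	return "pyroscope_exported_metrics_" + sanitized
}

func main() {
	for _, pt := range []string{
		"process_cpu:cpu:nanoseconds:cpu:nanoseconds",
		"memory:alloc_objects:count:space:bytes",
		"memory:alloc_space:bytes:space:bytes",
	} {
		fmt.Println(pt, "->", exportedMetricName(pt))
	}
}
```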
Labels are preserved, unrolling a new series for each label set. So we can query the CPU of a specific pod of a service, combined with some other pprof label, like this:
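For instance (with hypothetical label names, since the pod and vehicle labels depend on what the application actually attaches), such a query could look like `pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds{service_name="my-service", pod="my-service-0", vehicle="bike"}`. A minimal Go sketch, using invented type names, of how the preserved label set could travel onto the exported sample:

```go
package main

import "fmt"

// ExportedSample is an illustrative stand-in for one exported sample: the
// derived metric name plus the label set copied over from the profile
// series found in the compacted block.
type ExportedSample struct {
	Metric string
	Labels map[string]string
	Value  float64
}

func main() {
	s := ExportedSample{
		Metric: "pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds",
		// Labels are copied verbatim from the profile's label set, so each
		// distinct label set becomes its own series downstream.
		Labels: map[string]string{
			"service_name": "my-service",
			"pod":          "my-service-0",
			"vehicle":      "bike",
		},
		Value: 1.5e9, // total CPU nanoseconds observed in the compacted block
	}
	fmt.Printf("%s%v = %g\n", s.Metric, s.Labels, s.Value)
}
```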
Dimension metrics are exported for every tenant and every service_name, but this should be configurable by the user.
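A hypothetical shape for that configuration, sketched in Go; the field names are invented for illustration and do not correspond to an existing Pyroscope config block.

```go
package metricsexport

// ExporterConfig sketches what user-facing configuration for dimension
// export could look like.
type ExporterConfig struct {
	Enabled      bool     `yaml:"enabled"`
	Tenants      []string `yaml:"tenants"`       // empty slice: export for all tenants
	ServiceNames []string `yaml:"service_names"` // empty slice: export for all services
	ProfileTypes []string `yaml:"profile_types"` // empty slice: export all dimensions
}
```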
Functions
This prototype also explores the ability to export metrics for specific functions. We can choose an interesting function to export.
It now exports data for every dimension of the given function, under this format:
In this prototype I've hardcoded the garbage collector and HTTP functions to export, for every service_name. I haven't made any distinction per tenant yet. The functions to export should come from config (a UI is a must here).
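To make the idea concrete, here is a small, hypothetical Go sketch of exporting a hardcoded allow-list of functions from per-function totals that are assumed to have been aggregated while reading the compacted block; the function names, the aggregation input, and the metric naming are all illustrative assumptions, not the prototype's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// exportedFunctions is the hardcoded allow-list used by this sketch; in the
// prototype this role is played by the garbage collector and HTTP handlers,
// and eventually it should come from configuration.
var exportedFunctions = []string{
	"runtime.gcBgMarkWorker",
	"net/http.(*conn).serve",
}

// exportFunctionTotals takes per-function totals (assumed to be aggregated
// during compaction) and emits one sample per allow-listed function. A real
// implementation would push these to a remote-write target instead of
// printing them.
func exportFunctionTotals(totals map[string]int64, serviceName string) {
	for _, fn := range exportedFunctions {
		v, ok := totals[fn]
		if !ok {
			continue
		}
		metric := "pyroscope_exported_function_" +
			strings.NewReplacer(".", "_", "/", "_", "(", "", ")", "", "*", "").Replace(fn)
		fmt.Printf("%s{service_name=%q} = %d\n", metric, serviceName, v)
	}
}

func main() {
	exportFunctionTotals(map[string]int64{
		"runtime.gcBgMarkWorker": 42_000_000,
		"net/http.(*conn).serve": 128_000_000,
	}, "my-service")
}
```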
In the future, we could specify a filter of label sets instead of exporting by service_name. So, for example, "foo": "{}" would export every profile of the foo function, and "foo": "{service_name=\"my-service\", vehicle=\"bike\"}" would export only profiles with that service_name and vehicle.
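A rough Go sketch of how such label-set filters could be parsed and matched; the map-based format follows the example above, but the parsing helper and matching logic are assumptions for illustration (exact equality matchers only, no regex or negation).

```go
package main

import (
	"fmt"
	"strings"
)

// functionFilters maps a function name to a label selector; "{}" means
// "export this function for every label set".
var functionFilters = map[string]string{
	"foo": `{service_name="my-service", vehicle="bike"}`,
	"bar": `{}`,
}

// parseSelector turns `{k="v", ...}` into a map of required label values.
func parseSelector(sel string) map[string]string {
	required := map[string]string{}
	sel = strings.Trim(strings.TrimSpace(sel), "{}")
	if sel == "" {
		return required
	}
	for _, part := range strings.Split(sel, ",") {
		kv := strings.SplitN(strings.TrimSpace(part), "=", 2)
		if len(kv) != 2 {
			continue
		}
		required[kv[0]] = strings.Trim(kv[1], `"`)
	}
	return required
}

// matches reports whether a profile's label set satisfies the selector.
func matches(required, labels map[string]string) bool {
	for k, v := range required {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	profileLabels := map[string]string{"service_name": "my-service", "vehicle": "bike", "pod": "p-0"}
	for fn, sel := range functionFilters {
		fmt.Printf("export %q for this profile: %v\n", fn, matches(parseSelector(sel), profileLabels))
	}
}
```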
Detected challenges
This naive solution is full of trade-offs and assumptions, and it's far from final. I've detected some challenges:
DEMO
I have a Pyroscope instance with these changes running on my machine, exporting metrics to my Grafana Cloud instance.
Go grant yourself privileges on the admin page:
You can take a look at the exported metrics here: https://albertosotogcp.grafana.net/explore/metrics/trail?from=now-1h&to=now&timezone=browser&var-ds=grafanacloud-prom&var-otel_resources=&var-filters=&var-deployment_environment=&metricSearch=pyroscope_&metricPrefix=all
You can see a demo dashboard, where I tried to simulate alerts for >20% of CPU spent on garbage collection or >60% of memory in HTTP requests: https://albertosotogcp.grafana.net/goto/eFRA8E7NR?orgId=1