grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0
9.64k stars 574 forks source link

Traces get rejected if labels have dashes in them #3359

Open kbolashev opened 2 weeks ago

kbolashev commented 2 weeks ago

Describe the bug

Hi Pyroscope team, thank you very much for this great tool!

We're trying to integrate it into our stack to profile some Go code, but hitting problems with ingesting goroutine and cpu traces.

Some of dependency code we're using that calls subprocesses for execution, sets pprof labels on goroutines to track these subprocesses (PID, etc.). These labels have dashes in them (e.g. process-description), which seems to conflict with Pyroscope's ingestion and we're failing to receive CPU and Goroutine traces (memory traces work fine).

I found the validation code, assuming I'm hitting this line. I'm wondering, if there's a way to skirt by this validation, or alternatively maybe sanitize these at pyroscope-go level.

To Reproduce

Steps to reproduce the behavior:

  1. Start Pyroscope (I have version 1.5.1 running in my local kind cluster right now) and have it be accessible on http://localhost:4040
  2. Clone the reproduction repo, build and run the package.

Expected behavior

All traces show up in pyroscope

Actual: the cpu and goroutine traces don't show up,

Environment

Additional Context

Error in the stdout:

[ERROR] upload profile: failed to upload. server responded with statusCode: '422' and body: '{"code":"unknown","message":"pushing IngestInput-pprof failed invalid_argument: invalid labels '{__delta__=\"false\", __name__=\"goroutines\", hello-tag=\"world\", pyroscope_spy=\"gospy\", service_name=\"repro\"}' with error: invalid label name 'hello-tag'"}
'
[DEBUG] uploading at http://localhost:4040/ingest?aggregationType=sum&from=1718631497338591000&name=repro%7B__session_id__%3D528d9c54dd243618%7D&sampleRate=100&spyName=gospy&units=samples&until=1718631502391866000
[DEBUG] content type: multipart/form-data; boundary=2877c635eaa015fe8e86cf857751d53f590c439a0368025bf68c04d8c286
[ERROR] upload profile: failed to upload. server responded with statusCode: '422' and body: '{"code":"unknown","message":"pushing IngestInput-pprof failed invalid_argument: invalid labels '{__delta__=\"false\", __name__=\"process_cpu\", hello-tag=\"world\", pyroscope_spy=\"gospy\", service_name=\"repro\"}' with error: invalid label name 'hello-tag'"}
'
kolesnikovae commented 2 weeks ago

Hi @kbolashev! Thank you for filing the issue.

We're currently using prometheus data model for labels (including those that are set as pprof tags) for consistency with other products:

Labels may contain ASCII letters, numbers, as well as underscores. They must match the regex [a-zA-Z][a-zA-Z0-9]*.

Recently, we've relaxed the requirement, by allowing . character (replaced to _) in https://github.com/grafana/pyroscope/pull/3335. This is an exception made specifically for OTel compatibility.

Before UTF-8 labels are fully supported in Prometheus, I'm a little hesitant to add more exceptions. Instead, I'd consider altering how we handle such labels: instead of discarding the profile, we could drop the invalid labels. What do you think?

I should note that the use of high cardinality labels will very likely cause performance issues. For example, if you set the PID of the spawn processes as a pprof label, you will get a new profile entry in the storage for each such process. If you wouldn't add this label to Prometheus metrics, you wouldn't want to add it as a pprof label either.

Some of dependency code we're using that calls subprocesses for execution, sets pprof labels on goroutines to track these subprocesses (PID, etc.). These labels have dashes in them (e.g. process-description), which seems to conflict with Pyroscope's ingestion and we're failing to receive CPU and Goroutine traces (memory traces work fine).

Out of curiosity, is this a publicly available package?

kbolashev commented 2 weeks ago

Before UTF-8 labels are fully supported in Prometheus, I'm a little hesitant to add more exceptions. Instead, I'd consider altering how we handle such labels: instead of discarding the profile, we could drop the invalid labels. What do you think?

Honestly this would totally work for us, because being able to see the goroutines at all already gets much further and we don't really care about drilling down deeper for now. Only problem I see is knowing that the labels actually got discarded.

Out of curiosity, is this a publicly available package?

Yes, sure! We're reusing some of Gitea's code for parsing git repos, and they seem to have their own in-house profiling. https://github.com/go-gitea/gitea/blob/main/modules/process/manager.go#L29

I should note that the use of high cardinality labels will very likely cause performance issues. For example, if you set the PID of the spawn processes as a pprof label, you will get a new profile entry in the storage for each such process. If you wouldn't add this label to Prometheus metrics, you wouldn't want to add it as a pprof label either.

Thanks for the warning on this! We'll keep track at how it performs and whether the performance degrades too much for us. If the cardinality does turn out to be the problem, would you suggest trying to drop those labels altogether?

kolesnikovae commented 2 weeks ago

Thank you for the input, @kbolashev! We'll look into this soon (I'm sorry, I don't have a clear ETA).

If the cardinality does turn out to be the problem, would you suggest trying to drop those labels altogether?

Yeah, or scaling the Pyroscope deployment if performance degrades badly and the labels are really needed