grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Distributed tracing: support W3C tracing headers (OpenTelemetry) #8714

Open jmichalek132 opened 2 months ago

jmichalek132 commented 2 months ago

Is your feature request related to a problem? Please describe.

If you run nginx (for example, the nginx ingress) in front of Mimir, you can't get a full trace containing the spans from both nginx and Mimir. This is because the nginx otel module (https://nginx.org/en/docs/ngx_otel_module.html) only supports W3C context propagation.

Meanwhile, Mimir (when receiving an HTTP call) only expects Jaeger tracing headers. This is where it's handled: https://github.com/grafana/dskit/blob/main/middleware/http_tracing.go#L49 https://github.com/opentracing-contrib/go-stdlib/blob/master/nethttp/server.go#L124

Describe the solution you'd like

Implement support for W3C context propagation when handling incoming HTTP requests.

Describe alternatives you've considered

Opening an issue on the nginx side to add support for Jaeger tracing headers.

Doing https://github.com/grafana/mimir/issues/2708 and fully migrating Mimir to OpenTelemetry for tracing.

Switching to a different proxy / ingress on our side, but that feels like overkill.

Additional context

I would be interested in helping to address this; I'm just not sure what the best way to fix it is. Also, given that Prometheus is now instrumented with OpenTelemetry for traces too, and doesn't have a config option to also set Jaeger tracing headers, it might not be possible to get a full trace of a remote write request to Mimir without a proxy in the middle that converts the W3C headers to Jaeger ones.
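To make the header mismatch concrete, here is a minimal sketch of the conversion such a proxy would need to do: mapping a W3C `traceparent` value (`version-traceid-parentid-flags`) to Jaeger's `uber-trace-id` format (`traceid:spanid:parentspanid:flags`). The function name `traceparentToJaeger` is hypothetical and not part of Mimir, dskit, or any Jaeger library. The sketch assumes only version `00` of the W3C format, ignores `tracestate`, and carries over only the sampled bit.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceparentToJaeger is a hypothetical helper (not an existing Mimir/dskit API).
// It converts a W3C traceparent header value into a Jaeger uber-trace-id value.
func traceparentToJaeger(traceparent string) (string, error) {
	parts := strings.Split(strings.TrimSpace(traceparent), "-")
	if len(parts) != 4 {
		return "", errors.New("traceparent: expected 4 dash-separated fields")
	}
	version, traceID, spanID, flags := parts[0], parts[1], parts[2], parts[3]

	// Assumption: only version 00 is handled in this sketch.
	if version != "00" {
		return "", fmt.Errorf("traceparent: unsupported version %q", version)
	}
	if len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 {
		return "", errors.New("traceparent: wrong field length")
	}
	// Every field must be valid hex; all-zero IDs are invalid per the W3C spec.
	for _, field := range []string{traceID, spanID} {
		if _, err := hex.DecodeString(field); err != nil {
			return "", fmt.Errorf("traceparent: invalid hex field %q", field)
		}
		if strings.Trim(field, "0") == "" {
			return "", errors.New("traceparent: all-zero ID")
		}
	}
	flagBytes, err := hex.DecodeString(flags)
	if err != nil {
		return "", fmt.Errorf("traceparent: invalid flags %q", flags)
	}

	// Bit 0 means "sampled" in both formats; other bits are dropped here.
	sampled := flagBytes[0] & 0x01

	// The Jaeger parent-span-id field is deprecated, so it is set to 0.
	return fmt.Sprintf("%s:%s:0:%d", traceID, spanID, sampled), nil
}

func main() {
	jaeger, err := traceparentToJaeger("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(jaeger) // prints 4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:0:1
}
```

A proxy or HTTP middleware could set the result as the `uber-trace-id` request header when only `traceparent` is present, though native W3C support on the receiving side would avoid the extra hop.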

bboreham commented 2 months ago

I don't think opentracing-contrib upstream is going to take any more PRs, but we are already using a fork https://github.com/grafana/opentracing-contrib-go-stdlib, so that is one possibility.

Converting everything to OpenTelemetry is still the long-term aim, but it's (a) a lot of work and (b) even more work to check and mitigate performance degradations. https://github.com/grafana/dskit/pull/385 is a work in progress.