wucao opened this issue 3 years ago
We need to make the name sanitizing more strict. Right now it only swaps out a couple of characters; perhaps we should flip that around and only allow a fixed set of characters (e.g. alphanumerics plus forward slash), replacing everything else.
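A minimal sketch of that allowlist approach (the function name, the package name, and the underscore replacement are assumptions for illustration, not the actual Jaeger code):

```go
package sanitize

import "strings"

// sanitizeName keeps only a fixed allowlist of characters
// (ASCII letters, digits, and forward slash) and replaces
// everything else with an underscore. Because every kept or
// substituted character is valid UTF-8, the result can never
// trip the Prometheus label-value validation.
func sanitizeName(name string) string {
	var b strings.Builder
	b.Grow(len(name))
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z',
			r >= 'A' && r <= 'Z',
			r >= '0' && r <= '9',
			r == '/':
			b.WriteRune(r)
		default:
			// Invalid UTF-8 bytes decode as RuneError and land here too.
			b.WriteByte('_')
		}
	}
	return b.String()
}
```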
Alternatively, we could avoid calling the prom_client methods that panic and instead call the ones that return errors, in this case hist.GetMetricWithLabelValues instead of hist.WithLabelValues, and then use some kind of fallback. Even better if our own API returned errors, but that would be a breaking change.
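A rough sketch of that error-returning path with a fallback label (the function, the "invalid-service-name" placeholder, and the variable names are illustrative, not existing Jaeger code):

```go
package metricsdemo

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeLatency records latency for a service name of unknown
// provenance. GetMetricWithLabelValues returns an error instead of
// panicking on an invalid (e.g. non-UTF-8) label value, so we can
// fall back to a placeholder label rather than crash the collector.
func observeLatency(hist *prometheus.HistogramVec, serviceName string, latency time.Duration) {
	obs, err := hist.GetMetricWithLabelValues(serviceName)
	if err != nil {
		obs, err = hist.GetMetricWithLabelValues("invalid-service-name")
		if err != nil {
			return // give up on this observation rather than panic
		}
	}
	obs.Observe(latency.Seconds())
}
```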
Sanitizing would fix the issue at hand; calling methods that return errors would be a good solution for the unknown issues. Question: is this causing the collector to panic, or are we recovering somehow?
Do you have any updates on this? I have the same problem 😢.
I was able to change the METRICS_BACKEND to expvar, and I think it is working correctly now, but I'm not sure if there is any performance overhead in doing so.
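For reference, this is how that workaround looks when running the all-in-one image mentioned later in this thread (image tag is just an example, port mappings omitted):

```sh
docker run -e METRICS_BACKEND=expvar jaegertracing/all-in-one:1.45.0
```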
I have a narrowly scoped fix #4051 that will address the specific problem with the collector's HTTP endpoints.
It still leaves a broader issue open: some external inputs (such as service names) are used as Prometheus labels and could cause panics. I would prefer to address that more generally when we migrate all metrics to OTEL. We'd probably still want to keep the metrics factory abstraction, but at least change its API to return errors from factories, so that these issues could be handled gracefully without panics.
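As a sketch of what an error-returning factory API might look like (these interface and method names are hypothetical, not the current Jaeger metrics API):

```go
package metrics

// Counter is a minimal counter abstraction.
type Counter interface {
	Inc(delta int64)
}

// Options carries the metric name and tags, mirroring how the
// existing factory calls are parameterized.
type Options struct {
	Name string
	Tags map[string]string
}

// Factory is a hypothetical error-returning variant of the metrics
// factory abstraction: instead of panicking deep inside the
// Prometheus backend when a tag value is invalid, the backend
// returns an error that callers can handle (log, fall back, drop).
type Factory interface {
	Counter(opts Options) (Counter, error)
	// Gauge, Timer, and Histogram would follow the same pattern.
}
```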
I am running jaegertracing/all-in-one:1.45.0 with OTel, and I am having the same issue when \x85 is being passed as a value. This is the stack trace:
panic: label value "label__3d\x85_this_will_panic-repro" is not valid UTF-8
goroutine 14097 [running]:
github.com/prometheus/client_golang/prometheus.(*CounterVec).WithLabelValues(...)
github.com/prometheus/client_golang@v1.15.0/prometheus/counter.go:274
github.com/jaegertracing/jaeger/internal/metrics/prometheus.(*Factory).Counter(0xc000676be0, {{0x17bb0f7, 0x8}, 0xc02563a4b0, {0x0, 0x0}})
github.com/jaegertracing/jaeger/internal/metrics/prometheus/factory.go:144 +0x5f1
github.com/jaegertracing/jaeger/internal/metrics/fork.(*Factory).Counter(0x156c7a0?, {{0x17bb0f7, 0x8}, 0xc02563a4b0, {0x0, 0x0}})
github.com/jaegertracing/jaeger/internal/metrics/fork/fork.go:43 +0x42
github.com/jaegertracing/jaeger/cmd/collector/app.(*spanCountsBySvc).countByServiceName(0xc000bb6c98, {0xc022ea3500?, 0x0?}, 0x0)
github.com/jaegertracing/jaeger/cmd/collector/app/metrics.go:331 +0x207
github.com/jaegertracing/jaeger/cmd/collector/app.metricsBySvc.countSpansByServiceName(...)
github.com/jaegertracing/jaeger/cmd/collector/app/metrics.go:242
github.com/jaegertracing/jaeger/cmd/collector/app.metricsBySvc.ReportServiceNameForSpan({{{0xc000678cc0, 0xc000678e10, {0x1deaba0, 0xc000678c00}, 0xc000573218, 0xfa0, {0x17bb0f7, 0x8}}}, {{0xc000678f60, 0xc000679710, ...}, ...}}, ...)
github.com/jaegertracing/jaeger/cmd/collector/app/metrics.go:233 +0xc9
github.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).enqueueSpan(0xc0001ae3f0, 0xc02b38da40, {0x17b443c, 0x5}, {0x17b2cb7?, 0x0?}, {0x0, 0x0})
github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:233 +0x205
github.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).ProcessSpans(0xc0001ae3f0, {0xc008155c68?, 0x1, 0xc0255bf140?}, {{0x17b443c, 0x5}, {0x17b2cb7, 0x4}, {0x0, 0x0}})
github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:194 +0x145
github.com/jaegertracing/jaeger/cmd/collector/app/handler.(*batchConsumer).consume(0xc0005a6ba0, {0x1de9138?, 0xc02563a480?}, 0xc01f2ade80)
github.com/jaegertracing/jaeger/cmd/collector/app/handler/grpc_handler.go:88 +0x2ac
github.com/jaegertracing/jaeger/cmd/collector/app/handler.(*consumerDelegate).consume(0xc0005a6ba0, {0x1de9138, 0xc02563a480}, {0x1579fc0?})
github.com/jaegertracing/jaeger/cmd/collector/app/handler/otlp_receiver.go:168 +0x8a
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/consumer@v0.76.1/traces.go:36
go.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export(0xc0005b64b0, {0x1de9138, 0xc02563a3f0}, {0x0?})
go.opentelemetry.io/collector/receiver/otlpreceiver@v0.76.1/internal/trace/otlp.go:52 +0xdb
go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp.rawTracesServer.Export({{0x1ddff40?, 0xc0005b64b0?}}, {0x1de9138?, 0xc02563a3f0?}, 0x170a500?)
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0011/ptrace/ptraceotlp/grpc.go:94 +0xd1
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler.func1({0x1de9138, 0xc02563a3f0}, {0x170f080?, 0xc023aeb170})
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0011/internal/data/protogen/collector/trace/v1/trace_service.pb.go:310 +0x78
go.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1({0x1de9138?, 0xc02563a390?}, {0x170f080, 0xc023aeb170}, 0x0?, 0xc023aeb188)
go.opentelemetry.io/collector@v0.76.1/config/configgrpc/configgrpc.go:411 +0x4c
google.golang.org/grpc.getChainUnaryHandler.func1({0x1de9138, 0xc02563a390}, {0x170f080, 0xc023aeb170})
google.golang.org/grpc@v1.54.0/server.go:1164 +0xb9
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x1de9138, 0xc02563a2d0}, {0x170f080, 0xc023aeb170}, 0xc0096762a0, 0xc01f2adb00)
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.40.0/interceptor.go:342 +0x528
google.golang.org/grpc.chainUnaryInterceptors.func1({0x1de9138, 0xc02563a2d0}, {0x170f080, 0xc023aeb170}, 0xc03add5a58?, 0x157ad40?)
google.golang.org/grpc@v1.54.0/server.go:1155 +0x8f
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler({0x154cba0?, 0xc0005a8860}, {0x1de9138, 0xc02563a2d0}, 0xc03266bdc0, 0xc0001abf00)
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0011/internal/data/protogen/collector/trace/v1/trace_service.pb.go:312 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000268780, {0x1df1e40, 0xc0007f5a00}, 0xc025600360, 0xc0005b94a0, 0x29284d0, 0x0)
google.golang.org/grpc@v1.54.0/server.go:1345 +0xdf3
google.golang.org/grpc.(*Server).handleStream(0xc000268780, {0x1df1e40, 0xc0007f5a00}, 0xc025600360, 0x0)
google.golang.org/grpc@v1.54.0/server.go:1722 +0xa36
google.golang.org/grpc.(*Server).serveStreams.func1.2()
google.golang.org/grpc@v1.54.0/server.go:966 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
google.golang.org/grpc@v1.54.0/server.go:964 +0x28a
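For anyone wanting to reproduce this outside the collector, a minimal standalone snippet (the metric name is made up) that triggers the same panic in client_golang:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	vec := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "demo_spans_total", Help: "demo"},
		[]string{"svc"},
	)
	// "\x85" is a lone byte that is not valid UTF-8, so WithLabelValues
	// panics with: label value "...\x85..." is not valid UTF-8
	vec.WithLabelValues("label__3d\x85_this_will_panic-repro").Inc()
}
```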
Unfortunately the pods will crash and won't recover automatically. 😅 Any chance we can ship the fix here as well?
Describe the bug: Prometheus "Label value is not valid UTF-8" causes API request timeouts.
To Reproduce: Steps to reproduce the behavior:
log:
Expected behavior: Other requests should not time out.
Version (please complete the following information):
What troubleshooting steps did you try? The error is reported by Prometheus, so I added -e METRICS_BACKEND=expvar to disable Prometheus.