automatedops opened this issue 2 years ago (status: Open)
Another note: after fixing the TLS path in my case, the metrics-generator still wasn't able to send metrics properly. I had to delete the WAL files on all my pods to recover. Maybe calling "creating WAL" multiple times caused some kind of corruption of the WAL files.
Nice find. I agree with your thoughts about exiting the process on failure to load the TLS file. We will also look into the issue with WAL corruption.
Okay, I think we are not cleaning up resources properly after running into an error. The remote write config is loaded after the WAL has been created.
If loading the config fails, we just exit the function but do not destroy the newly created WAL. When a second batch is pushed, we try to create the WAL again and this causes the duplicate registration.
I'll check if we can validate the remote write config in advance. If not, we should just make sure we clean up new resources before returning the error.
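A minimal sketch of that cleanup, assuming hypothetical names (createWAL, newRemoteWriteStorage, tenantStorage) rather than the actual Tempo code:

```go
// Sketch only: names and signatures are illustrative, not the real Tempo code.
// The point is that the WAL created in step 1 must be torn down if step 2
// (building the remote-write storage) fails; otherwise a retry re-registers
// the WAL's collectors and panics.
func newTenantStorage(walDir string, reg prometheus.Registerer, cfg RemoteWriteConfig) (*tenantStorage, error) {
	wal, err := createWAL(walDir, reg) // registers WAL metrics on reg
	if err != nil {
		return nil, err
	}

	rw, err := newRemoteWriteStorage(cfg, reg)
	if err != nil {
		// Close the WAL (and unregister its collectors) before returning,
		// so the next attempt starts from a clean slate.
		wal.Close()
		return nil, err
	}

	return &tenantStorage{wal: wal, remoteWrite: rw}, nil
}
```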
About the corrupted WAL: creating the WAL multiple times should be fine. The next invocations will reuse the same directory. I'm thinking the panic might have disrupted some async process, leaving the WAL in an invalid state.
So this is an issue upstream in Prometheus: if the first attempt to create the remote write structure fails, not all resources are cleaned up correctly and the second attempt will panic when registering metrics. We can't fix this from Tempo code. I've created an issue upstream: https://github.com/prometheus/prometheus/issues/10779
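For reference, the panic itself comes from the Prometheus client_golang registry, which rejects a second collector describing the same metric. A standalone reproduction, unrelated to Tempo:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	// First registration succeeds.
	reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Example counter.",
	}))

	// Registering another collector that describes the same metric panics
	// with "duplicate metrics collector registration attempted", the same
	// panic seen when the per-tenant storage is created a second time
	// against a registerer that still holds the first attempt's collectors.
	reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Example counter.",
	}))
}
```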
@kvrhdn in the event that we can't create the WAL or the remote write structures, should we just exit cleanly instead of trying again?
Could this impact us during normal operations, or only on startup?
We create these structures when we first receive data for that tenant. Since we don't know in advance which tenants are active, we can't create them at startup. We can try to validate the config at startup, but maintaining that validation ourselves will be tricky...
We could deliberately exit the process when creating the remote write structures fails, but I don't know if we can do that in a clean way. Btw, an error in the remote write config will most likely impact every tenant, so in practice the first tenant that sends data will trigger this error.
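As a rough sketch of what deliberately exiting could look like (RemoteWriteConfig and newRemoteWriteStorage below are hypothetical stand-ins, not the actual Tempo code):

```go
package main

import (
	"errors"
	"log"
)

// RemoteWriteConfig and newRemoteWriteStorage are hypothetical stand-ins for
// the real per-tenant setup code; only the error-handling pattern matters here.
type RemoteWriteConfig struct {
	Endpoint string
}

func newRemoteWriteStorage(cfg RemoteWriteConfig) (struct{}, error) {
	if cfg.Endpoint == "" {
		return struct{}{}, errors.New("remote write endpoint is not configured")
	}
	return struct{}{}, nil
}

func main() {
	cfg := RemoteWriteConfig{} // deliberately invalid to show the failure path

	if _, err := newRemoteWriteStorage(cfg); err != nil {
		// Fail the whole process instead of retrying on the next push and
		// hitting the duplicate-registration panic. A supervisor such as
		// Kubernetes can restart the pod once the config is fixed.
		log.Fatalf("failed to create remote-write storage: %v", err)
	}
}
```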
I have configured the metrics-generator to write to Prometheus remote-write storage. However, no metrics are received by Prometheus, although some data is written to the Prometheus WAL directory.
Similar to the panic log above, I am seeing the message below when starting the metrics-generator. Could the issue that produces this message cause the metrics-generator not to send metrics to Prometheus?
"level=info ts=2022-05-25T19:24:59.700977765Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=ivapp14... ring=metrics-generator level=info ts=2022-05-25T19:24:59.701209972Z caller=app.go:284 msg="Tempo started""
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.
Describe the bug When configuring the metrics-generator on Grafana Tempo, I had a misconfigured TLS path.
The metrics-generator pod boots up but crashes immediately.
In the full crash log, "creating WAL" shows up multiple times, which leads to this crash. I think it would be helpful if we could exit the program on any misconfigured remote write endpoint. Feel free to suggest other error-handling approaches.
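One way such an up-front check might look, as a rough sketch; validateTLSFiles and the file paths are illustrative assumptions, not Tempo's actual config handling:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"os"
)

// validateTLSFiles is a hypothetical helper that checks the TLS files
// referenced by a remote-write config before the generator starts accepting
// data, so a bad path fails the process at startup instead of at first push.
func validateTLSFiles(caFile, certFile, keyFile string) error {
	if caFile != "" {
		if _, err := os.ReadFile(caFile); err != nil {
			return fmt.Errorf("reading TLS CA file %q: %w", caFile, err)
		}
	}
	if certFile != "" || keyFile != "" {
		if _, err := tls.LoadX509KeyPair(certFile, keyFile); err != nil {
			return fmt.Errorf("loading TLS key pair %q/%q: %w", certFile, keyFile, err)
		}
	}
	return nil
}

func main() {
	// Illustrative paths only.
	if err := validateTLSFiles("/etc/tempo/ca.crt", "", ""); err != nil {
		log.Fatalf("invalid remote write TLS configuration: %v", err)
	}
}
```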
To Reproduce Steps to reproduce the behavior:
panic: duplicate metrics collector registration attempted
Expected behavior
Environment:
Additional Context Sample configuration
Full Panic log