nc-gcz opened 2 weeks ago
Same here, but I'm using the official Grafana tempo-distributed Helm chart:
cache:
  caches:
  - memcached:
      consistent_hash: true
      host: 'test-tempo-memcached'
      service: memcached-client
      timeout: 500ms
    roles:
    - parquet-footer
    - bloom
    - frontend-search
compactor:
  compaction:
    block_retention: 72h
    compacted_block_retention: 1h
    compaction_cycle: 30s
    compaction_window: 1h
    max_block_bytes: 107374182400
    max_compaction_objects: 6000000
    max_time_per_tenant: 5m
    retention_concurrency: 10
    v2_in_buffer_bytes: 5242880
    v2_out_buffer_bytes: 20971520
    v2_prefetch_traces_count: 1000
  ring:
    kvstore:
      store: memberlist
distributor:
  log_received_spans:
    enabled: true
    filter_by_status_error: true
    include_all_attributes: true
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
          max_recv_msg_size_mib: 4194304
        http:
          endpoint: 0.0.0.0:4318
  ring:
    kvstore:
      store: memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  bind_addr: []
  bind_port: 7946
  gossip_interval: 1s
  gossip_nodes: 2
  gossip_to_dead_nodes_time: 30s
  join_members:
  - dns+test-tempo-gossip-ring:7946
  leave_timeout: 5s
  left_ingesters_timeout: 5m
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  node_name: ""
  packet_dial_timeout: 5s
  packet_write_timeout: 5s
  pull_push_interval: 30s
  randomize_node_name: true
  rejoin_interval: 0s
  retransmit_factor: 2
  stream_timeout: 10s
metrics_generator:
  metrics_ingestion_time_range_slack: 30s
  processor:
    service_graphs:
      dimensions: []
      enable_client_server_prefix: true
      enable_messaging_system_latency_histogram: true
      enable_virtual_node_label: true
      histogram_buckets: []
      max_items: 10000
      wait: 10s
      workers: 10
    span_metrics:
      dimensions: []
      histogram_buckets: []
  registry:
    collection_interval: 15s
    external_labels:
      source: tempo
    stale_duration: 15m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/generator/wal
    remote_write:
    - headers:
        X-Scope-OrgID: test
      send_exemplars: true
      url: http://observability-mimir-distributor:8080/api/v1/push
    remote_write_add_org_id_header: false
    remote_write_flush_deadline: 1m
  traces_storage:
    path: /var/tempo/generator/traces
multitenancy_enabled: true
overrides:
  per_tenant_override_config: /runtime-config/overrides.yaml
querier:
  frontend_worker:
    frontend_address: test-tempo-query-frontend-discovery:9095
    grpc_client_config:
      max_send_msg_size: 4194304
  max_concurrent_queries: 20
  search:
    external_backend: null
    external_endpoints: []
    external_hedge_requests_at: 8s
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 30s
  trace_by_id:
    query_timeout: 10s
query_frontend:
  max_outstanding_per_tenant: 2000
  max_retries: 2
  search:
    concurrent_jobs: 1000
    target_bytes_per_job: 104857600
  trace_by_id:
    query_shards: 50
server:
  grpc_server_max_recv_msg_size: 4194304
  grpc_server_max_send_msg_size: 4194304
  http_listen_port: 3100
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
  log_format: logfmt
  log_level: info
storage:
  trace:
    backend: s3
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    pool:
      max_workers: 400
      queue_depth: 20000
    s3:
      access_key: 123
      bucket: tempo
      endpoint: observability-minio:9000
      forcepathstyle: true
      insecure: false
      secret_key: 123
      tls_insecure_skip_verify: true
    wal:
      path: /var/tempo/wal
usage_report:
  reporting_enabled: false
I'm not sure if this is a limitation of the distributed mode. When I tried it with Docker Compose, it worked because the metrics generator and ingester read and write data to the same local path. However, in a Kubernetes environment in distributed mode, all components run as separate containers with separate volumes, and I'm not sure whether the metrics generator and ingester share the same path.
Hopefully someone from the Grafana team can enlighten us on this.
In our set-up, the ingester indeed has a persistent volume claim:
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-grafana-tempo-ingester-0
But it doesn't seem to be mounted into metrics-generator.
Another thing that we've spotted when running the tests was the following behaviour:
level=info ts=2024-07-10T14:50:55.619034542Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=44556
level=info ts=2024-07-10T14:50:55.62145804Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=45901
level=info ts=2024-07-10T14:51:10.515918323Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-10T14:51:25.5159173Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
At some point during the tests (which in our case generate traffic across 30+ microservices) it basically stopped generating metrics. After I noticed that the next morning and restarted the metrics-generator, it stopped after printing the following log lines:
level=info ts=2024-07-11T12:42:06.162281707Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2024-07-11T12:42:06.162851639Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2024-07-11T12:42:06.173897462Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=4 elapsed_time=11.048845ms
level=info ts=2024-07-11T12:42:06.173946198Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=test-grafana-tempo-gossip-ring
level=info ts=2024-07-11T12:42:06.17426984Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-6cc769bdd7-zh89p ring=metrics-generator
level=info ts=2024-07-11T12:42:06.174649825Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-11T12:42:06.185328903Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=11.382705ms
Nothing is printed about even attempting to connect to the Prometheus instance, which in our case is completely idle, not restarting, working as expected, and always on the same URL.
Hey all, can you review these steps:
https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/
This will help us identify where things are being dropped.
However, in a Kubernetes environment under distributed mode, all components run as separate containers with separate volumes. I'm not sure if the metrics generator and ingester share the same path.
Generators and ingesters do not need shared access to block storage. They are fine to run independently.
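For context, the generator only needs a local directory of its own for the remote-write WAL; nothing there has to be shared with the ingesters. A trimmed sketch of the relevant part of the config posted above (the comment about the WAL's role is my own understanding, not from the original post):
metrics_generator:
  storage:
    # local WAL used only to buffer samples before remote write;
    # it does not need to be shared with any other component
    path: /var/tempo/generator/wal
    remote_write:
    - url: http://observability-mimir-distributor:8080/api/v1/push
      send_exemplars: true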
Hello @joe-elliott, thank you for reaching out.
As part of the troubleshooting, our Operations Team has reverted from Grafana Tempo 2.5.0 to 2.1.1. We had issues with the same behaviour on 2.5.0 that we thought would be fixed by downgrading to 2.1.1 (an older version we've run before), but it doesn't seem to help - we see the same behaviour on both versions.
I will follow the matching troubleshooting guide from here: https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/
- sum(increase(traces_service_graph_request_total[1m])) - empty result from the last 12 hours, no graph visible
- sum(increase(tempo_distributor_ingester_appends_total[1m])) - value of around 300-350 in the last 12 hours
- sum(increase(tempo_distributor_metrics_generator_pushes_total[1m])) - empty result from the last 12 hours, no graph visible
- sum(increase(tempo_distributor_metrics_generator_pushes_failures_total[1m])) - empty result from the last 12 hours, no graph visible
- rate(tempo_metrics_generator_registry_collections_failed_total[1m]) - empty result from the last 12 hours, no graph visible
Metrics like prometheus_remote_storage_samples_total, prometheus_remote_storage_bytes_total or e.g. prometheus_remote_storage_samples_failed_total are reporting empty results in Prometheus for the last 24 hours.
- sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])) - empty result from the last 12 hours, no graph visible
- sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])) - as above
- sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason) - as above
- sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])) - as above
- prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total - as above
- sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])) - as above
- sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])) - as above
- sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])) - as above
It seems that metrics-generator is not sending any metrics at all, yet it's reporting itself as healthy.
It used to work: after a reinstall of the Helm chart, and after some time and tests, it stopped working, reporting active_series=0 for a prolonged time, only to be restarted and again not pick up or send any metrics to Prometheus. So it's not like it never works - it works for some time and then it stops :(
I've attached an example body of the metrics endpoint for metrics-generator that is missing all the metrics related to service graphs, span metrics and the internal metrics specified above (which are empty):
metrics-generator-example-metrics-endpoint-response-body.txt
Prometheus reports the target as being scraped correctly.
Also, we've restarted Grafana Tempo again - a clean deployment of Helm chart version 2.1.1.
What we can see now is that the active_series= log is present and, most importantly, so is the following log line:
level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant
When metrics-generator is idle, the component's log ends before this line, so the last log lines are:
level=info ts=2024-07-12T10:05:50.298578151Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-589965b6b5-qdrvt ring=metrics-generator
level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms
Whereas for a working metrics-generator it looks like this:
level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms
level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant
ts=2024-07-12T10:05:50.481543325Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-0
ts=2024-07-12T10:05:50.481582922Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.481763202Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-1
ts=2024-07-12T10:05:50.481772492Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.483207769Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-0
ts=2024-07-12T10:05:50.483562004Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-1
level=info ts=2024-07-12T10:05:50.48629064Z caller=db.go:336 tenant=single-tenant component=wal msg="replaying WAL, this may take a while" dir=/bitnami/grafana-tempo/data/wal/single-tenant/wal
level=info ts=2024-07-12T10:05:50.486461227Z caller=db.go:387 tenant=single-tenant component=wal msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2024-07-12T10:06:05.514274932Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6074
ts=2024-07-12T10:06:05.551793586Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.068244585s
ts=2024-07-12T10:06:05.552795859Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.069655298s
On an idle metrics-generator, the /bitnami/grafana-tempo/data/wal path is empty. But when it starts correctly, it has data.
/bitnami/grafana-tempo/data is an emptyDir in the Kubernetes deployment, as installed from the Bitnami Helm chart.
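For reference, an emptyDir is declared per pod and is wiped whenever the pod is recreated, so the generator's WAL starts from scratch on every restart. A minimal sketch of what the rendered pod spec looks like (the volume name is an assumption, mirroring the ingester spec above):
spec:
  volumes:
    - name: data
      # ephemeral: contents are lost when the metrics-generator pod restarts
      emptyDir: {}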
It stopped again:
level=info ts=2024-07-12T10:43:33.390479119Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:43:48.393775049Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:03.390660555Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:18.379634162Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-07-12T10:44:18.379708084Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-12T10:44:33.379406636Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
As we used to have some metrics for a short period of time, I'd like to go through the metrics from the troubleshooting guide again:
- sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])): empty
- sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])):
- sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason):
- sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])): empty
- prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total:
- sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])):
- sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])):
- sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])):
By checking out the code locally (2.1.1) and looking at modules/generator/storage/instance.go, we can conclude that at some point:
// New creates a metrics WAL that remote writes its data.
func New(cfg *Config, tenant string, reg prometheus.Registerer, logger log.Logger) (Storage, error) {
    logger = log.With(logger, "tenant", tenant)
    reg = prometheus.WrapRegistererWith(prometheus.Labels{"tenant": tenant}, reg)

    walDir := filepath.Join(cfg.Path, tenant)

    level.Info(logger).Log("msg", "creating WAL", "dir", walDir) // This is never called
The creating WAL log line is never printed - we cannot see it in the logs.
The code has a chance to return before the New method is called in two places in modules/generator/generator.go:
func (g *Generator) getOrCreateInstance(instanceID string) (*instance, error) {
    inst, ok := g.getInstanceByID(instanceID)
    if ok {
        return inst, nil
    } // <<< HERE

    g.instancesMtx.Lock()
    defer g.instancesMtx.Unlock()

    inst, ok = g.instances[instanceID] // <<< OR HERE
    if ok {
        return inst, nil
    }

    // This is never reached apparently?
    inst, err := g.createInstance(instanceID)
    if err != nil {
        return nil, err
    }
    g.instances[instanceID] = inst
    return inst, nil
}
Why it's not being called, I am not able to tell :( Any help appreciated here. Thank you.
I would definitely not recommend upgrading from 2.1.x to 2.5.x. We aggressively rotate through block formats and it's quite possible blocks created by 2.1 are not supported by 2.5. Also, if you're doing fresh installs, can you just install 2.5?
But I'm not sure if that is causing your issue or not. You're saying that on a fresh install of 2.1.1 it worked for a while and then quit? Those graphs suggest the distributors just stopped sending data to the generators. Can you check /metrics-generator/ring on the distributors?
Why it's not being called, I am not able to tell :( Any help appreciated here. Thank you.
I believe it's not being called b/c if you don't send data to the generator it will never create an instance. It creates a per-tenant instance on demand. The missing log line suggests it's just not receiving any data, which is consistent with your other graphs.
In my Tempo distributor metrics, I found these metrics:
# HELP tempo_distributor_ingester_clients The current number of ingester clients.
# TYPE tempo_distributor_ingester_clients gauge
tempo_distributor_ingester_clients 1
# HELP tempo_distributor_metrics_generator_clients The current number of metrics-generator clients.
# TYPE tempo_distributor_metrics_generator_clients gauge
tempo_distributor_metrics_generator_clients 0
It seems the metrics-generator client is not configured.
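As far as I understand, the distributor only creates metrics-generator clients for tenants that have at least one generator processor enabled via overrides, so with no processors enabled this gauge stays at 0. A minimal sketch of what the per-tenant overrides file referenced by per_tenant_override_config could contain (the tenant name and the legacy key layout are assumptions to verify against your Tempo version):
overrides:
  "single-tenant":
    metrics_generator_processors:
      - service-graphs
      - span-metrics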
@nc-gcz IIRC, have you tried enabling the metrics generator by setting values in global_overrides?
global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics
https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed#activate-metrics-generator
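For completeness, a minimal sketch of the corresponding values for the official tempo-distributed chart (key names differ between chart versions, so treat this as an assumption to check against the chart's values.yaml rather than a drop-in snippet):
metricsGenerator:
  # the generator deployment must be enabled in addition to the overrides below
  enabled: true
global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics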
Describe the bug
Good morning. We are running Grafana Tempo 2.5.0 using the Bitnami Grafana Tempo distributed Helm chart in version 3.6.3. The following components are deployed: Compactor, Distributor, Memcached, Metrics Generator, Querier, Query Frontend, Vulture and OpenTelemetry Collector in version 0.83.0.
We have an issue with the Metrics Generator after updating Grafana Tempo 2.1.1 to 2.5.0. The issues with the incompatible vParquet version have been mitigated according to the documentation: https://grafana.com/docs/tempo/latest/setup/upgrade/#upgrade-to-tempo-25. The issue we are experiencing is that metrics-generator seems to be idle after start-up. This is the log from a metrics-generator that was started around 60 minutes earlier. The relevant configuration is as follows:
Yet, no service maps or span metrics are visible in Grafana. These metrics were correctly persisted in Prometheus via remote write on Tempo 2.1.1 but broke after upgrading to 2.5.0. Also, the internal metrics defined on the documentation page https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/ (the first 3-4) are reporting empty data in Prometheus - it seems like they are not reported at all.
I think I've followed the documentation to enable the processor in metrics_generator, yet it's idling. Before, I was expecting more output, but I cannot see anything after joining the cluster - nothing about connecting to Prometheus, nothing about not being able to connect to the Prometheus remote-write, nothing.
Interestingly enough, running Grafana Tempo from ./examples/docker-compose/distributed, I was able to see the metrics out of the box.
To Reproduce: Steps to reproduce the behavior:
Expected behavior: Metrics Generator is not idle and metrics are visible in Prometheus.
Environment:
v11.1.0 (5b85c4c2fc)