grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Tempo metrics-generator stops pushing metrics via remote_write to Prometheus after some time #2514

Open erplsf opened 1 year ago

erplsf commented 1 year ago

Describe the bug

I use a k8s Service (ClusterIP) as the target for metrics-generator remote_write, e.g.:

storage:
    path: /var/tempo/wal
    remote_write:
    - url: http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write
    remote_write_flush_deadline: 1m
    wal: null
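
For context, this `storage` block sits under `metrics_generator` in the Tempo config, and the processors have to be enabled per tenant or no series are produced at all. A minimal sketch of the surrounding config under those assumptions; the `send_exemplars` flag and the legacy top-level overrides format are illustrative and may differ by Tempo version:

```yaml
metrics_generator:
  storage:
    path: /var/tempo/wal
    remote_write:
      # the receiving Prometheus must have remote-write receiving enabled
      - url: http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write
        send_exemplars: true
    remote_write_flush_deadline: 1m

overrides:
  # without this per-tenant override the metrics-generator produces no series
  metrics_generator_processors:
    - service-graphs
    - span-metrics
```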

After some time (I believe this is related to Prometheus pods being relocated to different nodes/restarted), the metrics-generator stops shipping new metrics to Prometheus - as in, no more data comes in. Example of one of the metrics queried in Prometheus: (screenshot: 2023-05-30_12:53:09)

The distributor/ingestion pipeline is still working correctly - I can search, filter, and do all the usual operations with traces, but the service graph and the metrics generated from them are lost. Graph showing the ingester processing/appending new traces during that period: (screenshot: 2023-05-30_13:02:43)

And one more graph showing that the distributor does, in fact, send data to the metrics-generator: (screenshot: 2023-05-30_13:07:36)

To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo (SHA or version) with metrics-generator enabled
  2. Configure the remote_write endpoint
  3. Perform Operations (Read/Write/Others)
  4. Observe metrics not being written

Expected behavior

I expected metrics-generator to continue pushing metrics.

Environment:

Additional Context

I couldn't find any metrics generated by metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like remote_writes_failed_total would make sense here?

joe-elliott commented 1 year ago

After some time, (I believe this is related to Prometheus pods being relocated to different nodes/restarted), metrics-generator stops shipping new metrics to the Prometheus

This is interesting. Can you reproduce this by purposefully forcing your Prometheus pod to move?

I couldn't find any metrics generated by metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like remote_writes_failed_total would make sense here?

Does the metrics generator log anything? I'm also surprised we don't have an obvious metric on failures. Does this counter increase: tempo_metrics_generator_registry_collections_failed_total?

erplsf commented 1 year ago

I'll redeploy the metrics-generator so it "acquires" Prometheus again, and then I'll try to reproduce it.

No, the metrics generator doesn't seem to be logging anything interesting, so here are screenshots of the last period of its running: (screenshot: 2023-05-30_15:52:08)

And the graph for the counter (even without rate, the raw counter never goes above zero): (screenshot: 2023-05-30_15:52:39)

zalegrala commented 1 year ago

It seems like we need a PR to include some useful logging in the processors. Do you have any logs on the remote-write side that might help narrow it down? Based on the logs above, it seems that the series are still being generated.

erplsf commented 1 year ago

Nothing interesting seems to happen on the Prometheus (receiver) side - I don't think it logs anything related to remote writes at all by default.

And as for testing/trying to reproduce this:

  1. I've restarted the metrics-generator, and it replayed its WAL, so now the graph is filled with the missing data: (screenshots: 2023-05-31_10:43:07, 2023-05-31_10:42:29)
  2. Restarting Prometheus created a small gap in the metrics, but after both Prometheus pods came up (we run 2 replicas), the metrics-generator started writing again, although without filling in the gap. So I guess the issue isn't clearly related to Prometheus restarts/pod IP changes. (screenshot: 2023-05-31_10:45:21)

Any more ideas on what I can do here? This is a pretty major blocker for us in evaluating Tempo, as with this bug present we may need to resort to hacks such as periodically restarting the metrics-generator so our dashboards keep working/can be navigated correctly.

Meanwhile, I'll try to keep an eye on this metric and catch when/how this bug manifests itself again.

joe-elliott commented 1 year ago

Do you know what version of Tempo you're running? Can you try the tip of main, grafana/tempo:main-1d84273? We merged this PR, which is not in 2.1:

https://github.com/grafana/tempo/pull/2463

This PR correctly handles sharding when doing remote write. I'm wondering if the metrics generator is falling behind (because it's not sharding) and Prometheus is rejecting the samples because they're too old.

Also, the metrics generator publishes prometheus remote write metrics that start with the prefix below. This includes total written bytes, failures, exemplars, etc.

prometheus_remote_storage...

Perhaps dig into these and see if there's a clue on what is causing the failures.
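
For anyone following along, a hedged sketch of Prometheus alerting rules over those remote-write metrics (scraped from the metrics-generator). The alert names and thresholds are my own; the metric names are the standard `prometheus_remote_storage_*` series from the embedded remote-write client, so verify them against your metrics-generator's /metrics endpoint, as they can vary by version:

```yaml
groups:
  - name: tempo-metrics-generator-remote-write
    rules:
      - alert: MetricsGeneratorRemoteWriteFailing
        # samples the remote-write queue failed to send in the last 5 minutes
        expr: sum(rate(prometheus_remote_storage_samples_failed_total[5m])) > 0
        for: 10m
      - alert: MetricsGeneratorRemoteWriteBehind
        # gap between the newest sample appended to the WAL and the newest sample sent
        expr: |
          max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
            - ignoring(remote_name, url) group_right
          max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          > 120
        for: 10m
```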

erplsf commented 1 year ago

I'm using the Tempo (distributed) Helm chart v1.4.1, which AFAIK maps to Tempo 2.1.1.

I'll try using the tip of main overnight (by changing the tag/image used in the chart).

Weird, I'm scraping the metrics-generator endpoints, but I don't have any prometheus_remote_storage metrics in my instance.

EDIT: I do have those metrics - before trying the new image, I'll wait for the re-occurrence and check the metrics first.

erplsf commented 1 year ago

The metrics-generator stopped writing metrics at around 22:00 - the pod was restarted (evicted by Karpenter due to rebalancing). The only interesting metric I could find that had some values is prometheus_remote_storage_enqueue_retries_total, but it seems it was/is steadily increasing since the beginning of the Tempo deployment: (screenshot: 2023-05-31_22:37:05)

Both "failed" metrics are stable at zero: 2023-05-31_22:38:48 2023-05-31_22:38:24

For now I'll restart the pods again as-is, hopefully to replay the WAL, and will look at the captured logs from the pods - maybe they say something more.

And in the morning I'll try the new image/tag.

erplsf commented 1 year ago

Some more container logs that may help:

Log snippet:

```
level=info ts=2023-05-31T20:49:01.514585942Z caller=main.go:215 msg="initialising OpenTracing tracer"
level=info ts=2023-05-31T20:49:01.529641413Z caller=main.go:102 msg="Starting Tempo" version="(version=2.1.1, branch=HEAD, revision=4157d7620)"
level=info ts=2023-05-31T20:49:01.531177797Z caller=server.go:334 http=:3100 grpc=:9095 msg="server listening on addresses"
level=info ts=2023-05-31T20:49:01.640087Z caller=memberlist_client.go:437 msg="Using memberlist cluster label and node name" cluster_label= node=tempo-metrics-generator-5c675696bb-v69kz-580085e2
level=info ts=2023-05-31T20:49:01.640467134Z caller=module_service.go:82 msg=initialising module=internal-server
level=info ts=2023-05-31T20:49:01.640557809Z caller=module_service.go:82 msg=initialising module=server
level=info ts=2023-05-31T20:49:01.640896156Z caller=module_service.go:82 msg=initialising module=memberlist-kv
level=info ts=2023-05-31T20:49:01.640912915Z caller=module_service.go:82 msg=initialising module=overrides
level=info ts=2023-05-31T20:49:01.641315183Z caller=module_service.go:82 msg=initialising module=usage-report
level=info ts=2023-05-31T20:49:01.641723235Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2023-05-31T20:49:01.641822883Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2023-05-31T20:49:01.665417683Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=7 elapsed_time=23.696354ms
level=info ts=2023-05-31T20:49:01.665464707Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=tempo-gossip-ring
level=info ts=2023-05-31T20:49:01.665532522Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=tempo-metrics-generator-5c675696bb-v69kz ring=metrics-generator
level=info ts=2023-05-31T20:49:01.665790893Z caller=app.go:188 msg="Tempo started"
level=info ts=2023-05-31T20:49:01.680562849Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=7 elapsed_time=15.098448ms
level=info ts=2023-05-31T20:49:01.925215784Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/var/tempo/wal/single-tenant
ts=2023-05-31T20:49:01.925941912Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=6e4ccb
ts=2023-05-31T20:49:01.925972432Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2023-05-31T20:49:01.92606343Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=6e4ccb
ts=2023-05-31T20:49:01.926171172Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="error tailing WAL" err="failed to find segment for index"
level=info ts=2023-05-31T20:49:01.929552853Z caller=db.go:336 tenant=single-tenant component=wal msg="replaying WAL, this may take a while" dir=/var/tempo/wal/single-tenant/wal
level=info ts=2023-05-31T20:49:01.929725837Z caller=db.go:387 tenant=single-tenant component=wal msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2023-05-31T20:49:16.949711425Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=7754
ts=2023-05-31T20:49:16.966710295Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=10.04001782s
level=info ts=2023-05-31T20:49:31.939344628Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=8162
level=info ts=2023-05-31T20:49:46.941825334Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=8418
level=info ts=2023-05-31T20:50:01.943571598Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=9788
level=info ts=2023-05-31T20:50:16.949477333Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=12975
level=info ts=2023-05-31T20:50:31.956189483Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=19158
level=info ts=2023-05-31T20:50:46.953215883Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=19312
level=info ts=2023-05-31T20:51:01.951997328Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=19409
```

erplsf commented 1 year ago

As there's no way to override the image tag globally, I'll update just the metrics-generator to the suggested tag and report back in a day if this issue re-occurs.

erplsf commented 1 year ago

Unfortunately that commit also ships this, and the Helm chart is not updated to handle it, so when I use the new image/tag just for metricsGenerator it goes into a crash loop with the following error:

failed parsing config: failed to parse configFile /conf/tempo.yaml: yaml: unmarshal errors:
  line 109: field tolerate_failed_blocks not found in type frontend.Config

Same with the tip of main (890838b).
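
If anyone else wants to test a main-build image against a 2.1-era chart: the error points at the rendered config, so a hedged sketch of the offending part, assuming the chart renders the now-removed field under `query_frontend` (the value shown is illustrative; check the line number from the error against your own rendered tempo.yaml):

```yaml
# rendered /conf/tempo.yaml (excerpt): images built from main no longer
# recognise this field, so config parsing fails until it is removed
query_frontend:
  tolerate_failed_blocks: 1   # remove this line (or the chart value that renders it)
```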

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

erplsf commented 11 months ago

This issue still plagues our systems. We're running two replicas of metrics-generator now and still haven't found a plausible cause for it - after some time the metrics just stop appearing in Prometheus, and there are no error logs or any indication on the metrics-generator side that something is wrong. :cry:

joe-elliott commented 11 months ago

Reopening and labeling keepalive. Are you seeing this on Tempo latest: 2.2.1?

erplsf commented 11 months ago

I'm seeing it in Tempo 2.2.0. I can update today and report if we'll continue to observe this in 2.2.1.

joe-elliott commented 11 months ago

I can update today and report if we'll continue to observe this in 2.2.1.

There will likely be no changes. There were some remote write improvements in 2.2.0, but the 4 small patches in 2.2.1 will unfortunately not fix your issue.

So to review:

Some things to try/think about:

I'm sorry you are seeing this issue. We do not see it internally which makes it tough to debug.

erplsf commented 11 months ago

Thank you for the writeup! (Un)fortunately I'm leaving on vacation tomorrow, but I'll set up a reminder and get back to you once I've gathered enough information to hopefully make this easier to debug.

evilgenius-jp commented 5 months ago

@erplsf did this get resolved for you? I still keep running into the same issue.

suikast42 commented 1 month ago

Same here.

I am on the latest LGTM version in my dev env and in production.

Suddenly, overnight, the metrics-generator stopped producing span metrics in the production environment.

I have exactly the same configuration in dev and prod (the single difference is the storage duration: 1d vs. 30d).

The whole LGTM setup is a single deployment with local file storage.

I can't see any errors in Tempo or Mimir. The apps can push traces and metrics, but the metrics-generator won't produce metrics from the traces.

Development Env:

(screenshot)

Production Env:

(screenshot)

suikast42 commented 1 month ago

Update:

Somehow the metrics generator worked for a few hours during the night and then stopped again:

(screenshot)

joe-elliott commented 1 month ago

So we do not see this issue and we'll need more help to diagnose. Can you check relevant metrics/logs on the node to see if any resources are saturated that correlate with the failure? disk issues? OOMed pods? syslog errors? etc

suikast42 commented 1 month ago

So we do not see this issue and we'll need more help to diagnose. Can you check relevant metrics/logs on the node to see if any resources are saturated that correlate with the failure? disk issues? OOMed pods? syslog errors? etc

I have checked all of this.

The logs say there is nothing wrong:

level=info ts=2024-06-04T15:23:22.388783918Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:23:37.388506404Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:23:52.389075694Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:07.388806274Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:22.389042666Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:37.389053244Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:52.388918437Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:07.388334347Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:22.388419881Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:37.388802937Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:52.388655141Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:07.388244935Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:22.389210792Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:37.388696048Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:52.388249678Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:07.388179911Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:22.388147373Z caller=registry.go:257 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-06-04T15:27:22.388251415Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:37.388463094Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:52.388821661Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:07.388782982Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:17.156958422Z caller=poller.go:241 msg="writing tenant index" tenant=single-tenant metas=2 compactedMetas=2
level=info ts=2024-06-04T15:28:17.158943884Z caller=poller.go:136 msg="blocklist poll complete" seconds=0.002474755
level=info ts=2024-06-04T15:28:22.388833133Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:37.388965088Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:52.38916509Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:07.388431511Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:22.388827088Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:37.388112415Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:52.388795907Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:30:07.38924223Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0

The traces, on the other hand, are there:

(screenshot)

How can I dig deeper into this issue?

suikast42 commented 1 month ago

My issue is resolved. I want to report it for folks who run into the same issue.

First of all, I had forgotten to activate metrics scraping from Tempo itself. After activating metric scraping and following the Troubleshoot metrics-generator docs, I see the metric shown below.

(screenshot)
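
Presumably the counter in that screenshot is the discarded-spans one from the troubleshooting doc; a hedged sketch of an alert over it, assuming `tempo_metrics_generator_spans_discarded_total` is exposed with a `reason` label in your Tempo version (the alert name and threshold are my own):

```yaml
groups:
  - name: tempo-metrics-generator-discards
    rules:
      - alert: MetricsGeneratorDiscardingSpans
        # spans rejected by the metrics-generator, e.g. because their timestamps
        # fall outside the ingestion time-range slack (clock skew, late batches)
        expr: sum by (reason) (rate(tempo_metrics_generator_spans_discarded_total[5m])) > 0
        for: 10m
```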

Furthermore, in the systemd logs I see:

(screenshot)

After looking at the OS timestamps of the spans' producers, I see a clock skew of more than 40s.

Increasing the value of metrics_ingestion_time_range_slack to 60s solves my problem.
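
For anyone applying the same fix, a minimal sketch of where that knob sits, assuming the monolithic config layout used above (the default is 30s in recent versions, so 60s comfortably covers the ~40s skew observed here):

```yaml
metrics_generator:
  # spans whose timestamps fall further than this outside "now" are skipped
  # by the metrics-generator; widen it to tolerate producer clock skew
  metrics_ingestion_time_range_slack: 60s
```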

Why the NTP clocks have this deviation is, of course, the customer's responsibility.

nc-gcz commented 1 week ago

@suikast42, when you refer to 'First of all, I had forgotten to activate metrics scraping from Tempo itself.'

What do you have in mind? Is it:

Thanks!