erplsf opened this issue 1 year ago
After some time (I believe this is related to Prometheus pods being relocated to different nodes or restarted), metrics-generator stops shipping new metrics to Prometheus.
This is interesting. Can you reproduce this by purposefully forcing your Prometheus pod to move?
I couldn't find any metrics generated by metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like remote_writes_failed_total would make sense here?
Does metrics-generator log anything? I'm also surprised we don't have an obvious metric on failures. Does this counter increase: tempo_metrics_generator_registry_collections_failed_total?
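Something like this should show it (just a sketch - the per-tenant grouping and any job selector depend on how you scrape the generator):
sum by (tenant) (rate(tempo_metrics_generator_registry_collections_failed_total[5m]))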
I'll redeploy metrics-generator so it "acquires" Prometheus again, and then I'll try to reproduce it.
No, the metrics-generator doesn't seem to be logging anything interesting, so here are screenshots from its last period of running
and the graph for the counter (even without rate, the raw counter never goes above zero):
It seems like we need a PR to include some useful logging in the processors. Do you have any logs on the remote-write side that might help narrow it down? Based on the logs above, it seems that the series are still being generated.
Nothing interesting seems to happen on the Prometheus (receiver) side - I don't think it logs anything related to remote writes at all by default.
And for testing/trying to reproduce this:
Any more ideas on what I can do here? This is a pretty major blocker for us in evaluating Tempo, as with this bug present we may need to resort to hacks like periodically restarting metrics-generator so our dashboards keep working and can be navigated correctly.
Meanwhile, I'll try to keep an eye on this metric and catch when/how this bug manifests itself again.
Do you know what version of Tempo you're running? Can you try the tip of main, grafana/tempo:main-1d84273? We merged this PR, which is not in 2.1:
https://github.com/grafana/tempo/pull/2463
This PR correctly handles sharding when doing RW. I'm wondering if the metrics generator is falling behind (b/c it's not sharding) and prom is rejecting the samples b/c they're too old.
Also, the metrics generator publishes prometheus remote write metrics that start with the prefix below. This includes total written bytes, failures, exemplars, etc.
prometheus_remote_storage...
Perhaps dig into these and see if there's a clue on what is causing the failures.
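For example, something along these lines (a sketch - the job selector is a placeholder for however you scrape the generator, and exact metric names can vary a bit between Prometheus client versions):
sum by (url) (rate(prometheus_remote_storage_samples_failed_total{job="metrics-generator"}[5m]))
sum by (url) (rate(prometheus_remote_storage_samples_retried_total{job="metrics-generator"}[5m]))
sum by (url) (rate(prometheus_remote_storage_samples_dropped_total{job="metrics-generator"}[5m]))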
I'm using the Tempo (distributed) Helm chart v1.4.1, which AFAIK maps to Tempo version 2.1.1.
I'll try using the tip of main overnight (by changing the tag/image used in the chart).
Weird, I'm scraping the metrics-generator endpoints, but I don't have any prometheus_remote_storage metrics in my instance.
EDIT: I do have those metrics - before trying the new image, I'll wait for the issue to re-occur and check the metrics first.
metrics-generator stopped writing metrics at around 22:00 - the pod was restarted (evicted by Karpenter due to rebalancing). The only interesting metric I could find with any values is prometheus_remote_storage_enqueue_retries_total - but it seems to have been steadily increasing since the beginning of the Tempo deployment:
Both "failed" metrics are stable at zero:
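For anyone reproducing these panels, they are essentially just rates over the counters (a sketch - selectors omitted here since they depend on your scrape config):
rate(prometheus_remote_storage_enqueue_retries_total[5m])
rate(prometheus_remote_storage_samples_failed_total[5m])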
For now I'll restart the pods again as-is, hopefully to replay the WAL, and will look at the captured logs from the pods - maybe they say something more.
And in the morning I'll try the new image/tag.
Some more container logs that may help:
As there's no way to override the image tag globally, I'll update just the metrics-generator to the suggested tag and report in a day if this issue re-occurs.
Unfortunately that commit also ships this, and the Helm chart is not updated to handle it, so when I use the new image/tag just for metricsGenerator it goes into a crash loop with the following error:
failed parsing config: failed to parse configFile /conf/tempo.yaml: yaml: unmarshal errors:
line 109: field tolerate_failed_blocks not found in type frontend.Config
Same with the tip of main - 890838b.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.
This issue still plagues our systems. We're running two replicas of metrics-generator now and still haven't found a plausible cause for it - after some time the metrics just stop appearing in Prometheus, and there are no error logs or any other indication on the metrics-generator side that something is wrong. :cry:
Reopening and labeling keepalive. Are you seeing this on Tempo latest: 2.2.1?
I'm seeing it in Tempo 2.2.0. I can update today and report if we'll continue to observe this in 2.2.1.
I can update today and report if we'll continue to observe this in 2.2.1.
There will likely be no changes. There were some remote write improvements in 2.2.0, but the 4 small patches in 2.2.1 will unfortunately not fix your issue.
So to review:
Some things to try/think about:
I'm sorry you are seeing this issue. We do not see it internally which makes it tough to debug.
Thank you for the writeup! (Un)fortunately I'm leaving on vacation tomorrow, but I'll set up a reminder and will get back to you when I've gathered enough information to hopefully make this easier to debug.
@erplsf did this resolve for you? I still keep facing the same issue
Same here.
I am on the latest LGTM version in my dev env and in production.
Suddenly, overnight, metrics-generator stops producing span metrics in the production environment.
I have exactly the same configuration in dev and prod (the single difference is the storage duration: 1d -> 30d).
The whole LGTM setup is a single deployment with local file storage.
I can't see any errors in Tempo or Mimir. The apps can push traces and metrics, but the metrics-generator won't produce metrics from the traces.
Development Env:
Production Env:
Update:
Somehow the metrics generator works for a few hours at night and then stops again.
So we do not see this issue and we'll need more help to diagnose. Can you check the relevant metrics/logs on the node to see if any resource saturation correlates with the failure? Disk issues? OOMed pods? Syslog errors? Etc.
I have checked all of this out.
The logs say there is nothing to do:
level=info ts=2024-06-04T15:23:22.388783918Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:23:37.388506404Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:23:52.389075694Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:07.388806274Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:22.389042666Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:37.389053244Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:52.388918437Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:07.388334347Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:22.388419881Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:37.388802937Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:52.388655141Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:07.388244935Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:22.389210792Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:37.388696048Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:52.388249678Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:07.388179911Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:22.388147373Z caller=registry.go:257 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-06-04T15:27:22.388251415Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:37.388463094Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:52.388821661Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:07.388782982Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:17.156958422Z caller=poller.go:241 msg="writing tenant index" tenant=single-tenant metas=2 compactedMetas=2
level=info ts=2024-06-04T15:28:17.158943884Z caller=poller.go:136 msg="blocklist poll complete" seconds=0.002474755
level=info ts=2024-06-04T15:28:22.388833133Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:37.388965088Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:52.38916509Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:07.388431511Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:22.388827088Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:37.388112415Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:52.388795907Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:30:07.38924223Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
The traces, on the other hand, are there.
How can I dig deeper into this issue?
My issue is resolved. I want to report it for folks who run into the same issue.
First of all, I had forgotten to activate metrics scraping from Tempo itself. After activating the metric scraping and following the Troubleshoot metrics-generator docs, I see the metric shown below.
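For reference, the discarded-spans counter from the troubleshooting guide can be broken down by reason with a query along these lines (a sketch - metric and label names may differ slightly between Tempo versions):
sum by (reason) (rate(tempo_metrics_generator_spans_discarded_total[5m]))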
Furthermore, in the systemd logs I see:
After looking at the OS timestamps of the spans' producers, I see a clock skew of more than 40s.
Increasing the value of metrics_ingestion_time_range_slack to 60s solves my problem.
Why the NTP clocks have this deviation is, of course, the customer's responsibility.
@suikast42, when you refer to "First of all, I had forgotten to activate metrics scraping from Tempo itself", what do you have in mind? Is it scraping the <component>:3200/metrics endpoint(s) from the components? Thanks!
We had a similar issue where metrics-generator stopped pushing metrics after some time. Usually a restart fixed the issue for us. After some debugging I discovered that this was probably related to metrics-generator running into the prometheus remote-write shards limit.
You can also see in the dashboard that the remote-write was simply lagging behind a lot.
I thought I'd leave this here in case someone runs into similar issues.
Remote-Write Lag:
(
max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="default/metrics-generator"}[5m])
- ignoring(remote_name, url) group_right
max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="default/metrics-generator"}[5m])
)
Desired vs Max Shards:
avg(prometheus_remote_storage_shards_desired{job="default/metrics-generator",pod=~"$pod"} / prometheus_remote_storage_shards_max{job="default/metrics-generator",pod=~"$pod"}) by (pod)
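A couple of related gauges are worth watching alongside these (same selectors as above, adjusted to your labels; the names come from the Prometheus remote-write client and may vary by version):
prometheus_remote_storage_shards{job="default/metrics-generator"}
prometheus_remote_storage_samples_pending{job="default/metrics-generator"}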
You can look into the Prometheus Remote-Write Tuning Guide for more info.
Additionally, we tweaked our Mimir write path in general. Make sure you have enough resources for the gateway, distributors, and ingesters, of course.
Describe the bug
I use a k8s Service (ClusterIP) as the target for metrics-generator remote_write, e.g.:
After some time (I believe this is related to Prometheus pods being relocated to different nodes or restarted), metrics-generator stops shipping new metrics to Prometheus, as in no more data comes in - here's an example of one of the metrics queried in Prometheus:
And the distributor/ingestion pipeline is still working correctly - I can search, filter, and do all the usual operations with traces, but the service graph and the metrics generated from them are lost. Graph showing the ingester processing/appending new traces during that period:
And one more graph showing that the distributor does, in fact, send data to metrics-generator:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected metrics-generator to continue pushing metrics.
Environment:
Additional Context
I couldn't find any metrics generated by metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like remote_writes_failed_total would make sense here?