Does this occur once you reach a certain threshold of spans or bytes per second? Replication factor 2 is also a weird config. We have used it in the past, but haven't touched it for probably over a year. It's hard to guarantee any behavior when using it.
Judging by the tempo_distributor_spans_received_total metric, it's about 20 spans per second and 2 MB/s. I've changed the replication factor to 3 and see the same errors. Additional information: I have 6 distributors and 12 ingesters. Should I increase some limits?
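For reference, per-second figures like these can be read with rate queries along the following lines (a sketch; the metric names are the ones mentioned in this thread, while the rate window and any label selectors are assumptions that depend on how Tempo is scraped):

```promql
# spans ingested per second across all distributors
sum(rate(tempo_distributor_spans_received_total[5m]))

# bytes ingested per second across all distributors
sum(rate(tempo_distributor_bytes_received_total[5m]))
```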
It seems like your spans are ~100 KB each? That is extremely large for a single span. I'm guessing you will need to increase the write timeout between the distributor and the ingester:
```yaml
distributor:
ingester_client:
  remote_timeout: 5s   # default
```
You could also try reducing the batch size you push to Tempo.
Thanks for the advice. I increased remote_timeout to 20s and reduced batch_size to 4096. Throughput seems to have increased, but I still see a lot of the same errors.
Having a similar issue, but discovered it from another side.
We have the "microservices" setup in Kubernetes.
On average the Tempo distributor consumes approx. 500 MB of RAM,
but from time to time memory usage spikes dramatically until it OOMs.
Hypothesis: there is a bunch of services talking to each other, and somehow some of them are not sending traces to Tempo, which makes the distributor queue them until it receives all the data; that may explain why this is happening.
But then I also found an error in the distributor logs.
If it helps, I can grab some metrics and/or logs to figure out what's going on.
For the moment I'm just increasing memory, already to 4 GB,
but I bet it won't solve the problem.
Attaching the config; nothing fancy here, except a few minor tunings for other issues we found earlier:
Will be glad to provide any additional data.
Those are pretty dramatic spikes in memory. What does your traffic look like during those spikes? Can you use tempo_distributor_bytes_received_total to determine if traffic is increasing?
The distributor does not queue span data, but it does queue metrics data. If you have the metrics-generator component enabled then perhaps there is some queueing going on. Maybe check tempo_distributor_forwarder_queue_length? It will tell you if queueing is occurring.
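The traffic check is the same rate(tempo_distributor_bytes_received_total[...]) query sketched earlier; for the queue, something like this (a sketch; the per-pod grouping is an assumption about how the metric is labeled in Kubernetes, and the metric name is as referenced above with the spelling corrected):

```promql
# current length of the distributor -> metrics-generator forwarder queue, per pod
max by (pod) (tempo_distributor_forwarder_queue_length)
```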
Hm, @joe-elliott, thank you for the suggestion, I did not even think about that; indeed we have metrics enabled with some dimensions.
We have a linear increase of traffic during the first half of the day and a linear decrease during the second (like a typical consumer website).
Out of curiosity, here is what I found across various metrics:
Just for reference, memory usage of the distributor:
Memory usage of the metrics generator:
Even though we have the usual overall daily increase of traffic, neither the bytes nor the spans metrics were spiking.
And here is an interesting one: about 10 minutes before the first memory spike I see an increase in the queue, but I cannot say it is huge.
In general, right after that we start seeing that the distributor does not see the metrics generator as a client.
But theoretically that is kind of fine, e.g. after a restart the distributor was not added to the ring yet, or something like that.
From the metrics generator side, the only thing in the logs right before this is this one:
And in the distributor logs I see a quadrillion of:
failed to pushToQueue traces to tenant single-tenant queue: queue is full
So it seems like the memory usage is indeed related to these queues.
Then the question would be: isn't there a setting that prevents this queue from filling up (e.g. just drop metrics in such a case)?
The only related issue I have found here is #1541, but it does not seem to apply here.
Perhaps try reducing your metrics_generator_forwarder_queue_size:
```yaml
overrides:
  metrics_generator_forwarder_queue_size: ??
```
If unset this defaults to 100. I wonder if you have some extremely large trace batches coming in which are filling up this queue and OOMing your distributors?
Hm, interesting, I did not see such a setting in the docs, but I'm going to set it to, let's say, 500 and give it a week or so (unfortunately this issue reproduces randomly a few times a week, so I cannot catch it on demand).
> I wonder if you have some extremely large trace batches coming in which are filling up this queue and OOMing your distributors?
Probably not, otherwise we should see a spike in tempo_distributor_bytes_received_total, but it stays the same (~800 KB).
But just in case, checking tempo_distributor_traces_per_batch_count and tempo_distributor_traces_per_batch_sum: I'm not sure which of them is more correct here, but in both cases the numbers do not seem very big (though to be fair I have nothing to compare them with).
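Since these are the _sum and _count series of one histogram, the usual way to read them together is as the average number of traces per batch over time, e.g. (a sketch; the rate window is an assumption):

```promql
# average number of traces per pushed batch
sum(rate(tempo_distributor_traces_per_batch_sum[5m]))
/
sum(rate(tempo_distributor_traces_per_batch_count[5m]))
```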
I find that final graph interesting and perhaps revealing of the issue. It seems that your traffic pattern is changing right as the distributor OOMs are occurring. I believe this is indicating that the average traces per batch is spiking. Batches with tons of tiny traces are more costly to process than batches with fewer large traces.
Perhaps try scaling up distributors and see if it holds better? I'm also wondering if reducing the batch size in your grafana agent or otel collector might help. Generally batching reduces the overall Tempo resources required, but maybe in this case you have too many traces per batch.
What is the CPU usage of your distributors when they start OOMing? Are you hitting limits? What do GCs look like?
rate(go_gc_duration_seconds_count{}[1m])
Hm, sounds reasonable (my initial thought was that a 2x batch size should not spike memory usage to infinity 🤷♂️; also please note that we observe the queue increase to 100 at 11:00, but the batch increase happens approx. 15 minutes later).
But you are right about CPU usage: we were close to the limits.
On the chart we are looking at rate(container_cpu_usage_seconds_total{container="tempo-distributor"}[5m])
and max(kube_pod_container_resource_limits{container="tempo-distributor",resource="cpu"})
(the limit is set to 500m, and we were using almost all of it).
And for rate(go_gc_duration_seconds_count{container="tempo-distributor"}[1m]) we have:
Also out of curiosity (I'm not a Go engineer, but this should theoretically be important), here is what I see for go_goroutines{container=~"tempo-distributor"} (from an average of ~100 goroutines we jumped about 10x).
Once again, thanks for the advice 🙏
I'm going to:
- set metrics_generator_forwarder_queue_size to 500
- raise the distributor CPU limit from 500m to 1000m
🤞 for this to fix everything (PS: we're going to need to create a PR for the troubleshooting page later on 💡)
> I'm also wondering if reducing the batch size in your grafana agent or otel collector might help
We did not use any additional collectors (neither Grafana Agent nor OTel Collector); services push directly to the distributor.
yeah, based on that dip in GCs, I think you are just pushing these containers to their limit. Once they saturate CPU the garbage collector gives up, memory skyrockets and they OOM.
Just for the record, and for anyone who lands here later:
In our case Monday is the heaviest day in terms of traffic and load, and it seems like we passed it.
Here are some charts of what I can see for today:
Memory usage was ~500 MB as expected (no spikes to infinity).
CPU usage is ~50% of its limit (I now believe CPU was the main reason, see below).
Here is how traffic increases during the first half of the day.
And here are two interesting charts:
Go garbage collector times (note how they decrease after 11:00).
Queue length (it correlates with the garbage collector and peaks at the max of 500 I overrode; our previous guess was that the queue causes all of this, but it seems it does not).
But at the same time the number of batches and their size did not change.
So, at the moment it is hard to say whether that was the fix (need to wait a little bit more).
But at the same time it seems that, in case of periodic distributor OOMs, we mainly need to make sure the distributors are not saturating their CPU limits (raise the limits or scale out).
Things to observe more: the forwarder queue, though it does not seem to affect memory usage, at least in this case.
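For the CPU part, the two queries used earlier can be combined into a single saturation ratio to watch or alert on (a sketch; the label names follow the usual cAdvisor/kube-state-metrics conventions and may differ per cluster):

```promql
# fraction of the CPU limit used by each distributor container
sum by (pod) (rate(container_cpu_usage_seconds_total{container="tempo-distributor"}[5m]))
/
sum by (pod) (kube_pod_container_resource_limits{container="tempo-distributor", resource="cpu"})
```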
PS: from all these messages I'm starting to imagine how cool it would be to have some kind of script that collects all the important metrics from Tempo components and posts them in one click.
Describe the bug: I have a lot of dropped traces. Errors from distributors:
To Reproduce: Steps to reproduce the behaviour:
Tempo configuration
Grafana agent remote write
Expected behaviour: Distributors don't drop traces.
Environment:
Additional Context: Metric tempo_discarded_spans_total