Update: Despite being a deprecated property, changing compactor.compaction.max_compaction_objects to a much lower value (e.g. 1000) and reducing compactor.compaction.max_block_bytes to e.g. 1073741824 (1 GiB) seems to have fixed the OOM issue.
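For reference, a minimal sketch of what those two settings look like in the Tempo configuration (values taken from the update above):

```yaml
compactor:
  compaction:
    max_compaction_objects: 1000    # deprecated, but lowering it helped here
    max_block_bytes: 1073741824     # 1 GiB cap on compacted block size
```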
I keep getting the following message in the compactor logs:
level=info ts=2023-03-03T10:02:53.799149003Z caller=compactor.go:107 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
I am not sure if that's the right behavior now. In S3 I end up with Parquet files of ~5 MB to ~6 MB in size.
So with the changes you made you are likely no longer compacting blocks. This can start to have an adverse effect on trace-by-id search as your blocklist grows quite long. Generally the most likely cause of big spikes in compaction is large traces. It would be interesting to set a max trace size and see if you start getting rejected spans:
```yaml
overrides:
  max_bytes_per_trace: <some number in bytes>
```
Another option we've had a lot of success with at Grafana is the GOMEMLIMIT env var. Set it to something slightly below your memory limit:
```yaml
Environment:
  GOMEMLIMIT: 9GiB
```
Compaction is a batch process. This will slow it down a bit, but it's much better to slow down slightly than to OOM and lose an entire batch.
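A minimal sketch of how this could be wired into the compactor's Kubernetes container spec; the container name is illustrative, and the 10Gi requests/limits are the ones mentioned later in this thread:

```yaml
containers:
  - name: compactor
    env:
      - name: GOMEMLIMIT
        value: "9GiB"        # slightly below the container memory limit
    resources:
      requests:
        memory: 10Gi
      limits:
        memory: 10Gi
```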
@joe-elliott Thanks for getting back to me on that one. I actually changed max_bytes_per_trace to a bigger value intentionally a long time ago. The reason is that we had, and still have, larger traces that get rejected with the default value. For quite some time we have been running with the relatively high value of 500000000 (slightly less than 500 MB).
It's really interesting though to try the GOMEMLIMIT env var. Let me do that as early as Monday and get back to you on that front. At least by then I would have tons of traces generated in my testing cluster and the compactor would have some serious work to do.
@joe-elliott unfortunately the GOMEMLIMIT didn't do anything for me here .. I am still running OOM at 10Gi memory requests and limits.
What kinds of operations is the compactor performing in memory? Also why isn't that done on disk instead e.g. like Loki or Thanos?
Somehow I fear that the way the Tempo compactor currently works could need some serious reconsideration, because I do believe I am not the only one out there who needs fairly large traces available.
@joe-elliott after exhausting all options, I just reverted back to using the v2 storage format. For me that's working like a charm in terms of performance and compaction (btw using exactly the same config as above). Of course that means I sacrifice TraceQL (which is something I was really eager to have) .. but I am really going crazy here over why this is an issue with Parquet and what exactly the compactor is doing in memory with Parquet that causes this massive memory need that I can't even limit.
Any ideas other than forcing a reduced max_bytes_per_trace? ..
unfortunately the GOMEMLIMIT didn't do anything for me here .. I am still running OOM at 10Gi memory requests and limits.
I'm sorry to hear that. It's been quite effective for us. Perhaps the heap itself is surpassing 10Gi when you are getting OOMed?
What kinds of operations is the compactor performing in memory? Also why isn't that done on disk instead e.g. like Loki or Thanos?
The primary use of memory in the compactor is translating traces from the Parquet columns into an in-memory trace object, combining them, and then re-marshalling to the Parquet columns. It would be possible, but difficult, to do this in some kind of piecewise manner using the disk. The size of the blocks doesn't matter b/c Tempo streams the input and output blocks, but the size of the individual traces does matter.
Any ideas other than forcing a reduced max_bytes_per_trace?
I just looked through your config. Do you really have 500MB traces? Yes, I believe that is likely the reason compactors are OOMing and reducing this value would likely help stabilize compactors. We use a max size of 50MB internally and even that feels too big sometimes.
Trace-by-id lookup on v2 will continue to be supported for the foreseeable future, but we are dropping search on v2 in 2.1. Hopefully we can find some way to mitigate your issues.
@joe-elliott So my final resolution (with stable memory consumption for the compactor) is as follows:
- Reducing the max trace size (max_bytes_per_trace) to the value you mentioned is being used by Grafana: 50000000 (50 MB)
- Setting the GOMEMLIMIT env var to half of the given memory requests / limits (I currently managed to reduce to 4Gi of memory requests & limits on my testing cluster .. so basically I set the GOMEMLIMIT to 2GiB)
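A minimal sketch of that resolution, assuming the Tempo overrides block and the compactor container env are defined separately (the exact file layout isn't shown in this thread):

```yaml
# Tempo configuration excerpt
overrides:
  max_bytes_per_trace: 50000000   # ~50 MB, the size mentioned as used at Grafana
---
# Compactor container env (Kubernetes), with 4Gi memory requests & limits
env:
  - name: GOMEMLIMIT
    value: "2GiB"                 # half of the 4Gi limit; raised later in the thread
```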
In all cases thanks for your help 😃 .. that really helped me a lot .. just maybe it would be worth documenting somewhere that using max_bytes_per_trace > 50 MB would lead to memory chaos and catastrophes 😄 .. I think that might help .. I would also love to see further memory enhancements in the future with the compactor (if possible) ..
One quick observation: I see now that the CPU consumption is fairly high (sometimes going as high as 10 CPUs) .. but that's okay as I don't run with CPU limits .. so I am letting the compactor eat as much as Kubernetes can give it ..
CPU is spiking b/c the GC is working extremely hard to keep your pods under 2Gi of used memory. I would push that a lot closer to the 4Gi limit you have specified. Maybe 3.5Gi and see what happens?
just maybe it would be worth documenting somewhere that using max_bytes_per_trace > 50 MB
Yeah, agree. After a couple days can you confirm if this solves your issue? If it does then let's turn this into a docs issue and make sure that we get some lines added that say 500MB is way too big :)
I would also love to see further memory enhancements in the future with the compactor (if possible) ..
Agree. Compactor memory in particular jumped from v2 -> vparquet. We're definitely looking to improve this and the ingester memory. The truth is that it just takes more CPU and memory to pack and unpack the vparquet rows/rowgroups than to marshal the trace proto. This is not avoidable, but I think we can do better than where we are.
CPU is spiking b/c the GC is working extremely hard to keep your pods under 2Gi of used memory. I would push that a lot closer to the 4Gi limit you have specified. Maybe 3.5Gi and see what happens?
Well, using 3500MiB as the GOMEMLIMIT solved it like a charm! Now I have slightly less than 1 CPU of usage. Thanks again 😄
Yeah, agree. After a couple days can you confirm if this solves your issue? If it does then let's turn this into a docs issue and make sure that we get some lines added that say 500MB is way too big :)
I can confirm that solves the issue and that a docs todo is needed .. So far I haven't seen any spans discarded .. so highly likely I was overcautious setting the max allowed to 500MB .. Maybe some recommendations from the Grafana experts would be a good fit for the docs, e.g. that the optimal max_bytes_per_trace should not exceed 50MB .. If I get some errors on the max size, I might slightly increase above 50 .. however I am crossing my fingers that this doesn't happen .. also as you stated previously, that shouldn't be the case with normal traces 😅
@joe-elliott On a separate track, I always wanted to ask this: what's really the benefit of using multiple Tempo compactors? .. I saw that I can run multiple instances .. but would that bring value and do parallel compactions? .. I ran 2 or 3 compactors together a while ago and haven't really seen them compacting in parallel. Is that correct?
Adding a docs tag. Let's document that with the Parquet backend Tempo has tested max_bytes_per_trace values of up to 50MB, and we know that with very large sizes, say 500MB, memory can spike uncontrollably.
I ran 2 or 3 compactors together a while ago and haven't really seen them compacting in parallel. Is that correct?
Multiple compactors will divide up the "job space" by the compaction_window specified in the compaction config. If your volume is high enough, then yeah, you will want multiple compactors all working on the blocklist to keep it reasonably low.
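A minimal sketch of where that setting lives (the 1h value here is illustrative; this thread doesn't specify one):

```yaml
compactor:
  compaction:
    compaction_window: 1h   # blocks in the same time window form jobs that
                            # multiple compactors can divide among themselves
```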
got ya thanks for the info on that front.
@joe-elliott just another question if I may .. I often notice that the tempo_discarded_spans_total metric reaches some big value with the reason trace_too_large_to_compact .. Is that a serious issue? Why am I getting that? What could be a potential mitigation?
This means the compactor found a trace in its blocks whose size exceeded your max trace size. Instead of trying to combine both parts of the trace it simply chose one and moved it forward to the next block.
I would review the traces that are being dropped like this. Perhaps there is a way to break them up into smaller traces?
Otel/Tempo/Grafana also support links between traces: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L220
If you do break up a large trace you could use linking to make related traces easier to find.
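As an aside, one way to keep an eye on these drops is a Prometheus alerting rule on that metric; this is only a sketch, and the group name, alert name, and threshold below are illustrative, not from this thread:

```yaml
groups:
  - name: tempo-compactor
    rules:
      - alert: TempoTracesTooLargeToCompact
        # Fires when spans are discarded because a trace exceeded the max trace size
        expr: |
          sum(rate(tempo_discarded_spans_total{reason="trace_too_large_to_compact"}[5m])) > 0
        for: 15m
        annotations:
          summary: "Tempo is discarding spans from traces that are too large to compact"
```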
@joe-elliott that linking seems really great .. I am currently using a setup like:
OTLP Collector (cluster x)
==> TEMPO (cluster observability)
==> Grafana (cluster observability)
Can you point me to the configs of the OTLP Collector / Tempo that would make that linking work? .. Is it already enabled by default?
I also understood from you that such drop is not critical, is it?
Can you point me to the configs of the OTLP Collector / Tempo that would make that linking work?
I don't know of a way to automatically break up traces and create links. I believe this has to be done manually. For instance, in Go, when starting a new span you pass in a SpanStartOption which can include the links you want the span to have.
I also understood from you that such drop is not critical, is it?
This is not a critical Tempo error, but you have to decide how important losing this trace data is to your org.
Got ya. Thanks a lot.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.
Hi there @joe-elliott, hope you are doing fine. Although the issue is closed, I'm still having issues handling discarded and refused spans as mentioned below. I have overridden rate_limit_bytes to 100000000 but the issue still persists. Your response will be much appreciated. Thanks.
It doesn't look like you're dropping spans due to rate limits based on your metrics. Consider increasing max trace size? Also, your comment seems unrelated to this issue. Perhaps find a more appropriate one or start a new issue?
Describe the bug
After switching to Parquet (and using Tempo 2.0.0) the Compactor is consistently being killed due to out of memory no matter how much memory it gets as limit (currently I am using 2Gi as memory limit and tried increasing to even 10Gi).
The compactor produces logs as follows and then crashes with OOM Killed
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The compactor shouldn't crash
Environment:
Additional Context
Current Tempo Config:
Additional Note
Prior to switching to Parquet, and with the exact same configs for the v2 storage format, everything was working fine .. so I am going crazy over what the problem now is with the Parquet storage format. Is it maybe related to me allowing a bigger value for max_bytes_per_trace? .. If so, then why was it fine with the v2 format?