grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Compactor OOM With Parquet #2162

Closed sherifkayad closed 1 year ago

sherifkayad commented 1 year ago

Describe the bug

After switching to Parquet (and using Tempo 2.0.0), the compactor is consistently being killed due to out of memory no matter how much memory it is given as a limit (currently I am using a 2Gi memory limit and have tried increasing it to as much as 10Gi).

The compactor produces logs like the following and then crashes with OOMKilled:

level=info ts=2023-03-03T00:17:02.749349082Z caller=compactor.go:144 msg="beginning compaction" traceID=74ce1e9630c2d0cc
level=info ts=2023-03-03T00:17:02.749400303Z caller=compactor.go:156 msg="compacting block" block="&{Version:vParquet BlockID:b83777cc-aba5-4073-b7ef-d82f68ccc9df MinID:[0 0 0 0 0 0 0 0 0 0 61 166 55 147 30 61] MaxID:[0 0 0 0 0 0 0 0 255 255 172 88 192 116 109 247] TenantID:single-tenant StartTime:2023-03-03 00:02:32 +0000 UTC EndTime:2023-03-03 00:02:43 +0000 UTC TotalObjects:6093 Size:5435955 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:6259}"
level=info ts=2023-03-03T00:17:02.798054897Z caller=compactor.go:156 msg="compacting block" block="&{Version:vParquet BlockID:c837c052-23ce-4d9c-8bca-c028edbe0499 MinID:[0 0 0 0 0 0 0 0 0 0 24 173 105 96 32 196] MaxID:[0 0 0 0 0 0 0 0 255 246 222 233 162 21 160 8] TenantID:single-tenant StartTime:2023-03-03 00:02:42 +0000 UTC EndTime:2023-03-03 00:02:52 +0000 UTC TotalObjects:6148 Size:5412819 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:6259}"
level=info ts=2023-03-03T00:17:02.824263888Z caller=compactor.go:156 msg="compacting block" block="&{Version:vParquet BlockID:49de4ce3-082b-4cf0-ab3e-da313643b15d MinID:[0 0 0 0 0 0 0 0 0 17 129 211 38 47 17 72] MaxID:[0 0 0 0 0 0 0 0 255 251 153 153 244 226 179 174] TenantID:single-tenant StartTime:2023-03-03 00:10:22 +0000 UTC EndTime:2023-03-03 00:10:32 +0000 UTC TotalObjects:6189 Size:5341702 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:6259}"
level=info ts=2023-03-03T00:17:02.839158683Z caller=compactor.go:156 msg="compacting block" block="&{Version:vParquet BlockID:1995b431-4d61-4abe-be03-69849c5d9790 MinID:[0 0 0 0 0 0 0 0 0 18 44 7 229 216 178 239] MaxID:[0 0 0 0 0 0 0 0 255 237 158 124 140 178 130 156] TenantID:single-tenant StartTime:2023-03-03 00:00:32 +0000 UTC EndTime:2023-03-03 00:00:43 +0000 UTC TotalObjects:6279 Size:5629672 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:6259}"
level=info ts=2023-03-03T00:17:05.831410254Z caller=poller.go:191 msg="writing tenant index" tenant=single-tenant metas=2910 compactedMetas=0

To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo (2.0.0)
  2. Add some traces
  3. Watch the compactor

Expected behavior

The compactor shouldn't crash

Environment:

Additional Context

Current Tempo config:

compactor:
  compaction:
    block_retention: 336h
    compacted_block_retention: 1h
    compaction_cycle: 300s
    compaction_window: 1h
    max_block_bytes: 107374182400
    max_compaction_objects: 6000000
    max_time_per_tenant: 5m
    retention_concurrency: 10
    v2_in_buffer_bytes: 5242880
    v2_out_buffer_bytes: 31457280
    v2_prefetch_traces_count: 1000
  ring:
    kvstore:
      store: memberlist
distributor:
  receivers:
    zipkin:
      endpoint: 0.0.0.0:9411
  ring:
    kvstore:
      store: memberlist
ingester:
  complete_block_timeout: 15m
  flush_check_period: 10s
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    tokens_file_path: /var/tempo/tokens.json
  max_block_bytes: 5242880
  max_block_duration: 30m
  trace_idle_period: 10s
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
  - tempo-gossip-ring
metrics_generator:
  processor:
    service_graphs:
      dimensions:
      - span.region
      - span.cluster
      - namespace
      - app
      - service
      histogram_buckets:
      - 0.1
      - 0.2
      - 0.4
      - 0.8
      - 1.6
      - 3.2
      - 6.4
      - 12.8
      max_items: 100000
      wait: 300s
      workers: 10
    span_metrics:
      dimensions:
      - span.region
      - span.cluster
      - namespace
      - app
      - service
      histogram_buckets:
      - 0.002
      - 0.004
      - 0.008
      - 0.016
      - 0.032
      - 0.064
      - 0.128
      - 0.256
      - 0.512
      - 1.02
      - 2.05
      - 4.1
  registry:
    collection_interval: 15s
    external_labels: {}
    stale_duration: 15m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write:
    - send_exemplars: true
      url: http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    remote_write_flush_deadline: 1m
    wal: null
multitenancy_enabled: false
overrides:
  max_bytes_per_trace: 500000000
  max_search_bytes_per_trace: 500000
  max_traces_per_user: 1000000
  metrics_generator_processors:
  - service-graphs
  - span-metrics
  per_tenant_override_config: /conf/overrides.yaml
querier:
  frontend_worker:
    frontend_address: tempo-query-frontend-discovery:9095
  max_concurrent_queries: 100
  search:
    external_endpoints: []
    external_hedge_requests_at: 0
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 60s
query_frontend:
  search:
    concurrent_jobs: 2000
    target_bytes_per_job: 400000000
server:
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 3100
  log_format: logfmt
  log_level: info
storage:
  trace:
    backend: s3
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    s3:
      bucket: MY-TRACES-BUCKET
      endpoint: s3.eu-central-1.amazonaws.com
      hedge_requests_at: 1s
      hedge_requests_up_to: 2
    wal:
      path: /var/tempo/wal

Additional note: prior to switching to Parquet, everything was working fine with the exact same config on the v2 storage format, so I am struggling to understand what the problem is with the Parquet storage format.

Could it be related to my allowing a larger value for max_bytes_per_trace? If so, why was it fine with the v2 format?

sherifkayad commented 1 year ago

Update: despite it being a deprecated property, changing compactor.compaction.max_compaction_objects to a much lower value (e.g. 1000) and reducing compactor.compaction.max_block_bytes to e.g. 1073741824 (1 GiB) seems to have fixed the OOM issue.
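For reference, the relevant part of the compactor config now looks roughly like this (a sketch with just the two values I changed):

compactor:
  compaction:
    max_compaction_objects: 1000    # was 6000000 (deprecated option)
    max_block_bytes: 1073741824     # 1 GiB, was 107374182400 (100 GiB)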

I keep getting the following message in the compactor logs:

level=info ts=2023-03-03T10:02:53.799149003Z caller=compactor.go:107 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant

I am not sure if that is the right behavior now. In S3 I end up with Parquet files of roughly 5-6 MB each.

joe-elliott commented 1 year ago

So with the changes you made you are likely no longer compacting blocks. This can start to have an adverse effect on trace-by-ID search as your blocklist grows quite long. Generally, the most likely cause of big memory spikes during compaction is large traces. It would be interesting to set a max trace size and see if you start getting rejected spans:

overrides:
  max_bytes_per_trace: <some number in bytes>

Another option we've had a lot of success with at Grafana is using the GOMEMLIMIT env var. Set this to something slightly below your mem limit:

Environment:
  GOMEMLIMIT: 9GiB

Compaction is a batch process. This will slow it down a bit, but it's much better to slow down slightly than to OOM and lose an entire batch.
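On Kubernetes that translates to setting the variable on the compactor container, something roughly like this (a sketch; the values just mirror a 10Gi limit):

containers:
  - name: compactor
    image: grafana/tempo:2.0.0
    env:
      - name: GOMEMLIMIT
        value: 9GiB          # soft memory limit for the Go runtime, just under the hard limit
    resources:
      limits:
        memory: 10Gi         # the kernel OOM-kills the container only above this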

sherifkayad commented 1 year ago

@joe-elliott Thanks for getting back to me on that one. I actually increased max_bytes_per_trace to a bigger value intentionally quite a while ago, because we do have larger traces that get rejected with the default value. For quite some time we have been running with the relatively high value of 500000000 (slightly less than 500MB).

It's really interesting though to try the GOMEMLIMIT env var. Let me do that as early as Monday and get back to you on that front. At least by then I would have tons of traces generated in my testing cluster and the compactor would have some serious work to do.

sherifkayad commented 1 year ago

@joe-elliott unfortunately GOMEMLIMIT didn't do anything for me here; I am still getting OOM killed with 10Gi memory requests and limits.

What kinds of operations is the compactor performing in memory? And why isn't that done on disk instead, e.g. like Loki or Thanos do?

I fear that the way the Tempo compactor currently works may need some serious reconsideration, because I believe I am not the only one out there who needs fairly large traces available.

sherifkayad commented 1 year ago

@joe-elliott after exhausting all options, I just reverted back to the v2 storage format. For me that works like a charm in terms of performance and compaction (using exactly the same config as above, by the way). Of course that means sacrificing TraceQL, which is something I was really eager to have, but I am genuinely puzzled as to why this is an issue with Parquet and what exactly the compactor is doing in memory with Parquet that causes a memory need so massive that I can't even cap it.

Any ideas other than forcing a reduced max_bytes_per_trace?

joe-elliott commented 1 year ago

unfortunately GOMEMLIMIT didn't do anything for me here; I am still getting OOM killed with 10Gi memory requests and limits.

I'm sorry to hear that. It's been quite effective for us. Perhaps the heap itself is surpassing 10Gi when you are getting OOMed?

What kinds of operations is the compactor performing in memory? And why isn't that done on disk instead, e.g. like Loki or Thanos do?

The primary use of memory in the compactor is translating traces from the Parquet columns into an in-memory trace object, combining them, and then re-marshalling back to Parquet columns. It would be possible, but difficult, to do this in some kind of piecewise manner using the disk. The size of the blocks doesn't matter because Tempo streams the input and output blocks, but the size of the individual traces does matter.

Any ideas other than forcing a reduced max_bytes_per_trace?

I just looked through your config. Do you really have 500MB traces? Yes, I believe that is likely the reason the compactors are OOMing, and reducing this value would likely help stabilize them. We use a max size of 50MB internally, and even that feels too big sometimes.
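For reference, a 50MB cap would look roughly like this in your overrides block (a sketch; the byte value is approximate):

overrides:
  max_bytes_per_trace: 50000000    # ~50MB, instead of the current 500000000 (~500MB)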

Trace-by-ID lookup on v2 will continue to be supported for the foreseeable future, but we are dropping search on v2 in 2.1. Hopefully we can find some way to mitigate your issues.

sherifkayad commented 1 year ago

@joe-elliott So my final resolution (with stable memory consumption for the compactor) is as follows:

In any case, thanks for your help 😃, that really helped me a lot. It might be worth documenting somewhere that using max_bytes_per_trace > 50MB can lead to memory chaos and catastrophes 😄, I think that would help. I would also love to see further memory improvements to the compactor in the future (if possible).

joe-elliott commented 1 year ago

One quick observation: I see now that the CPU consumption is fairly high (sometimes going as high as 10 CPUs), but that's okay since I don't run with CPU limits, so I am letting the compactor take as much as Kubernetes can give it.

CPU is spiking because the GC is working extremely hard to keep your pods under 2Gi of used memory. I would push that a lot closer to the 4Gi limit you have specified. Try 3.5Gi and see what happens?
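Roughly, in pod terms (a sketch using the numbers above):

env:
  - name: GOMEMLIMIT
    value: 3500MiB     # soft limit for the Go runtime, just under the hard limit
resources:
  limits:
    memory: 4Gi        # container hard limit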

it might be worth documenting somewhere that using max_bytes_per_trace > 50MB

Yeah, agree. After a couple of days, can you confirm whether this solves your issue? If it does, then let's turn this into a docs issue and make sure we get some lines added that say 500MB is way too big :)

I would also love to see further memory enhancements in the future with the compactor (if possible) ..

Agree. Compactor memory in particular jumped from v2 -> vparquet. We're definitely looking to improve this and the ingester memory. The truth is that it just takes more CPU and memory to pack and unpack the vparquet rows/rowgroups than to marshal the trace proto. That is not avoidable, but I think we can do better than where we are.

sherifkayad commented 1 year ago

CPU is spiking because the GC is working extremely hard to keep your pods under 2Gi of used memory. I would push that a lot closer to the 4Gi limit you have specified. Try 3.5Gi and see what happens?

Well, setting GOMEMLIMIT to 3500MiB worked like a charm! Now I have slightly less than 1 CPU of usage. Thanks again 😄

Yeah, agree. After a couple of days, can you confirm whether this solves your issue? If it does, then let's turn this into a docs issue and make sure we get some lines added that say 500MB is way too big :)

I can confirm that this solves the issue and that a docs TODO is needed. So far I haven't seen any spans discarded, so it's highly likely I was overly cautious in setting the maximum to 500MB. Some recommendations from the Grafana experts would be a good fit for the docs, e.g. that max_bytes_per_trace should not exceed roughly 50MB. If I get errors about the max size, I might increase it slightly above 50MB, but I am crossing my fingers that doesn't happen; as you stated previously, it shouldn't with normal traces 😅

sherifkayad commented 1 year ago

@joe-elliott on a separate track, I have always wanted to ask: what is really the benefit of running multiple Tempo compactors? I saw that I can run multiple instances, but does that add value and do they compact in parallel? I ran 2 or 3 compactors together a while ago and didn't really see them compacting in parallel. Is that correct?

joe-elliott commented 1 year ago

Adding a docs tag. Let's document that with the Parquet backend Tempo has been tested with max_bytes_per_trace values of up to 50MB, and we know that with very large sizes, say 500MB, memory can spike uncontrollably.

I ran 2 or 3 compactors together a while ago and didn't really see them compacting in parallel. Is that correct?

Multiple compactors will divide up the "job space" by the compaction_window specified in the compaction config. If your volume is high enough, then yeah, you will want multiple compactors all working on the blocklist to keep it reasonably short.
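Concretely, that maps onto the config you already posted roughly like this (a sketch, comments added for illustration):

compactor:
  compaction:
    compaction_window: 1h      # blocks are grouped into 1h windows; each window is a unit of work
  ring:
    kvstore:
      store: memberlist        # compactor replicas use this ring to split those windows among themselves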

sherifkayad commented 1 year ago

Got it, thanks for the info on that front.

sherifkayad commented 1 year ago

@joe-elliott just another question if I may: I notice that the tempo_discarded_spans_total metric often reaches a large value with the reason trace_too_large_to_compact. Is that a serious issue? Why am I getting it, and what could be a potential mitigation?


joe-elliott commented 1 year ago

This means the compactor found a trace in its blocks whose size exceeded your max trace size. Instead of trying to combine both parts of the trace, it simply chose one and moved it forward to the next block.

I would review the traces that are being dropped like this. Perhaps there is a way to break them up into smaller traces?

Otel/Tempo/Grafana also support links between traces: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L220

If you do break up a large trace you could use linking to make related traces easier to find.

sherifkayad commented 1 year ago

@joe-elliott that linking seems really great. I am currently running a setup like: OTLP Collector (cluster x) ==> Tempo (observability cluster) ==> Grafana (observability cluster)

Can you point me to the OTLP Collector / Tempo configs that would make that linking work? Or is it already enabled by default?

sherifkayad commented 1 year ago

I also understood from you that such a drop is not critical, is it?

joe-elliott commented 1 year ago

Can you point me to the configs of the OTLP Collector / Tempo that would make that linking work?

I don't know of a way to automatically break up traces and create links. I believe this has to be done manually. For instance, in Go, when you start a new span you can pass SpanStartOptions, which can include the links you want the span to have.
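A minimal sketch of that with the OpenTelemetry Go SDK (the helper, tracer name, and span name below are illustrative, not something Tempo or the collector provide):

package traces

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// startLinkedTrace starts a new root span in its own trace and attaches a link
// back to the span it was split off from, so the related traces stay discoverable.
func startLinkedTrace(ctx context.Context, original trace.SpanContext) (context.Context, trace.Span) {
	tracer := otel.Tracer("example/split-traces")
	return tracer.Start(ctx, "continued-work",
		trace.WithNewRoot(),                                // begin a separate trace
		trace.WithLinks(trace.Link{SpanContext: original}), // SpanStartOption carrying the link
	)
}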

I also understood from you that such a drop is not critical, is it?

This is not a critical Tempo error, but you have to decide how important losing this trace data is to your org.

sherifkayad commented 1 year ago

Got ya. Thanks a lot.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

huzaim-umer commented 4 months ago

Hi there @joe-elliott, hope you are doing well. Although the issue is closed, I'm still having trouble handling discarded and refused spans (screenshots attached). I have overridden rate_limit_bytes to 100000000 but the issue persists. Your response would be much appreciated. Thanks.

joe-elliott commented 4 months ago

Based on your metrics, it doesn't look like you're dropping spans due to rate limits. Consider increasing the max trace size? Also, your comment seems unrelated to this issue; perhaps find a more appropriate one or open a new issue?