grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Errors from distributor "context deadline exceeded" #1624

Closed: javdet closed this issue 1 year ago

javdet commented 2 years ago

Describe the bug: I have a lot of dropped traces. Errors from the distributors:

level=error ts=2022-08-08T12:54:50.237305922Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context deadline exceeded"
level=error ts=2022-08-08T12:55:01.377532608Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context canceled"
level=error ts=2022-08-08T12:55:20.983607767Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context deadline exceeded"
level=error ts=2022-08-08T12:55:34.83259462Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context deadline exceeded"
level=error ts=2022-08-08T12:55:38.333709329Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context deadline exceeded"
level=error ts=2022-08-08T12:55:41.23312395Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context canceled"
level=error ts=2022-08-08T12:55:44.841565822Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context canceled"
level=error ts=2022-08-08T12:56:01.016239081Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context canceled"

To Reproduce: Steps to reproduce the behaviour:

  1. Start Tempo from helm chart tempo-distributed 0.20.3 and Grafana agent 0.25.1
  2. Perform Operations (Write)

Tempo configuration

multitenancy_enabled: false
search_enabled: true
compactor:
  compaction:
    block_retention: 700h
    iterator_buffer_size: 800
  ring:
    kvstore:
      store: memberlist
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
querier:
  frontend_worker:
    frontend_address: tempo-query-frontend-discovery:9095
ingester:
  trace_idle_period: 1m
  lifecycler:
    ring:
      replication_factor: 2
      kvstore:
        store: memberlist
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - tempo-gossip-ring
overrides:
  max_search_bytes_per_trace: 0
  ingestion_burst_size_bytes: 60000000
  ingestion_rate_limit_bytes: 50000000
  max_bytes_per_trace: 9000000
server:
  http_listen_port: 3100
  log_level: info
  log_format: logfmt
  grpc_server_max_recv_msg_size: 4194304
  grpc_server_max_send_msg_size: 4194304
storage:
  trace:
    backend: s3
    s3:
      bucket: m2-tempo-prod
      region: us-central1
      access_key: ******
      secret_key: ******
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
    cache: memcached
    memcached:
      consistent_hash: true
      host: tempo-memcached
      service: memcached-client
      timeout: 500ms
metrics_generator:
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal

Grafana agent remote write

traces:
      configs:
        - name: jaeger
          receivers:
            ...
          batch:
            send_batch_size: 8192
            timeout: 20s
          remote_write:
            - endpoint: tempo-distributor.observability.svc.cluster.local:14250
              insecure: true
              insecure_skip_verify: true
              protocol: grpc
              format: jaeger
              sending_queue:
                queue_size: 5000
              retry_on_failure:
                max_elapsed_time: 30s

Expected behaviour: Distributors don't drop traces.

Environment:

Additional Context: Metric tempo_discarded_spans_total:

tempo_discarded_spans_total{container="distributor", endpoint="http", instance="10.107.156.36:3100", job="tempo-distributor", namespace="observability", pod="tempo-distributor-75db5b54cb-k6tl2", reason="internal_error", service="tempo-distributor", tenant="single-tenant"}
joe-elliott commented 2 years ago

Does this occur once you reach a certain threshold of spans or bytes per second? Replication factor 2 is also a weird config. We have used it in the past, but haven't touched it for probably over a year. It's hard to guarantee any behavior when using it.

javdet commented 2 years ago

Looking at the metric tempo_distributor_spans_received_total, it's about 20 spans per second and 2 MB/s. I've changed the replication factor to 3 and I see the same errors. Additional information: I have 6 distributors and 12 ingesters. Should I increase some limits?

joe-elliott commented 2 years ago

It seems like your spans are ~100 KB each (2 MB/s ÷ 20 spans/s)? That is extremely large for a single span. I'm guessing you will need to increase the write timeout between the distributor and the ingester:

distributor:
  ingester_client:
    remote_timeout: 5s  # default

You could also try reducing the batch size you push to tempo.
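A minimal sketch of both changes (the 20s timeout and the 4096 batch size are illustrative values; they happen to match what is tried in the next comment):

```yml
# Tempo: give the distributor more time when pushing to the ingesters
# (5s is the default remote_timeout, per the snippet above)
distributor:
  ingester_client:
    remote_timeout: 20s

# Grafana agent: send smaller batches to the distributor
traces:
  configs:
    - name: jaeger
      batch:
        send_batch_size: 4096  # down from 8192
        timeout: 20s
```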

javdet commented 2 years ago

Thanks for the advice. I increased remote_timeout to 20s and reduced the batch size to 4096. It seems that throughput increased, but I still see a lot of the same errors.

mac2000 commented 2 years ago

I'm having a similar issue, but discovered it from another side.

We have a "microservices" setup in Kubernetes.

On average the Tempo distributor consumes approximately 500 MB of RAM,

but from time to time memory usage spikes dramatically until it OOMs.

Hypothesis: there is a bunch of services talking to each other, and somehow some of them are not sending traces to Tempo, which makes the distributor queue them until it has received all the data; that may explain why this is happening.

But then I also found this error in the distributor logs.

If it helps, I can grab some metrics and/or logs to figure out what's going on.

For the moment I'm just increasing memory, already up to 4 GB:

image

But I bet it won't solve the problem.

Attaching the config; nothing fancy here, except a few minor tunings for other issues we found earlier:

tempo.yml

```yml
multitenancy_enabled: false
search_enabled: true
server:
  # http_listen_port: 3200
  log_level: info
  #log_level: debug
  #log_format: json
  # this one is related to relatively "big" payloads
  # these settings seem to do nothing when sending to :4317 (grpc) - we receive "grpc: received message larger than max (7157694 vs. 4194304)" - and when sending to :4318 (http) we receive "received message after decompression larger than max (4194305 vs. 4194304)"
  # but with this setting :4318 (http) starts working
  # the default value is 4194304, we increased it x10
  grpc_server_max_recv_msg_size: 41943040
  grpc_server_max_send_msg_size: 41943040
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
metrics_generator_enabled: true
overrides:
  # this one seems to be causing "size of search data exceeded max search bytes limit"
  # found at https://github.com/grafana/tempo/commit/09d455bd2f0fdabff318bd2a78e7824b19416c50
  max_search_bytes_per_trace: 200000
  metrics_generator_processors:
    - service-graphs
    - span-metrics
metrics_generator:
  ring:
    kvstore:
      store: memberlist
  processor:
    service_graphs:
      # MaxItems is the amount of edges that will be stored in the store; default is 10K
      # seems to be causing: "skipped processing of spans" dropped X spans
      max_items: 20000
    span_metrics:
      dimensions:
        - db.system
        - db.name
        # added by http client instrumentation libraries
        - http.method
        - http.host
        - http.route
        # added by apollo gateway with help of an overridden exporter
        - gateway.service
        - graphql.operation
  storage:
    path: /var/tempo/wal
    remote_write:
      - url: http://prometheus/api/v1/write
        send_exemplars: true
querier:
  frontend_worker:
    frontend_address: tempo-query-frontend-discovery:9095
compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    block_retention: 48h
storage:
  trace:
    backend: azure
    azure:
      storage-account-name: REDACTED
      container-name: REDACTED
      storage-account-key: REDACTED
    wal:
      path: /var/tempo/wal
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - tempo-gossip-ring
```

I will be glad to provide any additional data.

joe-elliott commented 2 years ago

Those are pretty dramatic spikes in memory. What does your traffic look like during those spikes? Can you use tempo_distributor_bytes_received_total to determine if traffic is increasing?

The distributor does not queue span data, but it does queue metrics data. If you have the metrics generator component enabled, then perhaps there is some queueing going on. Maybe check tempo_distributor_forwarder_queue_length? It will tell you if queueing is occurring.

mac2000 commented 2 years ago

Hm, @joe-elliott thank you for the suggestion, I did not even think about that; indeed we have the metrics generator enabled with some dimensions.

metrics config snippet

```yml
metrics_generator_enabled: true
overrides:
  # this one seems to be causing "size of search data exceeded max search bytes limit"
  # found at https://github.com/grafana/tempo/commit/09d455bd2f0fdabff318bd2a78e7824b19416c50
  max_search_bytes_per_trace: 200000
  metrics_generator_processors:
    - service-graphs
    - span-metrics
metrics_generator:
  ring:
    kvstore:
      store: memberlist
  processor:
    service_graphs:
      # MaxItems is the amount of edges that will be stored in the store; default is 10K
      # seems to be causing: "skipped processing of spans" dropped X spans
      max_items: 20000
    span_metrics:
      dimensions:
        - db.system
        - db.name
        # added by http client instrumentation libraries
        - http.method
        - http.host
        - http.route
        # added by apollo gateway with help of an overridden exporter
        - gateway.service
        - graphql.operation
```

We have a linear increase of traffic during the first half of the day and a linear decrease during the second (like a usual consumer website).

Out of curiosity, here is what I have found in various metrics.

Just for reference, memory usage of the distributor:

image

Memory usage of the metrics generator:

image

Even though we have the usual overall daily increase of traffic, neither the bytes nor the spans metrics were spiking:

image

And here is an interesting one: about 10 minutes before the first memory spike I see an increase in the queue, but I cannot say it is huge:

image

In general, right after that we start seeing that the distributor no longer sees the metrics generator as a client:

image

But theoretically that is kind of fine, e.g. after a restart the distributor was not added to the ring yet, or something like that.


From the metrics generator side, the only thing in the logs that happened right before it is this one:

image

And in the distributor logs I see a quadrillion of these:

failed to pushToQueue traces to tenant single-tenant queue: queue is full

So it seems like the memory usage is indeed related to these queues.

Then the question would be: isn't there a setting that prevents this queue from filling up (e.g. just drop metrics in such a case)?

The only related issue I have found here is #1541, but it does not seem to apply here.

joe-elliott commented 2 years ago

Perhaps try reducing your metrics_generator_forwarder_queue_size:

overrides:
  metrics_generator_forwarder_queue_size: ??

If unset this defaults to 100. I wonder if you have some extremely large trace batches coming in which are filling up this queue and OOMing your distributors?

mac2000 commented 2 years ago

Hm, interesting, I did not see such a setting in the docs, but I'm going to set it to, let's say, 500 and give it a week or so (unfortunately this issue reproduces randomly a few times a week, so I cannot catch it on demand); see the sketch below.
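A sketch of how that override could sit in the overrides block from the config above (500 is the value mentioned in this comment; the other keys are copied from the earlier snippet, and the default of 100 is stated in the previous comment):

```yml
overrides:
  max_search_bytes_per_trace: 200000
  metrics_generator_processors:
    - service-graphs
    - span-metrics
  # default is 100 (per the comment above); 500 is the value being tried here
  metrics_generator_forwarder_queue_size: 500
```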

> I wonder if you have some extremely large trace batches coming in which are filling up this queue and OOMing your distributors?

Probably not; otherwise we should see a spike in tempo_distributor_bytes_received_total, but it stays the same (~800 KB).

But just in case, checking tempo_distributor_traces_per_batch_count and tempo_distributor_traces_per_batch_sum:

image image

Not sure which of them is more correct here, but in both cases the numbers do not seem to be very big (though to be fair I have nothing to compare them with).

joe-elliott commented 2 years ago

I find that final graph interesting and perhaps revealing of the issue. It seems that your traffic pattern is changing right as the distributor OOMs are occurring. I believe this indicates that the average traces per batch is spiking. Batches with tons of tiny traces are more costly to process than batches with fewer large traces.

Perhaps try scaling up the distributors and see if it holds better? I'm also wondering if reducing the batch size in your Grafana agent or OTel collector might help. Generally batching reduces the overall Tempo resources required, but maybe in this case you have too many traces per batch.
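If the tempo-distributed helm chart is in use (the original reporter mentions chart 0.20.3 earlier in the thread), scaling the distributors is roughly a values change like the one below; the exact keys and numbers are illustrative, not taken from this thread:

```yml
# values.yaml sketch for the tempo-distributed chart (key names may vary by chart version)
distributor:
  replicas: 3        # scale out the distributors
  resources:
    limits:
      cpu: "1"       # extra CPU headroom so the Go GC can keep up
      memory: 2Gi
```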

What is the CPU usage of your distributors when they start OOMing? Are you hitting limits? What do GCs look like?

rate(go_gc_duration_seconds_count{}[1m])

mac2000 commented 2 years ago

Hm, sounds reasonable (my initial thought was that a 2x batch size should not spike memory usage to infinity 🤷‍♂️; also please note that we observe the queue increasing to 100 at 11:00, but the batch increase happens roughly 15 minutes later).

But you are right about CPU usage: we were close to the limits.

image

On the chart we are looking at rate(container_cpu_usage_seconds_total{container="tempo-distributor"}[5m]) and max(kube_pod_container_resource_limits{container="tempo-distributor",resource="cpu"}) (the limit is set to 500m, and we were using almost all of it).

And for rate(go_gc_duration_seconds_count{container="tempo-distributor"}[1m]) we have:

image

Also, out of curiosity (I'm not a Go engineer, but this should theoretically matter), here is what I see for go_goroutines{container=~"tempo-distributor"} (from an average of 100 goroutines we jumped about 10x):

image

Once again, thanks for the advice 🙏

I'm going to:

🤞 for this to fix everything (PS: we're going to need to create a PR for the troubleshooting page later on 💡)


> I'm also wondering if reducing the batch size in your grafana agent or otel collector might help

We did not use any additional collectors (neither Grafana agent nor OTel collector); the services push directly to the distributor.
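In a setup like that, batch sizes would have to be tuned in the application SDKs rather than in a collector. A hypothetical Kubernetes env snippet using the standard OpenTelemetry SDK batch-span-processor variables (not discussed in this thread; values are illustrative) might look like:

```yml
# Hypothetical per-service tweak: the OTel SDKs read these standard environment variables
# for the batch span processor; the spec default for the export batch size is 512.
env:
  - name: OTEL_BSP_MAX_EXPORT_BATCH_SIZE
    value: "256"   # illustrative: push smaller batches to the distributor
  - name: OTEL_BSP_SCHEDULE_DELAY
    value: "5000"  # milliseconds between exports (spec default)
```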

joe-elliott commented 2 years ago

Yeah, based on that dip in GCs, I think you are just pushing these containers to their limit. Once they saturate the CPU, the garbage collector gives up, memory skyrockets, and they OOM.

mac2000 commented 2 years ago

Just for history, and for anyone who lands here later:

In our case Monday is the heaviest day in terms of traffic and load, and it seems like we passed it.

Here are some charts of what I can see for today:

Memory usage was, as expected, ~500 MB (no spikes to infinity):

image

CPU usage is at ~50% of its limit (I do believe now that was the main reason, see below):

image

Here is how traffic increases over the first half of the day:

image

And here are two interesting charts:

Go garbage collector times (note how they decrease after 11:00):

image

Queue length (it correlates with the garbage collector and peaks at the max of 500 I overrode; our previous guess was that the queue causes all of that, but it seems like it does not):

image

But at the same time the number of batches and their size did not change:

image image

So, at the moment it is hard to say if that was the fix (need to wait a little bit more).

But at the same time it seems that in case of periodic distributor OOMs we need:

Things to observe more: the forwarder queue, though it does not seem to affect memory usage, at least in this case.


PS: from all these messages I'm starting to imagine how cool it might be to have some kind of script that collects all the important metrics from Tempo components and posts them in one click.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.