Retrieving big traces fails with "Internal Server Error Body: response larger than the max" & Query FE OOM

sherifkayad commented 2 years ago

Describe the bug

Adding a huge trace to Tempo isn't a problem thanks to the grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size that we are setting on the server to 100MB (104857600)

However, retrieving these (by ID e.g. from Grafana) causes really a weird behavior as follows:

In scenario one, we leave the grpc_client_config of the frontend worker to the default values => we get an error as follows: failed to get trace with id: ea233ba0fabfdf58 Status: 500 Internal Server Error Body: response larger than the max (45940141 vs 16777216)
In scenario two, we modify the grpc_client_config of the frontend worker as follows:
```
grpc_client_config
max_recv_msg_size: 33554432
max_send_msg_size: 33554432
```
In that case we get a similar error message failed to get trace with id: ea233ba0fabfdf58 Status: 500 Internal Server Error Body: response larger than the max (45940141 vs 33554432)
In scenario three, we go further and attempt to increase the max_recv_msg_size and max_send_msg_size of the grpc_client_config of the frontend worker to 64MB (67108864) and in that case the Query Frontend dies on us with an OOM error. Nevertheless, the memory usage looks fairly fine and we anyways granted the container round about 5 Gigs of RAM.

To Reproduce Steps to reproduce the behavior: Try one of the scenarios above with Tempo version 1.4.1 in microservices mode.

Expected behavior

The retrieval of the trace by ID should work fine (especially when the grpc_client_config is set to a high value)

Environment:

Infrastructure: Kubernetes
Deployment tool: Helm // Tempo Distributed Helm Chart

Additional Context

My current config for Tempo:

multitenancy_enabled: false
search_enabled: true
metrics_generator_enabled: false
compactor:
  compaction:
    block_retention: 744h
  ring:
    kvstore:
      store: memberlist
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    zipkin:
      endpoint: 0.0.0.0:9411
querier:
  query_timeout: 300s
  max_concurrent_queries: 10
  search:
    query_timeout: 300s
  frontend_worker:
    frontend_address: tempo-tempo-distributed-query-frontend-discovery:9095
    grpc_client_config:
      max_recv_msg_size: 33554432
      max_send_msg_size: 33554432
ingester:
  lifecycler:
    ring:
      replication_factor: 3
      kvstore:
        store: memberlist
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - tempo-tempo-distributed-gossip-ring
overrides:
  max_bytes_per_trace: 500000000
  max_search_bytes_per_trace: 500000
  metrics_generator_processors:
  - service-graphs
  - span-metrics
  per_tenant_override_config: /conf/overrides.yaml
server:
  http_listen_port: 3100
  log_level: info
  log_format: logfmt
  grpc_server_max_recv_msg_size: 1.048576e+08
  grpc_server_max_send_msg_size: 1.048576e+08
storage:
  trace:
    backend: s3
    s3:
      bucket: MY_BUCKET
      endpoint: s3.eu-central-1.amazonaws.com
    blocklist_poll: 5m
    pool:
      max_workers: 200
      queue_depth: 20000
    wal:
      path: /var/tempo/wal
    local:  
      path: /var/tempo/traces
    # Disable Memcached for now ..
    # cache: memcached
    # memcached:
    #   consistent_hash: true
    #   host: tempo-tempo-distributed-memcached
    #   service: memcached-client
    #   timeout: 500ms
    block:
      encoding: zstd

sherifkayad commented 2 years ago

Just for the sake of completeness, I followed this community post here https://community.grafana.com/t/how-to-query-a-500mb-trace/44052/2 and I am really wondering what I am doing wrong .. As it seems to me the person who raised the question has the same issue like mine, however, he managed to solve it somehow ..

zalegrala commented 2 years ago

Thanks for filing an issue. I'll see if I can reproduce this based on your config.

sherifkayad commented 2 years ago

@zalegrala amazing! Let me know if you need any further info from my side

zalegrala commented 2 years ago

I've been able to reproduce all but the third scenario there. In my case testing with a 75Mib trace, I would get some connection timeouts but not an OOM.

For the first two scenarios, once I reached the response larger than the max (45940141 vs 33554432) message, raising the configured value above the requested value allowed the query to proceed. I also had to raise the query timeout because my workstation was crying.

My current config is such.

...
    overrides:
        max_bytes_per_trace: 5e+08
        max_search_bytes_per_trace: 500000
        per_tenant_override_config: /overrides/overrides.yaml
    querier:
        frontend_worker:
            frontend_address: query-frontend-discovery.default.svc.cluster.local:9095
            grpc_client_config:
                max_recv_msg_size: 9e+07
                max_send_msg_size: 9e+07
        query_timeout: 10m
    server:
        grpc_server_max_recv_msg_size: 1.048576e+08
        grpc_server_max_send_msg_size: 1.048576e+08
...

After chatting with one of my teammates, it sounds like this is a known issue with the way that responses are handed back through the query-frontend, in that we do several rounds of marshal/unmarshal when combining traces to respond to queries.

For your scenario three, I think raising the memory for your query-frontend may be the only answer at this point.

For the project, this is something that we'd like to address. We have a few ideas about how to reduce the number of marshal/unmarshal which should reduce the memory pressure. To help us address this, can you provide some detail about your traces? What size are the traces that Tempo is struggling with? Are these traces the standard size in your environment, or are these ones the outliers?

sherifkayad commented 2 years ago

@zalegrala thanks for your prompt response .. below some feedback points from my side:

Did you actually perform the query against the Query Frontend? .. In my case aside from the timeouts in Grafana, I also attempted to port-forward to the Query Frontend and directly hit its Search by ID API to get that OOM behavior
As I said I increased the memory to ~5 Gigs and still I can see that the Query Frontend crashed with an OOM on me .. also worth mentioning the pod wasn't utilizing the 5 Gigs! .. Somehow it was showing ~1.something Gigs utilized then it went dark to OOM .. probably the increase happened so fast that Prometheus didn't catch it .. maybe I don't know 🤔
Unfortunately these kinds of big traces are becoming kinda standard with us .. we are talking about 60-something MB size traces with a few hundred spans between the >50 Microservices we have, the databases and the message broker
May I ask you, with 9e+07 as the max_send_msg_size how much memory did you allocate to the Query Frontend and the Querier?

zalegrala commented 2 years ago

Thanks for the information.

In my test, I did query the frontend directly. In my case, port 3200 is mapped to the query-frontend, so just issuing curl -s http://localhost:3200/api/traces/4fe154b38dee96fb19c30b4e2091e559 was enough for me to test.

For the limits, here is what I have set currently, which I could tweak and certainly get an OOM, since it is just running on my workstation. :)

{
  "limits": {
    "cpu": "4",
    "memory": "8Gi"
  },
  "requests": {
    "cpu": "100m",
    "memory": "100Mi"
  }
}

Checking this against one of our larger clusters, I see we don't even have limits set on the query-frontend, and in another cluster we have memory limit set to 4Gi, and regularly see spikes pretty close to that number. Just looking at the container_memory_working_set_bytes that we use in the mixin on the operational dashboard to get a sense.

That could certainly be the case that the memory spikes too quickly to detect. Other than execing into the pod and running ps in a loop, I'm not sure how to know the max memory used by the process.

sherifkayad commented 2 years ago

Let me try with a higher limit than the 5 Gigs e.g. with 8 Gigs or so and a 64 MB value for the grpc_client and get back to you.

zalegrala commented 2 years ago

This was enough to see the memory spike within the pod: while true; do ps -o rss,vsz; sleep .01; done

sherifkayad commented 2 years ago

out of curiosity, would it help to run more queriers? e.g. at the moment we are running 3 Queriers with some beefy resources. Would it help to scale these horizontally? .. Also worth mentioning we are running 2 Query Frontends.

zalegrala commented 2 years ago

Just confirming with my team, but I don't think there would be an impact here with the number of queriers. Each querier will require a little more memory on the frontend for tracking purposes, but not enough that I would expect it to impact your situation.

sherifkayad commented 2 years ago

@zalegrala I kept pushing the memory limits step by step till 16 Gigs .. then the query frontend consumption was something like:

Unfortunately the query didn't run .. it kep running almost infinitely and no results were retrieved. One other funny behavior I noticed was that even if you specify a time range where that trace shouldn't be there, the query frontend and the querier kept on working to get it .. I could see some logs in the Querier like:

level=info ts=2022-06-17T08:01:28.67472812Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=20a6b62c-0722-4816-9e14-b5ff29b5597a found=true
level=info ts=2022-06-17T08:01:28.675180513Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=839b2610-8862-45d6-b865-6b8edae13213 found=true
level=info ts=2022-06-17T08:01:28.928832925Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=1ec76d5e-dfbc-4ce2-827b-02efd37ec839 found=true
level=info ts=2022-06-17T08:01:28.983337825Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=42be281d-7c84-4b08-960f-ad3114bc88f2 found=true
level=info ts=2022-06-17T08:01:28.995746926Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=88a6ca41-d320-4970-a076-d4f547359dfe found=true
level=info ts=2022-06-17T08:01:29.009640648Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=a75c4d73-502c-4585-92a6-a7e1e1ad6847 found=true
level=info ts=2022-06-17T08:01:29.073388561Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=2d4117f0-a71c-4b93-8dd0-1b9382b672f0 found=true
level=info ts=2022-06-17T08:01:30.1867258Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=a32c7612-b477-4aed-9dcf-642dd247ee7f found=true
level=info ts=2022-06-17T08:01:30.192600175Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=a40cca20-64d2-450b-a308-ab40688af578 found=true
level=info ts=2022-06-17T08:01:30.231101367Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=881fc0ab-2374-4257-b5c3-439da49b00a9 found=true
level=info ts=2022-06-17T08:01:30.32067071Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=86ae1e37-0508-4e30-99e8-e2db2d96b65f found=true
level=info ts=2022-06-17T08:01:29.083943289Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=887e3942-e2ac-4222-be8e-bf2516a8621d found=true
level=info ts=2022-06-17T08:01:29.105556398Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=4b4763c2-a3b3-4b83-946d-f8dc2aee5d1e found=true
level=info ts=2022-06-17T08:01:29.185914154Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=89d373ec-654c-44a7-820a-2e92116cbe96 found=true
level=info ts=2022-06-17T08:01:29.286247542Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=2826aa37-4ebb-452f-a4c6-e07bfacf146f found=true
level=info ts=2022-06-17T08:01:29.288657905Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=9d3d3e0e-8918-4361-a77e-a1c1551238a7 found=true
level=info ts=2022-06-17T08:01:29.306442267Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=84a25621-aa0f-4276-8528-5cce91aff0e3 found=true
level=info ts=2022-06-17T08:01:29.309636864Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=280110cc-05f8-4fbb-8d08-d73ff07036fb found=true
level=info ts=2022-06-17T08:01:29.399623259Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=9f61d693-5557-4cf0-820f-9bac43b2cebc found=true
level=info ts=2022-06-17T08:01:29.426682792Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=1de57273-bfb0-4899-8925-b4a79ecd6e54 found=true
level=info ts=2022-06-17T08:01:29.482513792Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=2bc16b06-5fd0-4b0f-ae8c-5906c758414a found=true
level=info ts=2022-06-17T08:01:29.496012667Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=a69296f6-70fb-42df-b8fd-bff1ad11e31c found=true
level=info ts=2022-06-17T08:01:29.572987836Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=28dfd3d5-b84c-4506-adda-4e57096877e7 found=true
level=info ts=2022-06-17T08:01:29.597361773Z caller=tempodb.go:348 org_id=single-tenant msg="searching for trace in block" findTraceID=0000000000000000ea233ba0fabfdf58 block=a0610d3f-7a55-427d-956b-eb55e09cd200 found=true

These logs went on almost forever .. I really do believe something is fishy with the search behavior of that exact trace and similar traces in size. I have no idea what's wrong though ..

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

grafana / tempo

Retrieving big traces fails with "Internal Server Error Body: response larger than the max" & Query FE OOM #1489