grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

AWS S3 HTTP 4xx client errors #3924

Closed icemanDD closed 3 weeks ago

icemanDD commented 3 months ago

Describe the bug

We constantly see 4xx errors (~20% of total requests) in the AWS S3 buckets used for Tempo storage. Example error:

REST.GET.OBJECT tempo_cluster_seed.json "GET /tempo_cluster_seed.json HTTP/1.1" 400 InvalidArgument 432 - 17 - "-" "MinIO (linux; amd64) minio-go/v7.0.70"


To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo in microservices mode with the latest Docker image
  2. Send traffic to AWS S3 storage and configure the compactor for compaction

Expected behavior

No 4xx errors, or only a minimal rate of them, for S3 requests.

Additional Context

compactor:
  ring:
    kvstore:
      store: memberlist
    instance_id: 'tempo-compactor-{RANDOM_ID}'
    instance_interface_names:
      - ...
  compaction:
    block_retention: 720h

javiermolinar commented 3 months ago

Hi!

If you have access rights to see the bucket, could you check whether the file is there and can be downloaded?

icemanDD commented 3 months ago

tempo_cluster_seed.json is available in the bucket, content:

{"UID":"...","created_at":"...","version":{"version":"main-...","revision":"...","branch":"main","buildUser":"","buildDate":"","goVersion":"..."}}

icemanDD commented 3 months ago

What is this file for? Does compactor need to frequently access and update the json file?

javiermolinar commented 3 months ago

What is this file for? Does compactor need to frequently access and update the json file?

This file is used to report usage statistics. The problem could be a misconfiguration but it's hard to know from that error. If you don't need that feature (most likely you don't) just disable it:

https://grafana.com/docs/tempo/latest/configuration/#usage-report

icemanDD commented 3 months ago

Interesting, which components enable this by default and access the S3 file? Should I disable it for all Tempo microservices?

javiermolinar commented 3 months ago

Interesting, which components enable this by default and access the S3 file? Should I disable it for all Tempo microservices?

This is not per component; it's a part of Tempo itself. It checks your config and sends the report back to Grafana.

icemanDD commented 3 months ago

Then do we need to disable usage report in all Tempo components: distributor, ingester, compactor, metrics generator, querier and query frontend?

javiermolinar commented 3 months ago

Then do we need to disable usage report in all Tempo components: distributor, ingester, compactor, metrics generator, querier and query frontend?

Hi, no, it's a global config:

https://github.com/grafana/tempo/blob/5a6f140ef4b6ce18b32d78a31b3cdff6512a569f/cmd/tempo/app/config.go#L56

icemanDD commented 3 months ago

Do you mean that when I apply the global config in any of the Tempo components, it will work for all of them? To clarify, we are using a separate YAML file to configure each component. After adding

usage_report:
  reporting_enabled: false

to ingester and compactor, I still see high 4xx errors
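
Since each component here is configured from its own YAML file, one way to rule out a missed file is to check every file for the top-level setting. The following is a minimal sketch, not an official Tempo tool: the function name `usage_report_disabled` and the sample configs are hypothetical, and it uses a naive line scan rather than a real YAML parser, so it only handles simple block-style files like the snippets in this thread. Whether every component actually runs the usage reporter is not confirmed in this thread.

```python
import re


def usage_report_disabled(config_text):
    """Return True if a top-level usage_report block sets reporting_enabled: false.

    Naive line scan (assumption: simple block-style YAML, no anchors or flow syntax).
    """
    in_block = False
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        if not line[0].isspace():
            # A new top-level key starts; track whether it is usage_report.
            in_block = line.startswith("usage_report:")
            continue
        if in_block and re.fullmatch(r"\s+reporting_enabled:\s*false", line):
            return True
    return False


# Hypothetical per-component config contents, for illustration only.
configs = {
    "ingester": "usage_report:\n  reporting_enabled: false\n",
    "querier": "server:\n  http_listen_port: 3200\n",
}

for component, text in configs.items():
    state = "disabled" if usage_report_disabled(text) else "NOT disabled"
    print(f"{component}: usage report {state}")
```

Any component reported as "NOT disabled" would be a candidate for still requesting the seed file, under the assumptions above.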

javiermolinar commented 3 months ago

Do you mean that when I apply the global config in any of the Tempo components, it will work for all of them? To clarify, we are using a separate YAML file to configure each component. After adding

usage_report:
  reporting_enabled: false

to ingester and compactor, I still see high 4xx errors

Please take a look at the documentation linked a few comments above: https://grafana.com/docs/tempo/latest/configuration/#usage-report

https://github.com/grafana/helm-charts/blob/55304fdcd98754665c09ec998220778a14b77b36/charts/tempo/values.yaml#L42

icemanDD commented 3 months ago

After setting this global config on multiple Tempo components, I did not see the 4xx errors go down. Is there anything else I should try to stop Tempo from requesting GET /tempo_cluster_seed.json?

icemanDD commented 3 months ago

Can anyone else help take a look at this? We are incurring extra cost because of the additional GET requests.

joe-elliott commented 3 months ago

So it seems like you are getting a lot of 4XX's and we believe it may be related to the usage reporting?

Are all of the 4xx's trying to get tempo_cluster_seed.json? Can you share the rate of 4xx's by component? Are there any logs in Tempo that might explain what is happening?
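
One way to answer the first question is to tally the 4xx responses in the S3 access logs by object key. The sketch below is a rough illustration only: it parses just the portion of the log record quoted in the bug description (operation, key, request line, status, error code), the names `LOG_TAIL` and `count_4xx` are hypothetical, and the second sample line is invented to show that non-4xx records are skipped. Breaking the counts down by component would need fields that are not shown in this thread, such as the requester or remote IP.

```python
import re
from collections import Counter

# Matches the part of an access log record quoted in this issue:
#   OPERATION KEY "REQUEST-LINE" STATUS ERROR_CODE ...
LOG_TAIL = re.compile(
    r'(?P<operation>REST\.\S+)\s+'
    r'(?P<key>\S+)\s+'
    r'"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+'
    r'(?P<error>\S+)'
)


def count_4xx(lines):
    """Count 4xx responses grouped by (key, status, error code)."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None:
            continue  # not a recognisable record
        status = int(match["status"])
        if 400 <= status < 500:
            counts[(match["key"], status, match["error"])] += 1
    return counts


sample = [
    # Record fragment taken from the bug description above.
    'REST.GET.OBJECT tempo_cluster_seed.json "GET /tempo_cluster_seed.json HTTP/1.1" '
    '400 InvalidArgument 432 - 17 - "-" "MinIO (linux; amd64) minio-go/v7.0.70"',
    # Hypothetical successful request, included only to show it is ignored.
    'REST.GET.OBJECT example/key "GET /example/key HTTP/1.1" '
    '200 - 1024 1024 20 10 "-" "MinIO (linux; amd64) minio-go/v7.0.70"',
]

for (key, status, error), n in count_4xx(sample).most_common():
    print(f"{n}\t{status}\t{error}\t{key}")
```

If nearly all of the counted 4xx entries share the `tempo_cluster_seed.json` key, that would support the usage-reporting theory; a spread across many keys would point elsewhere.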

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.