grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Tempo Compactions Failing, CloudFlare R2 Bucket Always Increasing, Compactor throwing error: "error completing block: error completing multipart upload" #4099

Open alextricity25 opened 1 week ago

alextricity25 commented 1 week ago

Describe the bug

About a week ago, our Tempo alerts for compaction failures started firing.

This is the query that we use. If it's above 1, the alerts fire.

sum by (cluster, namespace) (increase(tempodb_compaction_errors_total{}[1h]))

[screenshot]
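For reference, this is roughly how that alert is wired up as a Prometheus rule. The group/alert names, the for duration, and the labels below are illustrative placeholders, not our exact alerting config:

    groups:
      - name: tempo-compaction
        rules:
          - alert: TempoCompactionErrors
            # fires when more than one compaction error was recorded over the last hour
            expr: sum by (cluster, namespace) (increase(tempodb_compaction_errors_total{}[1h])) > 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: 'Tempo compaction errors in {{ $labels.cluster }}/{{ $labels.namespace }}'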

I checked the compactor logs (we only have one compactor running), and the only suspicious thing I've noticed is messages like these:

level=error ts=2024-09-17T16:39:39.029886052Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID da530417-5510-4f22-996f-c35336eea572: error completing block: error completing multipart upload, object: single-tenant/da530417-5510-4f22-996f-c35336eea572/data.parquet, obj etag: : All non-trailing parts must have the same length."
level=error ts=2024-09-17T16:41:23.423707238Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID 5b745471-a889-40e9-8152-006625469f6c: error completing block: error completing multipart upload, object: single-tenant/5b745471-a889-40e9-8152-006625469f6c/data.parquet, obj etag: : All non-trailing parts must have the same length."
level=error ts=2024-09-17T16:43:08.882899458Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID 055d0b14-7789-492f-99ca-bd0b74aa1f2c: error completing block: error completing multipart upload, object: single-tenant/055d0b14-7789-492f-99ca-bd0b74aa1f2c/data.parquet, obj etag: : All non-trailing parts must have the same length."

On top of that, I also noticed that the R2 bucket we use for traces has been growing steadily in size, which didn't happen before the compaction errors started.

[screenshot]

To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo 2.6.0
  2. Configure Tempo to use a Cloudflare R2 bucket as the S3 backend
  3. Perform Operations (Read/Write/Others)
  4. Observe compaction errors

Expected behavior

I expect there to be no compaction errors.

Environment:

Additional Context

I've looked at this runbook, and have tried increasing the memory of the compactor pod significantly and adjusting compaction_window to 30min. I've also tried adjusting max_block_bytes to 5G, but the problem still persists. Some other issues that I've come across are:

https://github.com/grafana/tempo/issues/3529
https://github.com/grafana/tempo/issues/1774

After looking at those issues, I've since adjusted my bucket policies to delete objects one day after my Tempo block retention period, per @joe-elliott's suggestion here.
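For context, this is roughly what those compactor tweaks look like in the Tempo config. The compaction_window and max_block_bytes values are the ones mentioned above; the rest of the block (including block_retention) is illustrative and should match your own deployment:

    compactor:
      compaction:
        block_retention: 336h          # illustrative; set to your actual retention period
        compaction_window: 30m         # reduced from the default while debugging
        max_block_bytes: 5368709120    # ~5G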

joe-elliott commented 1 week ago

This seems to be an incompatibility between R2 and S3. Here is a similar discussion on Cloudflare's forums:

https://community.cloudflare.com/t/all-non-trailing-parts-must-have-the-same-length/552190

It seems that when pushing a multipart upload to R2, all segments must have the same length except for the final part. When uploading a block, Tempo currently flushes one row group at a time, and each row group has a variable number of bytes. This is not easily corrected in Tempo.
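To make that concrete with illustrative numbers: a block uploaded as parts of, say, 68231216 bytes, 68339859 bytes, and a smaller trailing part is accepted by S3 (which only requires every part except the last to be at least 5 MiB), but rejected by R2 because the second part is not the same length as the first.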

Unless you find some very clean way to resolve this in Tempo we would likely not take a PR to "fix" this issue. To use Tempo with R2 we will likely need R2 to be S3 compatible.

alextricity25 commented 1 week ago

Thanks for your reply @joe-elliott. Yes, I ran into that post too, and a few others, but I figured maybe there is a way to configure Tempo to send chunks of the same length. Wouldn't setting parquet_row_group_size_bytes effectively do this?

Another interesting thing to note is that some compaction cycles do complete successfully. Not all of them fail with the "All non-trailing parts must have the same length" error. For example, here are some more logs from the compactor, which show that blocks are indeed being flushed successfully (does "flushed" in this context mean being sent to the backend storage bucket via the S3 API?).

level=info ts=2024-09-20T13:49:05.555318241Z caller=compactor.go:186 msg="beginning compaction" traceID=42f3b0ba265a0113
level=info ts=2024-09-20T13:49:05.555355933Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:ab121e03-8931-49e0-b144-11e908cf6d37 TenantID:single-tenant StartTime:2024-09-16 16:48:10 +0000 UTC EndTime:2024-09-16 17:19:30 +0000 UTC TotalObjects:40030 Size:197946447 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:3 DataEncoding: BloomShardCount:1 FooterSize:50483 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:49:05.653482732Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:8a64f55c-362e-4e10-bba4-f8832e6b2d36 TenantID:single-tenant StartTime:2024-09-16 16:55:20 +0000 UTC EndTime:2024-09-16 17:26:35 +0000 UTC TotalObjects:40253 Size:197983074 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:3 DataEncoding: BloomShardCount:1 FooterSize:50483 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:49:05.791970019Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:69f50bc3-df70-4256-a622-5d3297170408 TenantID:single-tenant StartTime:2024-09-16 17:18:04 +0000 UTC EndTime:2024-09-16 17:56:45 +0000 UTC TotalObjects:4927 Size:3881037 CompactionLevel:1 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:18783 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:49:53.774626347Z caller=compactor.go:250 msg="flushed to block" bytes=68231216 objects=14956 values=99996196
level=info ts=2024-09-20T13:50:37.246234046Z caller=compactor.go:250 msg="flushed to block" bytes=68339859 objects=15104 values=99993405
level=error ts=2024-09-20T13:51:19.25979008Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID 607efda4-c183-40b7-89e5-90c19ca6c463: error completing block: error completing multipart upload, object: single-tenant/607efda4-c183-40b7-89e5-90c19ca6c463/data.parquet, obj etag: : All non-trailing parts must have the same length."
level=info ts=2024-09-20T13:51:19.2598735Z caller=compactor.go:155 msg="Compacting hash" hashString=single-tenant-479584-0
level=info ts=2024-09-20T13:51:19.259919754Z caller=compactor.go:186 msg="beginning compaction" traceID=36d004b60c5e7aa1
level=info ts=2024-09-20T13:51:19.259962307Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:c0c857ff-d006-4f33-ad48-ea7a7bbe1031 TenantID:single-tenant StartTime:2024-09-16 16:25:10 +0000 UTC EndTime:2024-09-16 16:56:25 +0000 UTC TotalObjects:43818 Size:218888946 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:4 DataEncoding: BloomShardCount:1 FooterSize:66804 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:51:19.395246194Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:90f85458-f556-49b2-b78c-41ace5cc89fc TenantID:single-tenant StartTime:2024-09-16 16:17:51 +0000 UTC EndTime:2024-09-16 16:49:20 +0000 UTC TotalObjects:45199 Size:224820188 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:4 DataEncoding: BloomShardCount:1 FooterSize:67197 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:51:19.539027714Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:d36b0a24-644a-4698-9588-07db2b0f9188 TenantID:single-tenant StartTime:2024-09-16 15:47:40 +0000 UTC EndTime:2024-09-16 16:26:15 +0000 UTC TotalObjects:10961 Size:11425383 CompactionLevel:1 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:19148 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:52:06.777694147Z caller=compactor.go:250 msg="flushed to block" bytes=68436112 objects=16185 values=99991189

I wonder if there are some "sweet spot" settings for the compactor that would give me fewer failed compaction flushes. Maybe setting a low max_compaction_objects or max_block_bytes would ensure that the multipart upload sends chunks of the same length? I'll play around more with these settings.

joe-elliott commented 1 week ago

parquet_row_group_size_bytes is a target value. There's no way I'm aware of to enforce an exact row group size in parquet. Each row is added atomically and can't be split to force the row group to hit an exact byte count.

I wonder if there are some "sweet spot" settings for the compactor that would give me fewer failed compaction flushes.

Technically you can raise your row group size so that compactors/ingesters always flush parquet files with exactly one row group, but this will not work except for the smallest Tempo installs.
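If you did want to try it, the knob lives under the block config, something like the sketch below. The exact value is illustrative; it would have to exceed the size of the largest block you ever write for the single-row-group effect to hold:

    storage:
      trace:
        block:
          version: vParquet4
          parquet_row_group_size_bytes: 1073741824   # ~1 GiB target; only plausible for very small installs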

alextricity25 commented 1 week ago

Got it. Interestingly enough, I set max_compaction_objects to 10000, and that seems to be moving things in the right direction...

[screenshots]
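For reference, the change itself is just this one line in the compactor config (10000 being the value mentioned above):

    compactor:
      compaction:
        max_compaction_objects: 10000   # caps the number of objects per compacted block, keeping output blocks small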

joe-elliott commented 1 week ago

Got it. Interestingly enough, I set max_compaction_objects to 10000, and that seems to be moving things in the right direction...

You are probably restricting the compactors to only creating very small blocks. These blocks all have 1 row group so the multipart upload is succeeding.