Good catch. This is an innocuous error. In RF3, if the querier gets 2 responses it can be assured that it has received all spans for a trace (due to read/write quorum), and so it cancels the final request.
We should log context cancelled on query at the debug level in the ingester.
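For anyone wondering what that looks like in practice, here is a minimal Go sketch of the fan-out-and-cancel pattern described above. It is not Tempo's actual code: the helper names and the `rf/2 + 1` read quorum are illustrative assumptions.

```go
// Sketch of why the ingester sees "context canceled" with replication_factor: 3:
// the querier fans a trace-by-ID request out to all 3 replicas and, as soon as
// the read quorum (rf/2 + 1 = 2) has answered, cancels the request still in flight.
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// queryIngester stands in for one per-replica trace lookup. It returns
// context.Canceled if the querier gives up on it early.
func queryIngester(ctx context.Context, replica int) error {
	select {
	case <-time.After(time.Duration(50+rand.Intn(100)) * time.Millisecond):
		return nil // got the spans from this replica
	case <-ctx.Done():
		return ctx.Err() // querier already has its quorum
	}
}

func main() {
	const (
		replicationFactor = 3
		readQuorum        = replicationFactor/2 + 1 // 2 when rf = 3
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan error, replicationFactor)
	for r := 0; r < replicationFactor; r++ {
		go func(replica int) { results <- queryIngester(ctx, replica) }(r)
	}

	ok := 0
	for i := 0; i < replicationFactor; i++ {
		err := <-results
		switch {
		case err == nil:
			ok++
			if ok == readQuorum {
				// Quorum reached: every span is guaranteed to be in the
				// responses we already have, so the last request is
				// redundant and gets cancelled.
				cancel()
			}
		case errors.Is(err, context.Canceled):
			// This is the "error" the ingester logs. It is expected and
			// harmless; logging it at debug level would hide the noise.
			fmt.Println("debug: query cancelled after read quorum was reached")
		default:
			fmt.Println("real error:", err)
		}
	}
	fmt.Printf("got %d/%d responses (quorum %d)\n", ok, replicationFactor, readQuorum)
}
```

With RF1 (and the previous 2-replica setup), the quorum equals the number of requests sent, so there is nothing left to cancel, which matches the observation that the messages disappear when rolling back.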
Describe the bug
Our ingester logs the following errors and warnings:
We run the Ingester with 3 replicas and `replication_factor: 3`. We used to run the Ingester with only 2 replicas and `replication_factor: 1`. If we change our current setup back to `replication_factor` 1 or even 2, the error is gone. Currently we are a little confused by those errors: they happen quite frequently, but they do not seem to break anything.
To Reproduce
At least reproducible in our environment throughout all of our clusters:

1. Start with `replication_factor: 1`
2. Change `replication_factor` to `3`
Expected behavior
Running with `replication_factor: 3` works.

Environment:
Additional Context
We needed to tune the limits quite a lot.
Configs:
Overrides:
```
overrides:
  "*":
    max_traces_per_user: 5000000
    ingestion_burst_size_bytes: 500000000
    ingestion_rate_limit_bytes: 500000000
    metrics_generator_processors:
      - service-graphs
      - span-metrics
```
Config:
```
usage_report:
  reporting_enabled: false
multitenancy_enabled: false
compactor:
  compaction:
    block_retention: 168h
  ring:
    kvstore:
      store: memberlist
metrics_generator:
  processor:
    service_graphs:
      max_items: 100000
      dimensions:
        - k8s.namespace.name
    span_metrics:
      dimensions:
        - k8s.namespace.name
  ring:
    kvstore:
      store: memberlist
  storage:
    path: {{ .Values.tempo.dataDir }}/wal
    remote_write: {{ include "common.tplvalues.render" (dict "value" .Values.metricsGenerator.remoteWrite "context" $) | nindent 6 }}
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    ...snip
query_frontend:
  max_outstanding_per_tenant: 3000
  search:
    concurrent_jobs: 2000
querier:
  frontend_worker:
    frontend_address: {{ include "grafana-tempo.query-frontend.fullname" . }}-headless:{{ .Values.queryFrontend.service.ports.grpc }}
    grpc_client_config:
      max_send_msg_size: 500000000
  trace_by_id:
    query_timeout: "60s"
  search:
    query_timeout: "120s"
  max_concurrent_queries: 50
ingester:
  lifecycler:
    ring:
      replication_factor: 3
      kvstore:
        store: memberlist
    tokens_file_path: {{ .Values.tempo.dataDir }}/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - {{ include "grafana-tempo.gossip-ring.fullname" . }}
overrides:
  per_tenant_override_config: /bitnami/grafana-tempo/conf/overrides.yaml
server:
  http_server_read_timeout: "60s"
  http_server_write_timeout: "60s"
  http_listen_port: {{ .Values.tempo.containerPorts.web }}
  grpc_server_max_recv_msg_size: 500000000
  grpc_server_max_send_msg_size: 500000000
storage:
  trace:
    backend: s3
    s3:
      endpoint: ${BUCKET_HOST}
      bucket: ${BUCKET_NAME}
      access_key: ${AWS_ACCESS_KEY_ID}
      secret_key: ${AWS_SECRET_ACCESS_KEY}
      forcepathstyle: true
      insecure: true
    local:
      path: {{ .Values.tempo.dataDir }}/traces
    wal:
      path: {{ .Values.tempo.dataDir }}/wal
    cache: memcached
    memcached:
      consistent_hash: true
      host: {{ include "grafana-tempo.memcached.url" . }}
      service: memcache
      timeout: 500ms
    search:
      cache_control:
        footer: true
        column_index: true
        offset_index: true
    block:
      version: vParquet3
    pool:
      max_workers: 100
    cache_min_compaction_level: 1
    cache_max_block_age: 48h
```