grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0
4.03k stars 521 forks source link

ResourceExhausted - Error on span names search #2207

Closed kostya9 closed 1 year ago

kostya9 commented 1 year ago

When searching for span names, I encounter ResourceExhausted error.

rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5331337 vs. 4194304)

image

In logs, this looks like this:

vel=error ts=2023-03-15T11:37:57.404836021Z caller=frontend_processor.go:70 msg="error processing requests" address=127.0.0.1:9095 err="rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5330695 vs. 4194304)"
level=warn ts=2023-03-15T11:37:58.244604365Z caller=grpc_logging.go:64 method=/frontend.Frontend/Process duration=28m33.210104497s err="rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5330697 vs. 4194304)" msg=gRPC
level=info ts=2023-03-15T11:37:58.244696423Z caller=handler.go:73 tenant=single-tenant method=GET traceID=1f7f5d02bdf27039 url=/api/search/tag/name/values duration=1.645341098s response_size=0 status=500 err="rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5330697 vs. 4194304)"
level=error ts=2023-03-15T11:37:58.244695546Z caller=frontend_processor.go:70 msg="error processing requests" address=127.0.0.1:9095 err="rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5330697 vs. 4194304)"
level=warn ts=2023-03-15T11:37:58.244743799Z caller=logging.go:86 traceID=1f7f5d02bdf27039 msg="GET /api/search/tag/name/values (500) 1.645435978s Response: \"rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5330697 vs. 4194304)\\n\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate; Accept-Language: en-US,en;q=0.9; User-Agent: Grafana/8.3.3; X-Grafana-Org-Id: 1; "

To Reproduce Steps to reproduce the behavior:

  1. Start Tempo (https://hub.docker.com/layers/grafana/tempo/main-40cc4c5/images/sha256-37a3baa36c021f58b16e03ac694b355617d8abdfc2588aa31a5685f8711275cc?context=explore, this commit - https://github.com/grafana/tempo/commit/40cc4c5c6545f8ddeaedb2c9b56d3e73e7b12a70)
  2. Go to Tempo grafana page and wait for span names loading to complete

Expected behavior I expect that tag values should be limited by config (e.g. we currently set it via max_bytes_per_tag_values_query: 5000000), so this error should not probably manifest itself.

Environment:

Additional Context This error looks like it depends on the data we store somehow, however I still would expect it just to limit the values.

Config
GET /status/runtime_config
defaults:
  ingestion_rate_strategy: local
  ingestion_rate_limit_bytes: 150000000
  ingestion_burst_size_bytes: 200000000
  search_tags_allow_list: null
  max_traces_per_user: 100000
  max_global_traces_per_user: 0
  max_search_bytes_per_trace: 50000
  metrics_generator_ring_size: 0
  metrics_generator_processors: null
  metrics_generator_max_active_series: 0
  metrics_generator_collection_interval: 0s
  metrics_generator_disable_collection: false
  metrics_generator_forwarder_queue_size: 0
  metrics_generator_forwarder_workers: 0
  metrics_generator_processor_service_graphs_histogram_buckets: []
  metrics_generator_processor_service_graphs_dimensions: []
  metrics_generator_processor_span_metrics_histogram_buckets: []
  metrics_generator_processor_span_metrics_dimensions: []
  block_retention: 0s
  max_bytes_per_tag_values_query: 5000000
  max_search_duration: 0s
  max_bytes_per_trace: 5000000
  per_tenant_override_config: /conf/overrides.yaml
  per_tenant_override_period: 10s
overrides: {}
GET /status/config
---
target: all
search_enabled: true
metrics_generator_enabled: false
http_api_prefix: ""
server:
    http_listen_network: tcp
    http_listen_address: ""
    http_listen_port: 3100
    http_listen_conn_limit: 0
    grpc_listen_network: tcp
    grpc_listen_address: ""
    grpc_listen_port: 9095
    grpc_listen_conn_limit: 0
    http_tls_config:
        cert_file: ""
        key_file: ""
        client_auth_type: ""
        client_ca_file: ""
    grpc_tls_config:
        cert_file: ""
        key_file: ""
        client_auth_type: ""
        client_ca_file: ""
    register_instrumentation: true
    graceful_shutdown_timeout: 30s
    http_server_read_timeout: 30s
    http_server_write_timeout: 30s
    http_server_idle_timeout: 2m0s
    grpc_server_max_recv_msg_size: 4194304
    grpc_server_max_send_msg_size: 4194304
    grpc_server_max_concurrent_streams: 100
    grpc_server_max_connection_idle: 2562047h47m16.854775807s
    grpc_server_max_connection_age: 2562047h47m16.854775807s
    grpc_server_max_connection_age_grace: 2562047h47m16.854775807s
    grpc_server_keepalive_time: 2h0m0s
    grpc_server_keepalive_timeout: 20s
    grpc_server_min_time_between_pings: 10s
    grpc_server_ping_without_stream_allowed: true
    log_format: logfmt
    log_level: info
    log_source_ips_enabled: false
    log_source_ips_header: ""
    log_source_ips_regex: ""
    log_request_at_info_level_enabled: false
    http_path_prefix: ""
distributor:
    ring:
        kvstore:
            store: memberlist
            prefix: collectors/
            consul:
                host: localhost:8500
                acl_token: ""
                http_client_timeout: 20s
                consistent_reads: false
                watch_rate_limit: 1
                watch_burst_size: 1
                cas_retry_delay: 1s
            etcd:
                endpoints: []
                dial_timeout: 10s
                max_retries: 10
                tls_enabled: false
                tls_cert_path: ""
                tls_key_path: ""
                tls_ca_path: ""
                tls_server_name: ""
                tls_insecure_skip_verify: false
                username: ""
                password: ""
            multi:
                primary: ""
                secondary: ""
                mirror_enabled: false
                mirror_timeout: 2s
        heartbeat_period: 5s
        heartbeat_timeout: 5m0s
        instance_id: grafana-tempo-0
        instance_interface_names:
            - eth0
            - en0
        instance_port: 0
        instance_addr: ""
    receivers:
        jaeger:
            protocols:
                grpc:
                    endpoint: 0.0.0.0:14250
                thrift_binary:
                    endpoint: 0.0.0.0:6832
                thrift_compact:
                    endpoint: 0.0.0.0:6831
                thrift_http:
                    endpoint: 0.0.0.0:14268
        opencensus: null
        otlp:
            protocols:
                grpc:
                    endpoint: 0.0.0.0:4317
                http:
                    endpoint: 0.0.0.0:4318
    override_ring_key: distributor
    log_received_traces: false
    extend_writes: true
    search_tags_deny_list: []
ingester_client:
    pool_config:
        checkinterval: 15s
        healthcheckenabled: true
        healthchecktimeout: 1s
    remote_timeout: 5s
    grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
        grpc_compression: snappy
        rate_limit: 0
        rate_limit_burst: 0
        backoff_on_ratelimits: false
        backoff_config:
            min_period: 100ms
            max_period: 10s
            max_retries: 10
        tls_enabled: false
        tls_cert_path: ""
        tls_key_path: ""
        tls_ca_path: ""
        tls_server_name: ""
        tls_insecure_skip_verify: false
metrics_generator_client:
    pool_config:
        checkinterval: 15s
        healthcheckenabled: true
        healthchecktimeout: 1s
    remote_timeout: 5s
    grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
        grpc_compression: snappy
        rate_limit: 0
        rate_limit_burst: 0
        backoff_on_ratelimits: false
        backoff_config:
            min_period: 100ms
            max_period: 10s
            max_retries: 10
        tls_enabled: false
        tls_cert_path: ""
        tls_key_path: ""
        tls_ca_path: ""
        tls_server_name: ""
        tls_insecure_skip_verify: false
querier:
    search:
        query_timeout: 30s
        prefer_self: 2
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
    query_timeout: 10s
    max_concurrent_queries: 5
    frontend_worker:
        frontend_address: 127.0.0.1:9095
        scheduler_address: ""
        dns_lookup_duration: 10s
        parallelism: 2
        match_max_concurrent: true
        id: ""
        grpc_client_config:
            max_recv_msg_size: 104857600
            max_send_msg_size: 16777216
            grpc_compression: gzip
            rate_limit: 0
            rate_limit_burst: 0
            backoff_on_ratelimits: false
            backoff_config:
                min_period: 100ms
                max_period: 1s
                max_retries: 5
            tls_enabled: false
            tls_cert_path: ""
            tls_key_path: ""
            tls_ca_path: ""
            tls_server_name: ""
            tls_insecure_skip_verify: false
    query_relevant_ingesters: false
query_frontend:
    log_queries_longer_than: 0s
    max_body_size: 0
    query_stats_enabled: false
    max_outstanding_per_tenant: 100
    querier_forget_delay: 0s
    scheduler_address: ""
    scheduler_dns_lookup_period: 0s
    scheduler_worker_concurrency: 0
    grpc_client_config:
        max_recv_msg_size: 0
        max_send_msg_size: 0
        grpc_compression: ""
        rate_limit: 0
        rate_limit_burst: 0
        backoff_on_ratelimits: false
        backoff_config:
            min_period: 0s
            max_period: 0s
            max_retries: 0
        tls_enabled: false
        tls_cert_path: ""
        tls_key_path: ""
        tls_ca_path: ""
        tls_server_name: ""
        tls_insecure_skip_verify: false
    instance_interface_names: []
    address: ""
    port: 0
    downstream_url: ""
    max_retries: 2
    query_shards: 20
    search:
        concurrent_jobs: 50
        target_bytes_per_job: 10485760
        default_result_limit: 20
        max_result_limit: 0
        max_duration: 1h1m0s
        query_backend_after: 15m0s
        query_ingesters_until: 1h0m0s
compactor:
    ring:
        kvstore:
            store: ""
            prefix: collectors/
            consul:
                host: localhost:8500
                acl_token: ""
                http_client_timeout: 20s
                consistent_reads: false
                watch_rate_limit: 1
                watch_burst_size: 1
                cas_retry_delay: 1s
            etcd:
                endpoints: []
                dial_timeout: 10s
                max_retries: 10
                tls_enabled: false
                tls_cert_path: ""
                tls_key_path: ""
                tls_ca_path: ""
                tls_server_name: ""
                tls_insecure_skip_verify: false
                username: ""
                password: ""
            multi:
                primary: ""
                secondary: ""
                mirror_enabled: false
                mirror_timeout: 2s
        heartbeat_period: 5s
        heartbeat_timeout: 1m0s
        wait_stability_min_duration: 1m0s
        wait_stability_max_duration: 5m0s
        instance_id: grafana-tempo-0
        instance_interface_names:
            - eth0
            - en0
        instance_port: 0
        instance_addr: ""
        wait_active_instance_timeout: 10m0s
    compaction:
        chunk_size_bytes: 5242880
        flush_size_bytes: 20971520
        compaction_window: 1h0m0s
        max_compaction_objects: 6000000
        max_block_bytes: 107374182400
        block_retention: 336h0m0s
        compacted_block_retention: 24h0m0s
        retention_concurrency: 10
        iterator_buffer_size: 1000
        max_time_per_tenant: 5m0s
        compaction_cycle: 30s
    override_ring_key: compactor
ingester:
    lifecycler:
        ring:
            kvstore:
                store: inmemory
                prefix: collectors/
                consul:
                    host: localhost:8500
                    acl_token: ""
                    http_client_timeout: 20s
                    consistent_reads: false
                    watch_rate_limit: 1
                    watch_burst_size: 1
                    cas_retry_delay: 1s
                etcd:
                    endpoints: []
                    dial_timeout: 10s
                    max_retries: 10
                    tls_enabled: false
                    tls_cert_path: ""
                    tls_key_path: ""
                    tls_ca_path: ""
                    tls_server_name: ""
                    tls_insecure_skip_verify: false
                    username: ""
                    password: ""
                multi:
                    primary: ""
                    secondary: ""
                    mirror_enabled: false
                    mirror_timeout: 2s
            heartbeat_timeout: 5m0s
            replication_factor: 1
            zone_awareness_enabled: false
            excluded_zones: ""
        num_tokens: 128
        heartbeat_period: 5s
        heartbeat_timeout: 1m0s
        observe_period: 0s
        join_after: 0s
        min_ready_duration: 15s
        interface_names:
            - eth0
        final_sleep: 0s
        tokens_file_path: ""
        availability_zone: ""
        unregister_on_shutdown: true
        readiness_check_ring_health: true
        address: 127.0.0.1
        port: 0
        id: grafana-tempo-0
    concurrent_flushes: 16
    flush_check_period: 10s
    flush_op_timeout: 5m0s
    trace_idle_period: 10s
    max_block_duration: 1h0m0s
    max_block_bytes: 1073741824
    complete_block_timeout: 15m0s
    override_ring_key: ring
metrics_generator:
    ring:
        kvstore:
            store: inmemory
            prefix: collectors/
            consul:
                host: localhost:8500
                acl_token: ""
                http_client_timeout: 20s
                consistent_reads: false
                watch_rate_limit: 1
                watch_burst_size: 1
                cas_retry_delay: 1s
            etcd:
                endpoints: []
                dial_timeout: 10s
                max_retries: 10
                tls_enabled: false
                tls_cert_path: ""
                tls_key_path: ""
                tls_ca_path: ""
                tls_server_name: ""
                tls_insecure_skip_verify: false
                username: ""
                password: ""
            multi:
                primary: ""
                secondary: ""
                mirror_enabled: false
                mirror_timeout: 2s
        heartbeat_period: 5s
        heartbeat_timeout: 1m0s
        instance_id: grafana-tempo-0
        instance_interface_names:
            - eth0
            - en0
        instance_addr: 127.0.0.1
    processor:
        service_graphs:
            wait: 10s
            max_items: 10000
            workers: 10
            histogram_buckets:
                - 0.1
                - 0.2
                - 0.4
                - 0.8
                - 1.6
                - 3.2
                - 6.4
                - 12.8
            dimensions: []
        span_metrics:
            histogram_buckets:
                - 0.002
                - 0.004
                - 0.008
                - 0.016
                - 0.032
                - 0.064
                - 0.128
                - 0.256
                - 0.512
                - 1.024
                - 2.048
                - 4.096
                - 8.192
                - 16.384
            dimensions: []
    registry:
        collection_interval: 15s
        stale_duration: 15m0s
    storage:
        path: ""
        wal:
            wal_segment_size: 134217728
            wal_compression: false
            stripe_size: 16384
            truncate_frequency: 2h0m0s
            min_wal_time: 300000
            max_wal_time: 14400000
            no_lockfile: false
        remote_write_flush_deadline: 1m0s
storage:
    trace:
        pool:
            max_workers: 50
            queue_depth: 10000
        wal:
            path: /var/tempo/wal
            completedfilepath: /var/tempo/wal/completed
            blocksfilepath: /var/tempo/wal/blocks
            encoding: snappy
            search_encoding: none
            ingestion_time_range_slack: 2m0s
        block:
            index_downsample_bytes: 1048576
            index_page_size_bytes: 256000
            bloom_filter_false_positive: 0.01
            bloom_filter_shard_size_bytes: 102400
            version: v2
            encoding: zstd
            search_encoding: snappy
            search_page_size_bytes: 1048576
            row_group_size_bytes: 30000000
        search:
            chunk_size_bytes: 1000000
            prefetch_trace_count: 1000
            read_buffer_count: 8
            read_buffer_size_bytes: 4194304
            cache_control:
                footer: false
                column_index: false
                offset_index: false
        blocklist_poll: 5m0s
        blocklist_poll_concurrency: 50
        blocklist_poll_fallback: true
        blocklist_poll_tenant_index_builders: 2
        blocklist_poll_stale_tenant_index: 0s
        blocklist_poll_jitter_ms: 0
        backend: local
        local:
            path: /var/tempo/traces
        gcs:
            bucket_name: ""
            chunk_buffer_size: 10485760
            endpoint: ""
            hedge_requests_at: 0s
            hedge_requests_up_to: 2
            insecure: false
            object_cache_control: ""
            object_metadata: {}
        s3:
            bucket: ""
            endpoint: ""
            region: ""
            access_key: ""
            secret_key: ""
            insecure: false
            insecure_skip_verify: false
            part_size: 0
            hedge_requests_at: 0s
            hedge_requests_up_to: 2
            signature_v2: false
            forcepathstyle: false
            tags: {}
        azure:
            storage-account-name: ""
            storage-account-key: ""
            use-managed-identity: false
            user-assigned-id: ""
            container-name: ""
            endpoint-suffix: blob.core.windows.net
            max-buffers: 4
            buffer-size: 3145728
            hedge-requests-at: 0s
            hedge-requests-up-to: 2
        cache: ""
        cache_min_compaction_level: 0
        cache_max_block_age: 0s
        background_cache:
            writeback_goroutines: 10
            writeback_buffer: 10000
        memcached: null
        redis: null
overrides:
    ingestion_rate_strategy: local
    ingestion_rate_limit_bytes: 150000000
    ingestion_burst_size_bytes: 200000000
    search_tags_allow_list: null
    max_traces_per_user: 100000
    max_global_traces_per_user: 0
    max_search_bytes_per_trace: 50000
    metrics_generator_ring_size: 0
    metrics_generator_processors: null
    metrics_generator_max_active_series: 0
    metrics_generator_collection_interval: 0s
    metrics_generator_disable_collection: false
    metrics_generator_forwarder_queue_size: 0
    metrics_generator_forwarder_workers: 0
    metrics_generator_processor_service_graphs_histogram_buckets: []
    metrics_generator_processor_service_graphs_dimensions: []
    metrics_generator_processor_span_metrics_histogram_buckets: []
    metrics_generator_processor_span_metrics_dimensions: []
    block_retention: 0s
    max_bytes_per_tag_values_query: 5000000
    max_search_duration: 0s
    max_bytes_per_trace: 5000000
    per_tenant_override_config: /conf/overrides.yaml
    per_tenant_override_period: 10s
memberlist:
    node_name: ""
    randomize_node_name: true
    stream_timeout: 10s
    retransmit_factor: 2
    pull_push_interval: 30s
    gossip_interval: 1s
    gossip_nodes: 2
    gossip_to_dead_nodes_time: 30s
    dead_node_reclaim_time: 0s
    compression_enabled: false
    advertise_addr: ""
    advertise_port: 7946
    cluster_label: ""
    cluster_label_verification_disabled: false
    join_members: []
    min_join_backoff: 1s
    max_join_backoff: 1m0s
    max_join_retries: 10
    abort_if_cluster_join_fails: false
    rejoin_interval: 0s
    left_ingesters_timeout: 5m0s
    leave_timeout: 5s
    message_history_buffer_bytes: 0
    bind_addr: []
    bind_port: 7946
    packet_dial_timeout: 5s
    packet_write_timeout: 5s
    tls_enabled: false
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
    tls_server_name: ""
    tls_insecure_skip_verify: false
usage_report:
    reporting_enabled: true
    backoff:
        min_period: 100ms
        max_period: 10s
        max_retries: 0
joe-elliott commented 1 year ago

The code that counts max tag value bytes is not exact, but just attempts to cut off the returned tags roughly when the threshold is crossed. I would either lower max_bytes_per_tag_values_query or raise the grpc message sizes:

server:
    grpc_server_max_recv_msg_size: <some value in bytes>
    grpc_server_max_send_msg_size: <some value in bytes>
kostya9 commented 1 year ago

I see, thanks. I can suggest adding this to docs - that ResourceExhausted grpc error in this case could be due to tags values exceeding a certain size.