thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

ACK timeout using Thanos #7874

Open antikilahdjs opened 1 day ago

antikilahdjs commented 1 day ago

Thanos with Memcached enabled plus MinIO as long-term storage

Thanos, Prometheus and Golang version used:

Object Storage Provider: S3 (MinIO)

What happened: I have configured Thanos alongside Memcached, but queries that span more than 2 days fail and I have not been able to fix it. I get the error below:

receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
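
The MinTime and MaxTime values in the error are Unix timestamps in milliseconds. A minimal Python sketch for decoding them into readable UTC times follows; the helper name `ms_to_utc` is hypothetical and not part of Thanos.

```python
from datetime import datetime, timezone

def ms_to_utc(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Values copied from the error message above.
min_time = ms_to_utc(1727308800000)
max_time = ms_to_utc(1730368800000)

print(min_time.isoformat())  # 2024-09-26T00:00:00+00:00
print(max_time.isoformat())  # 2024-10-31T10:00:00+00:00
print(max_time - min_time)   # 35 days, 10:00:00
```

Read this way, the endpoints in the error appear to advertise roughly 35 days of data, so a query spanning more than 2 days is presumably reaching them for older blocks.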

My Thanos Store:

args:
            - store
            - '--log.level=info'
            - '--log.format=logfmt'
            - '--data-dir=/var/thanos/store'
            - '--grpc-address=0.0.0.0:10901'
            - '--http-address=0.0.0.0:10902'
            - '--objstore.config=$(OBJSTORE_CONFIG)'
            - '--ignore-deletion-marks-delay=24h'
            - '--block-sync-concurrency=120'
            - '--sync-block-duration=60m'
            - '--index-cache-size=4096MB'
            - '--chunk-pool-size=4GB'
            - '--store.grpc.series-max-concurrency=300'
            - '--consistency-delay=30m'
            - |-
              --index-cache.config="config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "60s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "max_item_size": 0
                "timeout": "180s"
              "type": "MEMCACHED"
            - |-
              --store.caching-bucket.config="blocks_iter_ttl": "720h"
              "chunk_object_attrs_ttl": "720h"
              "chunk_subrange_size": 128000
              "chunk_subrange_ttl": "720h"
              "config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "60s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "max_item_size": 0
                "timeout": "180s"
              "max_chunks_get_range_requests": 3
              "metafile_content_ttl": "720h"
              "metafile_doesnt_exist_ttl": "1h"
              "metafile_exists_ttl": "720h"
              "metafile_max_size": "4MiB"
              "type": "MEMCACHED"
            - |-
              --tracing.config="config":
                "sampler_param": 2
                "sampler_type": "ratelimiting"
                "service_name": "thanos-store"
              "type": "JAEGER"

My Thanos Query Frontend:

args:
            - query-frontend
            - '--enable-auto-gomemlimit'
            - '--log.level=info'
            - '--log.format=logfmt'
            - '--query-frontend.compress-responses'
            - '--http-address=0.0.0.0:9090'
            - >-
              --query-frontend.downstream-url=http://thanos-query.thanos.svc.cluster.local.:9090
            - '--query-range.split-interval=24h'
            - '--labels.split-interval=12h'
            - '--query-range.max-retries-per-request=100'
            - '--labels.max-retries-per-request=25'
            - '--query-frontend.log-queries-longer-than=0'
            - '--query-range.max-query-parallelism=120'
            - '--query-frontend.vertical-shards=0'
            - '--cache-compression-type='
            - '--query-frontend.downstream-tripper-config={"response_header_timeout": "5m", "max_idle_conns_per_host": 100}'
            - |-
              --query-range.response-cache-config="config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "30s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "timeout": "180s"
                "expiration": "720h"
              "type": "MEMCACHED"
            - |-
              --labels.response-cache-config="config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "30s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "timeout": "180s"
                "expiration": "720h"
              "type": "MEMCACHED"
            - |-
              --tracing.config="config":
                "sampler_param": 2
                "sampler_type": "ratelimiting"
                "service_name": "thanos-query-frontend"
              "type": "JAEGER"

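With `--query-range.split-interval=24h`, the query frontend breaks a long range query into smaller sub-queries before sending them downstream. The sketch below illustrates the idea in Python. It assumes sub-ranges are cut at multiples of the interval since the Unix epoch; the real implementation may differ in details such as step alignment, and `split_by_interval` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def split_by_interval(start: datetime, end: datetime, interval: timedelta):
    """Split [start, end) into sub-ranges cut at epoch-aligned multiples of interval.

    Assumption: boundaries are aligned to the Unix epoch; this is a model,
    not the query frontend's actual code.
    """
    ranges = []
    cursor = start
    while cursor < end:
        # Next boundary that is a whole multiple of `interval` after `cursor`.
        next_boundary = EPOCH + ((cursor - EPOCH) // interval + 1) * interval
        chunk_end = min(next_boundary, end)
        ranges.append((cursor, chunk_end))
        cursor = chunk_end
    return ranges

start = datetime(2024, 10, 28, 6, tzinfo=timezone.utc)
end = datetime(2024, 10, 31, 10, tzinfo=timezone.utc)
for lo, hi in split_by_interval(start, end, timedelta(hours=24)):
    print(lo.isoformat(), "->", hi.isoformat())
```

Under this model, a query of a little over 3 days becomes 4 sub-queries. With `--query-range.max-query-parallelism=120` they could all be issued at once, so a long range can put many concurrent Series requests on the stores.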
My Prometheus:

containers:
    - args:
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--storage.tsdb.retention.time=12h'
        - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
        - '--storage.tsdb.path=/prometheus'
        - '--web.enable-lifecycle'
        - '--web.enable-admin-api'
        - '--web.route-prefix=/'
        - '--web.config.file=/etc/prometheus/web_config/web-config.yaml'
        - '--storage.tsdb.max-block-duration=2h'
        - '--storage.tsdb.min-block-duration=2h'
        - '--web.max-connections=8096'
        - '--query.max-concurrency=60'
      image: 'prom/prometheus:v2.49.1'

What you expected to happen:

My Prometheus has 6h of retention, but if I query a longer range than this I get the error mentioned above.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout

Anything else we need to know:

ts=2024-08-22T04:15:02.506236929Z caller=memcached_client.go:438 level=warn name=index-cache msg="failed to fetch items from memcached" numKeys=1 firstKey=EP:01J5TQ7GTAK7JFP1SDHAZQABMB:NskVASoO0H1CJRIx74k3hIBPzIM6wCRkKvWOjc9V3Dg:dss err="write tcp 10.233.66.17:47668->10.233.31.160:11211: write: connection timed out"
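
The warning above is in logfmt (`key=value` pairs, with quoted values where needed). A small Python sketch for pulling out the interesting fields when scanning many such lines follows; `parse_logfmt` is a hypothetical helper, and the sample line is abridged (the `firstKey` field is omitted).

```python
import shlex

def parse_logfmt(line: str) -> dict:
    """Parse a simple logfmt line into a dict.

    shlex handles the double-quoted values; tokens without '=' are skipped.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

# Abridged copy of the warning above (firstKey omitted for brevity).
line = (
    'ts=2024-08-22T04:15:02.506236929Z caller=memcached_client.go:438 level=warn '
    'name=index-cache msg="failed to fetch items from memcached" numKeys=1 '
    'err="write tcp 10.233.66.17:47668->10.233.31.160:11211: write: connection timed out"'
)

fields = parse_logfmt(line)
print(fields["name"])  # index-cache
print(fields["err"])
```

Here the `err` field reports a timed-out TCP write to port 11211, which suggests the problem is in reaching Memcached at the network level rather than in the cache contents.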

Environment:


Could you please help me to understand what I did wrong?

dosubot[bot] commented 1 day ago

The error you're encountering, "keepalive ping failed to receive ACK within timeout," suggests that the gRPC keepalive settings might need adjustment. Thanos uses keepalive pings to ensure that connections remain active, and if these pings fail, it can lead to the errors you're seeing.

Here are some steps you can take to address this issue:

  1. Adjust Keepalive Settings: You can modify the keepalive settings in your Thanos configuration. The default settings for keepalive pings in Thanos are:

    • Time: 10 seconds
    • Timeout: 5 seconds

    You might want to increase these values to allow more time for the keepalive pings to be acknowledged. For example, you could set:

    grpc:
     keepalive:
       time: 30s
       timeout: 15s

  2. Check Network Stability: Ensure that your network is stable and that there are no intermittent connectivity issues between Thanos components and Memcached.

  3. Review Memcached Configuration: The warning about failing to fetch items from Memcached could indicate a configuration issue. Consider the following adjustments:

    • Increase max_async_buffer_size and max_get_multi_batch_size to handle more concurrent requests.
    • Ensure that the Memcached instance has sufficient resources and is not overloaded.

  4. Resource Allocation: Verify that your Thanos Store and Query components have enough CPU and memory resources allocated to handle the load, especially when querying large time ranges.

  5. Monitor and Benchmark: Use monitoring tools to observe the performance of your Thanos setup and run benchmarks to identify any bottlenecks.
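
To make the keepalive values in step 1 concrete, here is a simplified Python model of what "Time" and "Timeout" mean. It is only a sketch of the general gRPC keepalive idea, not Thanos or gRPC code: it assumes a ping is sent every `time_s` seconds on an otherwise idle connection and that the connection is closed if an ACK takes longer than `timeout_s`. The function name and the sample delays are hypothetical, and whether these values can be changed depends on the Thanos version and its flags.

```python
from typing import Optional, Sequence

def first_keepalive_failure(
    ack_delays_s: Sequence[Optional[float]],
    time_s: float = 10.0,
    timeout_s: float = 5.0,
) -> Optional[float]:
    """Return the second at which the connection would be closed, or None.

    ack_delays_s[i] is how long the ACK for the i-th ping takes;
    None means the ACK never arrives. Simplified: pings go out at
    fixed multiples of time_s.
    """
    for i, delay in enumerate(ack_delays_s):
        sent_at = (i + 1) * time_s
        if delay is None or delay > timeout_s:
            # No ACK within the timeout: the client gives up on the connection.
            return sent_at + timeout_s
    return None

# Hypothetical ACK delays: the third ping is answered after 7 seconds.
delays = [0.1, 0.2, 7.0]
print(first_keepalive_failure(delays))                              # 35.0
print(first_keepalive_failure(delays, time_s=30.0, timeout_s=15.0))  # None
```

In this model a peer that answers slowly, for example because it or the network is saturated during a large query, trips the 5 second timeout, while the longer values tolerate the same delay.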

These steps should help you address the keepalive timeout issue and improve the stability of your Thanos setup with Memcached and MinIO. If the problem persists, further investigation into network and resource configurations may be necessary.
