Open antikilahdjs opened 1 day ago
The error you're encountering, "keepalive ping failed to receive ACK within timeout," suggests that the gRPC keepalive settings might need adjustment. Thanos uses keepalive pings to ensure that connections remain active, and if these pings fail, it can lead to the errors you're seeing.
Here are some steps you can take to address this issue:
Adjust Keepalive Settings: You can modify the keepalive settings in your Thanos configuration. The default settings for keepalive pings in Thanos are:
Time: 10 seconds
Timeout: 5 seconds
You might want to increase these values to allow more time for the keepalive pings to be acknowledged. For example, you could set:
grpc:
  keepalive:
    time: 30s
    timeout: 15s
Check Network Stability: Ensure that your network is stable and that there are no intermittent connectivity issues between Thanos components and Memcached.
Review Memcached Configuration: The warning about failing to fetch items from Memcached could indicate a configuration issue. Consider adjusting max_async_buffer_size and max_get_multi_batch_size to handle more concurrent requests.
Resource Allocation: Verify that your Thanos Store and Query components have enough CPU and memory resources allocated to handle the load, especially when querying large time ranges.
Monitor and Benchmark: Use monitoring tools to observe the performance of your Thanos setup and run benchmarks to identify any bottlenecks.
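For the network stability check above, a minimal sketch of a repeated TCP connect probe may help separate network problems from Thanos configuration problems. It uses only the Python standard library; the function names `probe_tcp` and `probe_many` are hypothetical helpers introduced here, and a successful TCP connect only shows that the port is reachable, not that Memcached or the Store gRPC API is healthy.

```python
import socket
import time


def probe_tcp(host, port, timeout=2.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def probe_many(host, port, attempts=5, timeout=2.0, pause=0.0):
    """Run several probes and summarize failures and the slowest connect."""
    latencies = []
    failures = 0
    for i in range(attempts):
        ok, elapsed, _ = probe_tcp(host, port, timeout)
        if ok:
            latencies.append(elapsed)
        else:
            failures += 1
        # Optional pause between attempts to spread probes over time.
        if pause and i < attempts - 1:
            time.sleep(pause)
    return {
        "attempts": attempts,
        "failures": failures,
        "max_latency": max(latencies) if latencies else None,
    }
```

Run from a pod on the same network as the Store, a call such as `probe_many("10.233.31.160", 11211, attempts=20, pause=1.0)` would target the Memcached address that appears in the warning log, and `probe_many("10.233.117.207", 10901)` one of the Store addresses from the error. Intermittent failures or large `max_latency` values would point toward the network rather than the keepalive settings.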
These steps should help you address the keepalive timeout issue and improve the stability of your Thanos setup with Memcached and MinIO. If the problem persists, further investigation into network and resource configurations may be necessary.
Thanos with Memcached enabled plus MinIO as Long-term
Thanos, Prometheus and Golang version used:
Object Storage Provider: S3 MinIO
What happened: I have configured Thanos alongside Memcached, but I cannot fix an error that occurs when a query covers more than 2 days. I get the error below:
receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
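For reference, the MinTime and MaxTime values in this error appear to be Unix timestamps in milliseconds. A small sketch, assuming that interpretation, converts them to UTC dates so the time range reported for each store is easier to read. The helper name `ms_to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Values copied from the error message above.
min_time = ms_to_utc(1727308800000)
max_time = ms_to_utc(1730368800000)
span = max_time - min_time

print(min_time.isoformat())  # 2024-09-26T00:00:00+00:00
print(max_time.isoformat())  # 2024-10-31T10:00:00+00:00
print(span)                  # 35 days, 10:00:00
```

Under that assumption, both stores report a range of roughly 35 days, from 2024-09-26 to 2024-10-31 UTC.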
My Thanos Store:
My Thanos Frontend
My Prometheus:
What you expected to happen:
My Prometheus has 6h of retention, but if I try to query beyond that I get the error mentioned above.
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
Anything else we need to know:
ts=2024-08-22T04:15:02.506236929Z caller=memcached_client.go:438 level=warn name=index-cache msg="failed to fetch items from memcached" numKeys=1 firstKey=EP:01J5TQ7GTAK7JFP1SDHAZQABMB:NskVASoO0H1CJRIx74k3hIBPzIM6wCRkKvWOjc9V3Dg:dss err="write tcp 10.233.66.17:47668->10.233.31.160:11211: write: connection timed out"
Environment:
uname -a: 4.8
Could you please help me to understand what I did wrong?