thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos Store Does Not Reflect S3 Storage Unavailability in Health Checks #7505

Open BooNny95 opened 1 week ago

BooNny95 commented 1 week ago

Hello Thanos Contributors,

I am reaching out to discuss a potential issue with the health check mechanism of Thanos Store in scenarios where the self-hosted S3 storage becomes unreachable. When S3 is down, Thanos Store instances still appear as "ready" and "healthy" and continue to be available as endpoints in the querier. This behavior could lead to inefficiencies and inaccuracies in data retrieval and system monitoring.

Issue Description: Even when S3 is unreachable, Thanos Store instances remain in a 'healthy' state, which can mislead the querier and other dependent components about their actual status.

Steps to Reproduce:

1. In a testing environment, modify the /etc/hosts file to redirect the S3 domain to a non-existent IP address, simulating an unreachable S3 service (a sketch of such an entry is shown below).
2. Observe that despite S3 being unreachable, the Thanos Store continues to send TCP SYN requests, which are met with TCP Reset packets from the dummy S3 server.
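
For reference, in our test the redirect amounted to an /etc/hosts entry along these lines (hostname and IP taken from the logs below; any host that refuses connections on port 443 will do):

192.168.12.9  s3storage.mydomain.net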

Observed Logs:

Jul 02 15:52:47 host-test thanos-store[1091426]: ts=2024-07-02T13:52:47.032807576Z caller=store.go:477 level=warn msg="syncing blocks failed" err="BaseFetcher: iter bucket: Get \"https://s3storage.mydomain.net/testbucket/?delimiter=%2F&encoding-type=url&list-type=2&prefix=\": dial tcp 192.168.12.9:443: connect: connection refused"

Health Status from Querier:

# curl 0:9092/api/v1/stores | jq
{
  "status": "success",
  "data": {
    "store": [
      {
        "name": "127.0.0.1:19093",
        "lastCheck": "2024-07-02T16:23:37.808834062+02:00",
        "lastError": null,
        "labelSets": [
          {
            "datasrc": "prometheus-test",
            "replica": "host-test"
          }
        ],
        "minTime": 1719915817808,
        "maxTime": 1719921600000
      }
    ]
  }
}
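
The Store's own HTTP probes can be checked directly as well (assuming the default HTTP port 10902); in our case they keep reporting success while S3 is unreachable, which is the "ready"/"healthy" state described above:

# curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:10902/-/healthy
# curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:10902/-/ready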

Suggested Enhancement: Implementing a more robust health check mechanism that includes the status of the backend storage (S3 in this case) could improve the system's overall resilience and accuracy. This might involve:

- Enhancing health checks to verify backend storage connectivity and reflect this in the health status.
- Potentially marking Thanos Store as 'unhealthy' when it cannot establish a connection to S3, thus preventing it from being queried until the issue is resolved.

This adjustment would ensure that the Thanos architecture remains reliable and that data consistency is maintained even when backend services are disrupted. A rough sketch of what such a probe could look like follows below.
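
To make the idea concrete, here is a hypothetical sketch, not existing Thanos code: it assumes a minio-go v7 client and uses BucketExists purely as a cheap connectivity probe, and the endpoint, bucket, credentials and the /-/ready-deep path are made-up placeholders.

package main

import (
	"context"
	"net/http"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// s3Ready returns nil when the bucket is reachable within the timeout,
// otherwise the underlying connection error.
func s3Ready(ctx context.Context, client *minio.Client, bucket string) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// BucketExists is a single cheap round trip to the S3 endpoint.
	_, err := client.BucketExists(ctx, bucket)
	return err
}

func main() {
	client, err := minio.New("s3storage.mydomain.net", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		panic(err)
	}

	// A "deep" readiness endpoint: fails with 503 while S3 is unreachable,
	// so load balancers / the querier could stop sending traffic here.
	http.HandleFunc("/-/ready-deep", func(w http.ResponseWriter, r *http.Request) {
		if err := s3Ready(r.Context(), client, "testbucket"); err != nil {
			http.Error(w, "object store unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	_ = http.ListenAndServe(":10912", nil)
}

Whether such a probe should gate the existing readiness endpoint or live behind a separate one is of course open for discussion.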

Thank you for considering this enhancement. I am looking forward to your feedback and any further discussion on improving Thanos Store's resilience and operational accuracy.

Best regards, BooNny95

harry671003 commented 1 week ago

> I am reaching out to discuss a potential issue with the health check mechanism of Thanos Store in scenarios where the self-hosted S3 storage becomes unreachable. When S3 is down, Thanos Store instances still appear as "ready" and "healthy" and continue to be available as endpoints in the querier. This behavior could lead to inefficiencies and inaccuracies in data retrieval and system monitoring.

Isn't this behavior expected? If S3 is completely unavailable, queries going to all store instances would fail anyway. How would adding a deep health check fix the inaccuracies in data retrieval?

Maybe you'd be interested in the partial response strategy, which, when enabled, lets queries succeed with partial results even when some stores return errors: https://thanos.io/v0.4/components/query/#partial-response-strategy
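
For reference (flag and parameter names as I remember them from the docs, so please verify against your Thanos version), it can be enabled globally on the querier or requested per query:

thanos query --query.partial-response ...

curl 'http://127.0.0.1:9092/api/v1/query?query=up&partial_response=true'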

BooNny95 commented 1 week ago

Absolutely, the behavior you've described is indeed expected under the current configuration when S3 becomes unavailable. We have already implemented the partial response strategy which allows us to receive results from available sources like sidecar nodes that store local metrics. This setup is crucial for maintaining partial functionality in the face of S3 outages.

However, the core issue we're encountering stems from the way the Thanos Store nodes react when our self-hosted S3 server goes offline. Despite other components (like sidecars) still being operational and capable of providing partial data, the Store nodes continuously attempt to reconnect to the unavailable S3 service. This results not only in failed queries to the Store but also in a significant flood of TCP SYN and RST packets. This excessive network traffic has adverse effects on our network infrastructure, impacting the performance of other network devices.

The concern is not just about the retrieval of inaccurate or partial data (which the partial response strategy handles well), but about the broader network impact caused by the current retry mechanism employed by the Store nodes when faced with an S3 outage. We believe that enhancing the health check mechanism to more accurately reflect the state of backend storage connectivity could mitigate unnecessary network load and maintain network stability during such incidents.

Thank you for engaging in this discussion. I believe addressing this network issue will complement the partial response strategy and lead to a more robust and reliable system overall.

---- Update ----

Basically it is caused by the minio-go lib: https://github.com/minio/minio-go/blob/master/retry.go#L29-L30

// MaxRetry is the maximum number of retries before stopping.
var MaxRetry = 10
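
Since MaxRetry is an exported package-level variable, a patched or custom build could lower it before any client is created; the following is only a sketch of that idea under that assumption, not a supported Thanos configuration option:

package main

import (
	"fmt"

	"github.com/minio/minio-go/v7"
)

func main() {
	// Lower the package-level retry cap (default 10) before any S3 client is
	// created, to limit the SYN/RST flood against an unreachable endpoint.
	minio.MaxRetry = 2
	fmt.Println("minio-go MaxRetry set to", minio.MaxRetry)
}

A configuration knob in Thanos itself that exposes this behavior would of course be the cleaner fix, which is essentially what this issue is asking for.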