thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Feature Request: hedged request between the external cache and the object storage #6712

Open damnever opened 1 year ago

damnever commented 1 year ago

Is your proposal related to a problem?

Long-tail requests between the store gateway and the external cache service are sometimes inevitable. Lowering the timeouts between the store gateway and the cache service is not a proper way to address this problem.

Describe the solution you'd like

If accessing the external cache service takes too long, issue a hedged request to the object storage instead, since object storage services have reasonable average latency nowadays.
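For illustration, a minimal sketch of the behaviour being proposed, assuming a hypothetical `fetchFn` signature and a fixed `hedgeDelay`; this is not Thanos's actual store-gateway or objstore client code:

```go
// Sketch only: fetchFn and hedgeDelay are hypothetical placeholders,
// not Thanos APIs.
package sketch

import (
	"context"
	"time"
)

// fetchFn stands in for "get these bytes from the cache" or
// "get these bytes from object storage".
type fetchFn func(ctx context.Context, key string) ([]byte, error)

type result struct {
	data []byte
	err  error
}

// hedgedGet asks the cache first; if the cache has not answered within
// hedgeDelay (or fails outright), it also asks object storage and returns
// whichever usable answer arrives first, cancelling the loser.
func hedgedGet(ctx context.Context, key string, fromCache, fromObjStore fetchFn, hedgeDelay time.Duration) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make(chan result, 2)
	inflight := 1
	go func() {
		data, err := fromCache(ctx, key)
		results <- result{data, err}
	}()

	hedged := false
	startObjStore := func() {
		if hedged {
			return
		}
		hedged = true
		inflight++
		go func() {
			data, err := fromObjStore(ctx, key)
			results <- result{data, err}
		}()
	}

	timer := time.NewTimer(hedgeDelay)
	defer timer.Stop()

	var lastErr error
	for inflight > 0 {
		select {
		case <-timer.C:
			startObjStore() // Cache is taking too long: hedge to object storage.
		case r := <-results:
			inflight--
			if r.err == nil {
				return r.data, nil
			}
			lastErr = r.err
			startObjStore() // Cache failed fast: go to object storage right away.
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```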

Describe alternatives you've considered

Additional context

GiedriusS commented 1 year ago

I had a similar idea. We could use the https://github.com/cristalhq/hedgedhttp HTTP client to implement this. Even better, we could estimate the 90th percentile of request durations and automatically send a hedged request once a request exceeds it. T-Digest seems like a good option for estimating the percentiles. Ideally, we would avoid having to manually specify a fixed threshold after which another request is sent.
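A minimal sketch of that idea, with the caveats that `quantileEstimator` is a placeholder to be backed by a t-digest implementation, `newHedgedCacheClient` is a hypothetical helper, and the `hedgedhttp.NewRoundTripper(delay, upto, rt)` signature is written from memory and should be checked against the library:

```go
// Sketch only: quantileEstimator and newHedgedCacheClient are hypothetical;
// hedgedhttp.NewRoundTripper is assumed to be the library's fixed-delay
// constructor (verify against the current API).
package sketch

import (
	"net/http"
	"time"

	"github.com/cristalhq/hedgedhttp"
)

// quantileEstimator is a placeholder that a t-digest would back.
// Implementations must be safe for concurrent use.
type quantileEstimator interface {
	Add(seconds float64)
	Quantile(q float64) float64
}

// latencyRecordingTransport feeds every observed request duration into the estimator.
type latencyRecordingTransport struct {
	next   http.RoundTripper
	digest quantileEstimator
}

func (t *latencyRecordingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)
	t.digest.Add(time.Since(start).Seconds())
	return resp, err
}

// newHedgedCacheClient builds an HTTP client that sends one extra (hedged)
// request if the first one is slower than the estimator's current p90.
// With this constructor the delay is fixed at build time; tracking the
// digest continuously needs the dynamic API discussed further down.
func newHedgedCacheClient(digest quantileEstimator) (*http.Client, error) {
	base := &latencyRecordingTransport{next: http.DefaultTransport, digest: digest}

	delay := time.Duration(digest.Quantile(0.9) * float64(time.Second))
	if delay <= 0 {
		delay = 100 * time.Millisecond // Fallback until enough samples exist.
	}

	rt, err := hedgedhttp.NewRoundTripper(delay, 2, base)
	if err != nil {
		return nil, err
	}
	return &http.Client{Transport: rt}, nil
}
```

One design wrinkle with this layering: the hedged (second) requests are also recorded in the digest, which can skew the percentile; a real implementation might want to record only first attempts.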

Vanshikav123 commented 11 months ago

Hello @GiedriusS, can I work on this?

GiedriusS commented 11 months ago

@Vanshikav123 sure. With https://github.com/cristalhq/hedgedhttp/pull/52, that client now supports dynamic thresholds/durations, so it shouldn't be too hard to implement with t-digest :thinking:
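Roughly, the wiring could look like the sketch below. The `Config`/`Upto`/`Delay`/`Next`/`New` names are assumed from memory of what that PR adds and must be verified against the library's current API; `quantileEstimator` is the same placeholder as in the earlier sketch.

```go
// Sketch only: the hedgedhttp Config/Next shape is assumed from PR 52 and
// should be verified; quantileEstimator is a hypothetical t-digest-backed type.
package sketch

import (
	"net/http"
	"time"

	"github.com/cristalhq/hedgedhttp"
)

type quantileEstimator interface {
	Add(seconds float64)
	Quantile(q float64) float64
}

// newDynamicHedgedClient re-evaluates the hedging delay on every request,
// so the threshold follows the observed p90 instead of a hardcoded value.
func newDynamicHedgedClient(digest quantileEstimator, transport http.RoundTripper) (*hedgedhttp.Client, error) {
	cfg := hedgedhttp.Config{
		Transport: transport,
		Upto:      2, // Original request plus at most one hedged copy.
		// Next (assumed name) is consulted per request and lets us derive
		// the delay from the live digest.
		Next: func() (upto int, delay time.Duration) {
			p90 := time.Duration(digest.Quantile(0.9) * float64(time.Second))
			if p90 <= 0 {
				p90 = 100 * time.Millisecond // Cold start: no samples yet.
			}
			return 2, p90
		},
	}
	return hedgedhttp.New(cfg)
}
```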

Vanshikav123 commented 11 months ago

@GiedriusS it would be a great help if you could provide me with some references for this issue.

rahulbansal3005 commented 2 months ago

Hi @GiedriusS @damnever, I am interested in working on this issue in LFX term 3.

Zyyeric commented 1 month ago

Hi @GiedriusS! I am very interested in working on this issue through LFX. Just wondering, do I need to submit a proposal on the implementation?

GiedriusS commented 1 month ago

Please submit everything through the LFX website 😊

aakashbansode2310 commented 1 month ago

Hello @GiedriusS @saswatamcode, I hope this message finds you well. My name is Aakash, an undergraduate at IIT Bombay, and I am excited to contribute to the implementation of hedged requests for reducing tail latency in Thanos. I'm eager to help enhance the performance and reliability of Thanos and would greatly appreciate your guidance. I'm looking forward to collaborating and making this improvement together!

mani1911 commented 1 month ago

I am really interested in contributing to Thanos. Are there any pre-tests that I can work on? @damnever @GiedriusS Should I submit my proposal as a cover letter (LFX term 3)?

saswatamcode commented 1 month ago

Yes, please submit everything using the LFX website 🙂

mani1911 commented 1 month ago

> Yes, please submit everything using the LFX website 🙂

Is there any pre-task that could help me understand Thanos better? I have read up on how Thanos works.

Zyyeric commented 1 month ago

@GiedriusS @saswatamcode I am a bit confused about how https://github.com/cristalhq/hedgedhttp would be able to achieve this task. The hedged HTTP client in that implementation sends its requests to the same destination, while in this use case the first and second requests would go to different services: the external cache service and the object storage, respectively. Would something like a timeout-monitoring mechanism that starts with the first request and then sends the second request using the same HTTP client if latency > t-digest.Quantile(90) make sense?

GiedriusS commented 1 month ago

Yeah, sorry for the confusion 🤦 hedging between two different systems doesn't make sense. Cache operations are supposed to be ultra fast. I believe the original issue is that with some key/value stores like memcached, one is always forced to download the same data, so if the cached data is big, it takes a long time. This could be solved by having a two-layered cache. We use client-side caching in Redis to solve this problem and it works well. With it, hot items don't need to be re-downloaded constantly because they are kept in memory. I will edit the title/description once I have some time, unless someone disagrees.
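A minimal sketch of the two-layer idea: a small in-process layer in front of the remote cache so hot items are served from memory instead of being re-downloaded on every request. The `remoteCache` interface and the plain unbounded map are simplifications for illustration, not Thanos's actual `cache.Cache` API; Redis client-side caching achieves a similar effect inside the Redis client itself.

```go
// Sketch only: remoteCache and twoLayerCache are hypothetical; a real
// implementation would bound the local layer (LRU + TTL) and handle Store.
package sketch

import (
	"context"
	"sync"
)

// remoteCache is a stand-in for a memcached/Redis-backed cache client.
type remoteCache interface {
	Fetch(ctx context.Context, keys []string) map[string][]byte
}

// twoLayerCache keeps recently fetched items in memory and only asks the
// remote cache for the keys it does not have locally.
type twoLayerCache struct {
	mu     sync.RWMutex
	local  map[string][]byte // Unbounded here; use an LRU with TTLs in practice.
	remote remoteCache
}

func newTwoLayerCache(remote remoteCache) *twoLayerCache {
	return &twoLayerCache{local: map[string][]byte{}, remote: remote}
}

func (c *twoLayerCache) Fetch(ctx context.Context, keys []string) map[string][]byte {
	hits := make(map[string][]byte, len(keys))
	var misses []string

	c.mu.RLock()
	for _, k := range keys {
		if v, ok := c.local[k]; ok {
			hits[k] = v
		} else {
			misses = append(misses, k)
		}
	}
	c.mu.RUnlock()

	if len(misses) == 0 {
		return hits
	}

	// Only the cold keys go over the network; hot items never leave memory.
	remoteHits := c.remote.Fetch(ctx, misses)

	c.mu.Lock()
	for k, v := range remoteHits {
		c.local[k] = v
		hits[k] = v
	}
	c.mu.Unlock()

	return hits
}
```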

And yes, I imagine it working something like that. That is how the hedged HTTP client works: it sends another request if some timeout is reached. We could use the t-digest library to avoid the guesswork of manually setting the latency after which another request is sent.

milinddethe15 commented 1 month ago

@GiedriusS If you have a moment, could you clarify this for me?

Do object storage providers internally balance query requests among replicas? If not, do we need to make Thanos do that for hedged requests? https://cloud-native.slack.com/archives/CK5RSSC10/p1723450096247419?thread_ts=1723358359.204139&cid=CK5RSSC10

yeya24 commented 5 days ago

@GiedriusS Is this issue still valid? From your comment it looks like we could still have some sort of hedging, just not using the hedgedhttp library?