thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

thanos-store is too slow for queries and I don't know where the time went #6945

Open Melody-zyy opened 10 months ago

Melody-zyy commented 10 months ago

Thanos, Prometheus and Golang version used: thanos-store version: v0.30.2

Object Storage Provider: Tencent Cloud COS (txyun cos)

What happened: I used thanos-store to query historical data, but it is too slow. I have collected many metrics, but I still don't know where the time went.

What you expected to happen: I want the data to be returned within 20 seconds, or at least to know where the time goes when I query historical data.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components: This log appears because the query hits the thanos-query timeout, which cancels the subsequent thanos-store requests to the memcached cache:

level=warn ts=2023-11-30T14:51:06.285175671Z caller=memcached_client.go:436 name=index-cache msg="failed to fetch items from memcached" numKeys=5 firstKey=P:01HFYJB6BXH9BD03Z6H29VXEXG:iWpYeQElI53MP3C4cUnj44BIr-dGk2AQ8I4-BjlebIg err="context canceled"

Anything else we need to know:

  1. thanos-store is split into 10 shards based on block ID (see the sketch after this list)
  2. thanos-compactor runs compaction and downsampling normally
  3. memcached is used for caching
  4. thanos-query runs with --store.response-timeout=40s
thanos-store config: (screenshot)
index-cache.yml: (screenshot)
bucket-cache.yml: (screenshot)
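
The screenshots themselves are not reproduced here, so for context, a memcached index cache configuration in Thanos typically looks roughly like the sketch below. The address and the tuning values are placeholders, not the values from the screenshots:

```yaml
# Hypothetical index-cache.yml (passed via --index-cache.config-file).
type: MEMCACHED
config:
  addresses: ["memcached-index.example.svc:11211"]  # placeholder endpoint
  timeout: 500ms
  max_idle_connections: 100
  max_async_concurrency: 20
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 100
  max_get_multi_batch_size: 0        # 0 means no batch size limit
  max_item_size: 1MiB
  dns_provider_update_interval: 10s
```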

yeya24 commented 10 months ago

There could be different reasons. Do you have tracing set up? Or, if you enable debug logging in Thanos Store, we can take a look at the "stats query processed" log line, which contains detailed query stats.

Melody-zyy commented 10 months ago

  1. How do I set up tracing?
  2. How do I enable debug logging?
  3. Do you need some metric data for reference? I can provide it!

Melody-zyy commented 10 months ago

I want to know whether the time is spent on pulling data or on processing data, or whether there is a problem with the memcached server. Are there any metrics that reflect this?

yeya24 commented 10 months ago

You can set up tracing by following our doc: https://thanos.io/tip/thanos/tracing.md/#tracing. It provides some examples.
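
For example, a minimal Jaeger tracing configuration passed to thanos store via --tracing.config-file could look like the sketch below; the agent address, service name, and the always-on sampler are illustrative choices for debugging, not settings from this thread:

```yaml
# Hypothetical tracing config; point it at your own Jaeger agent/collector.
type: JAEGER
config:
  service_name: thanos-store
  agent_host: jaeger-agent.example.svc   # placeholder address
  agent_port: 6831
  sampler_type: const                    # sample every request while investigating
  sampler_param: 1
```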

Debug logging can be enabled via --log.level=debug.

There are also metrics in the Store Gateway showing request latency, including the latency of requests to memcached.
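
For instance (these rules are not from the thread, and the exact metric names should be verified against the store's /metrics endpoint for this Thanos version), the overall Series latency, memcached latency, and object storage latency can be tracked with recording rules such as:

```yaml
groups:
  - name: thanos-store-latency  # hypothetical rule group
    rules:
      # p99 of the Store Gateway's "get all" phase of a Series call
      - record: job:thanos_bucket_store_series_get_all_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(thanos_bucket_store_series_get_all_duration_seconds_bucket[5m])))
      # p99 latency of memcached operations (e.g. getmulti, set)
      - record: job:thanos_memcached_operation_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, operation) (rate(thanos_memcached_operation_duration_seconds_bucket[5m])))
      # p99 latency of raw object storage operations (e.g. get, get_range)
      - record: job:thanos_objstore_bucket_operation_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, operation) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])))
```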

Melody-zyy commented 10 months ago

OK, thanks, I will try.

Melody-zyy commented 10 months ago

level=debug ts=2023-12-04T03:37:25.626095328Z caller=bucket.go:1293 msg="stats query processed" request="min_time:1700252244000 max_time:1700264301000 matchers:xxx max_resolution_window:3600000 aggregates:COUNTER " stats="&{blocksQueried:8 postingsTouched:14 PostingsTouchedSizeSum:179MiB365KiB230B postingsToFetch:0 postingsFetched:1 PostingsFetchedSizeSum:178MiB551KiB436B postingsFetchCount:1 PostingsFetchDurationSum:37.181431729s cachedPostingsCompressions:1 cachedPostingsCompressionErrors:0 CachedPostingsOriginalSizeSum:178MiB551KiB436B CachedPostingsCompressedSizeSum:2MiB964KiB33B CachedPostingsCompressionTimeSum:1.119731555s cachedPostingsDecompressions:13 cachedPostingsDecompressionErrors:0 CachedPostingsDecompressionTimeSum:13.56747ms seriesTouched:49793 SeriesTouchedSizeSum:6MiB279KiB8B seriesFetched:0 SeriesFetchedSizeSum:0B seriesFetchCount:0 SeriesFetchDurationSum:0s chunksTouched:4855 ChunksTouchedSizeSum:1MiB356KiB606B chunksFetched:4855 ChunksFetchedSizeSum:19MiB561KiB685B chunksFetchCount:5 ChunksFetchDurationSum:2.369927365s GetAllDuration:38.304761546s mergedSeriesCount:4855 mergedChunksCount:4855 MergeDuration:29.653014ms}" err=null

level=debug ts=2023-12-04T07:01:48.448042489Z caller=bucket.go:1293 msg="stats query processed" request="min_time:1699642800000 max_time:1699646400000 matchers:<xxx> matchers:<xxx > max_resolution_window:3600000 aggregates:COUNT aggregates:SUM " stats="&{blocksQueried:8 postingsTouched:24 PostingsTouchedSizeSum:381MiB116KiB432B postingsToFetch:0 postingsFetched:2 PostingsFetchedSizeSum:378MiB861KiB256B postingsFetchCount:2 PostingsFetchDurationSum:40.695843201s cachedPostingsCompressions:2 cachedPostingsCompressionErrors:0 CachedPostingsOriginalSizeSum:378MiB861KiB256B CachedPostingsCompressedSizeSum:5MiB747KiB422B CachedPostingsCompressionTimeSum:2m27.28031484s cachedPostingsDecompressions:22 cachedPostingsDecompressionErrors:0 CachedPostingsDecompressionTimeSum:461.760505ms seriesTouched:102 SeriesTouchedSizeSum:12KiB79B seriesFetched:0 SeriesFetchedSizeSum:0B seriesFetchCount:0 SeriesFetchDurationSum:0s chunksTouched:6 ChunksTouchedSizeSum:4KiB57B chunksFetched:6 ChunksFetchedSizeSum:36KiB317B chunksFetchCount:1 ChunksFetchDurationSum:48.101376ms GetAllDuration:1m34.625021851s mergedSeriesCount:0 mergedChunksCount:0 MergeDuration:0s}" err="rpc error: code = Unknown desc = failed to receive any data from 01HFZ77QH5HNNVBCW9R5WT8HDW: context canceled"

When the query is slow, I look at the debug log and find that a large part of the time is spent on fetching postings. Is this because the postings data is too large?