thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

store: i/o timeouts with new memcached config #1979

Open kamilhristov opened 4 years ago

kamilhristov commented 4 years ago

👋 Hello again and thanks for the quick response in https://github.com/thanos-io/thanos/issues/1974

Thanos version used: latest master with https://github.com/thanos-io/thanos/pull/1975

Object Storage Provider: AWS S3

What happened: I am testing the new memcached config for Thanos Store and I am seeing a lot of timeouts. The timeouts only go away when I raise the timeout to 10s 🤕

Memcached servers look OK and the number of connections doesn't exceed max_idle_connections.

Thanos Store process and Memcached servers are in the same AZ.

I tried the default config and also tweaked most of the parameters but to no avail.

Any recommendations?

Arguments:

thanos store --index-cache.config-file=/app/cache-config.yaml --chunk-pool-size=6GB --objstore.config-file=/app/objstore-config.yaml

Config:

type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s

Logs:

level=warn ts=2020-01-10T11:57:31.215656814Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35492->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.215238251Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35490->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.076957889Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35496->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.831121025Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:34270->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.830995908Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50162->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754847072Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50160->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754715844Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:46124->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148112488Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34404->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148818833Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:47196->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.108056739Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:48012->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.105953129Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34276->10.42.8.43:11211: i/o timeout"
bwplotka commented 4 years ago

cc @pracucci @GiedriusS (:

pracucci commented 4 years ago

Thanks @kamilhristov for the report. I can see from your config that you're using AWS ElastiCache: which instance type are you using for ElastiCache?

pracucci commented 4 years ago

Could you also show the following metrics graphed?

Random thoughts:

kamilhristov commented 4 years ago

I am using the t3.medium instance type. It is a small one, but I thought it should be sufficient since the in-memory index cache size was 1GB.

Here's a screenshot of the graphs:

[screenshot]

I can also test with m5.large or similar. Also happy to provide more info if needed.

pracucci commented 4 years ago

So, from your graphs I can see the 90th percentile latency is good, while the 100th is not. In my experience with AWS, performance on t2/t3 instances is unpredictable, so as a next step I would suggest trying an m5 before continuing the root cause analysis. Could you give it a try, please?

kamilhristov commented 4 years ago

Agree that t2/t3 could cause strange issues :)

Just tried with m5.large, and there are still timeouts for both get and set commands.

Here is a screenshot with the new instance type.

[screenshot]
pracucci commented 4 years ago
  1. Thanks for trying the m5.
  2. Could you show the same charts but with irate() instead of rate()?
  3. Could you try increasing the max idle connections setting and setting max_get_multi_batch_size: 1000? (A sketch of the adjusted config follows below.)
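For illustration, a sketch of the reporter's index-cache config with those two changes applied (addresses redacted as in the original report; the remaining keys are unchanged):

type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 1000        # raised from 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 1000    # batched GetMulti calls instead of 0 (unbatched)
  dns_provider_update_interval: 10s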
kamilhristov commented 4 years ago

I've increased max idle connections to 1k and max_get_multi_batch_size to 1k.

Now I am getting:

level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

Along with:

level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

Current connections graph. The drop is after restarting Thanos Store with new parameters.

[screenshot]

And new screenshot with irates 📈

[screenshot]
pracucci commented 4 years ago

level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

This can easily be fixed by increasing max_async_buffer_size.
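For example, a minimal hedged sketch (the address is a placeholder; the value 1000000 matches the suggestion a couple of comments below):

type: MEMCACHED
config:
  addresses:
    - "<memcached-host>:11211"   # placeholder
  # a larger queue for asynchronous cache writes, so bursts of set operations
  # are not rejected with "the async buffer is full"
  max_async_buffer_size: 1000000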

pracucci commented 4 years ago

level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

I still have no clue about this. Do the memcached metrics show anything that catches your attention? Among the metrics directly exported by memcached you should see the items fetched (we batch reads on the Thanos side): what's the rate there?

pracucci commented 4 years ago

Another hypothesis is that the set operations are saturating the fetches. Just for testing, could you try with the following settings?

max_async_concurrency: 10
max_async_buffer_size: 1000000
pracucci commented 4 years ago

I've also asked about the memcached instance type. Could you also confirm that Thanos Store is not running on a t2/t3 instance, please? I don't think it is, but just to confirm.

pracucci commented 4 years ago

I've run some benchmarks in our cluster, using two tools:

  1. memtier_benchmark by RedisLabs
  2. bench: a quick and dirty benchmark tool based on the Thanos MemcachedClient (code here)

Both benchmarks were run against the same memcached server (a single instance), running in the same Kubernetes cluster as the benchmarking tool.

The results

$ memtier_benchmark --server=10.32.21.9 --port=11211 --protocol=memcache_text --requests=10000 --clients=200 --ratio=1:1 --data-size-range=100-200 --key-pattern=S:S --multi-key-get=100

4         Threads
200       Connections per thread
10000     Requests per client

ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        77044.82          ---          ---      5.23000     13995.64
Gets        77044.82     77044.82         0.00      5.22700     15107.53
Waits           0.00          ---          ---      0.00000          ---
Totals     154089.65     77044.82         0.00      5.22900     29103.17

Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
[REDACTED]
SET      56.000       100.00
---
[REDACTED]
GET      56.000       100.00
---
$ ./bench
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 5 ms max: 76 ms
GETMULTI ops: 2000000 ops/s: 36058.11037331169
Total time: 55.466023574s

Comparison on GetMulti()

The performance of the Thanos client doesn't look bad to me. It's about 50% of the raw performance, which looks "reasonable" to me (it shows there's room for improvement, but that's not bad compared to a C++ tool designed to squeeze out performance with no business logic around it).

What's next

Other than the latest comments above:

kamilhristov commented 4 years ago

Sorry for the late response and thanks for spending time on this.

  1. After setting max_async_concurrency, max_async_buffer_size, and all the other recommended settings, I noticed that there are no failing write commands; only reads are failing.

  2. Thanos Store is running on r5.large; not a huge instance, and it was fully utilized. Just in case, I bumped it to xlarge; now the CPU usage is around 50%, but there are still timeouts.

  3. Benchmark results are not as good as yours but still seem OK for our scale.

r5.large

4         Threads
200       Connections per thread
10000     Requests per client

ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        28019.71          ---          ---     14.19600      5089.95
Gets        28019.71     28019.71         0.00     14.19200      5494.32
Waits           0.00          ---          ---      0.00000          ---
Totals      56039.43     28019.71         0.00     14.19400     10584.26

Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
SET     130.000       100.00
---
GET     140.000       100.00

Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 18 ms max: 190 ms
GETMULTI ops: 2000000 ops/s: 10675.525173500942
Total time: 3m7.344413272s

Then with r5.xlarge max latency is similar but we get more throughput.

Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 9 ms max: 219 ms
GETMULTI ops: 2000000 ops/s: 20164.66655510673
Total time: 1m39.183390637s

CloudWatch is showing ~100,000,000 GET commands per minute when executing the benchmark, while Thanos Store generates ~25,000/min when hitting a dashboard.

It might be something in our infrastructure that is causing this. I will investigate more on Monday but will most likely stick with the in-memory cache for now; it seems like the better option for our scale. Perhaps when our Thanos data grows, it will make more sense to offload the index cache to an external service.

No need to invest more time in this, as it doesn't seem to be caused by Thanos Store itself. Thanks again for helping 🙇

pracucci commented 4 years ago

Thanks for all your help. I will continue to spend a bit more time on this myself, and I will get back to you if I find out anything.

but will most likely stick with the in-memory cache for now

Agreed. The in-memory cache is the best option for the majority of use cases.

GiedriusS commented 4 years ago

I have played around with the new memcached support, and my anecdotal evidence says that this happens when memcached cannot keep up because not enough memory has been allocated to it and/or the timeouts are too low. Increasing the memory size with -m helps, as does enabling large pages support with -L. I will go ahead and release 0.10.0, but we could definitely improve the documentation here with recommendations on how many resources are needed and what values ought to be in the index cache configuration. Please let us know if you find a stable way to reproduce this.

kamilhristov commented 4 years ago

Could this be caused by the size of the cached items? When max_item_size is 10MB, I am seeing "object too large" errors in Thanos Store:

level=warn ts=2020-01-13T08:23:13.964559815Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="memcache: unexpected response line from \"set\": \"SERVER_ERROR object too large for cache\\r\\n\""

I was able to reproduce the timeouts when executing the benchmark script with a larger data size range. Keep in mind that this doesn't even hit that limit, so the actual Thanos Store items are even bigger.

Clients: 200
Reqs / client: 1000
Keys / req: 100
Data size range: 150000 - 300000
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61243->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61161->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61409->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60915->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61263->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61367->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61253->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61347->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61185->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60967->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61551->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61449->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61099->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61419->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61631->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61239->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61079->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61501->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61531->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61007->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61293->10.42.8.159:11211: i/o timeout"
stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

kuberkaul commented 4 years ago

Facing the same issue; is there any documentation or work done on this, @pracucci? I partitioned my bucket by time into 3 stores, each using an r5.large memcached cluster (3 nodes) [I have put the memcached endpoint and all node endpoints in the store config], and I'm facing a lot of timeout and async errors.

kakkoyun commented 4 years ago

@kuberkaul Could you check the tips in the following conversation: https://cloud-native.slack.com/archives/CL25937SP/p1585919142164500? You may find them useful.

Although I haven't tested it yet, I'm working on a related configuration in https://github.com/thanos-io/kube-thanos/pull/106.

If you still think this is a valid bug, we can reopen the issue.

pracucci commented 4 years ago

I talked offline with @kuberkaul and shared some tips. In parallel, I'm investigating a potential issue causing unfair prioritisation within the memcached client due to the usage of the gate. I will open a PR to fix it if confirmed (I will work on it shortly).

joshulyne commented 4 years ago

Hi, I'm also facing the same issue 😞 Any further update on this? I've been looking through some of the suggestions/tips from here and from the Slack channel and have been tweaking memcached configurations, yet I'm still running into timeout or "failed to wait for turn: context canceled" issues while trying to fetch items from memcached. Any advice/update would be much appreciated, thanks!

pracucci commented 4 years ago

I'm still running into timeout or "failed to wait for turn: context canceled" issues

This log means that a query request was cancelled. For example, if you open a Grafana dashboard and refresh it while panels are still loading, the previous queries are cancelled; if they're cancelled while waiting for a memcached connection slot, then you get that error logged.

joshulyne commented 4 years ago

@pracucci Ah, that makes sense -- thanks for the explanation. I'm still running into the timeout errors though :/ Any further development or thoughts on that?

wiardvanrij commented 3 years ago

I'm also currently experiencing this issue while tuning Thanos Store components. Perhaps this issue can be re-opened?

bwplotka commented 3 years ago

Definitely, let's investigate. Help wanted.

wiardvanrij commented 3 years ago

The baseline I started with:

  max_idle_connections: 750
  max_async_concurrency: 100
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 900
  max_get_multi_batch_size: 100

Then I increased max_get_multi_batch_size to 1000 and even to 10000.

What I learned was that, as long as you don't have a massive dashboard with, say, 12 panels going back 12 months in time, the settings do not really matter for the i/o timeouts: I simply do not get them with "sane" queries. Yet increasing max_get_multi_batch_size to 1000+ improved my query times a lot; the results were much smoother.

So the challenge here is: how do I "limit" the stores from basically DDoSing my memcached instances? I figured I would set the get-multi concurrency really low and the batch size quite high.

  max_get_multi_concurrency: 10
  max_get_multi_batch_size: 10000

Basically, there was no difference in query speed, and I would still get i/o errors.

So I figured I would go down this path:

  timeout: 10s
  max_idle_connections: 1
  max_async_concurrency: 1
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 1
  max_get_multi_batch_size: 1000
  dns_provider_update_interval: 10s

I see no difference, and I still get i/o errors when requesting a large dashboard with 30+ days of history (i.e. basically non-sane queries).


From a user perspective: I get the feeling that the settings passed via --index-cache.config-file have no actual effect on the behaviour.

wiardvanrij commented 3 years ago

So, I have fixed this issue for myself, as it was an implementation error on my end xD

After investigating, because the config (per above) did not seem to make any difference, I noticed that the problem was NOT the memcached index cache but the bucket cache.

When I investigated this further, I noticed that I had made a mistake in its config. This was my config:

config:
  addresses: [*redacted*]
  chunk_subrange_size: 16000
  max_chunks_get_range_requests: 3
  chunk_object_attrs_ttl: 24h
  chunk_subrange_ttl: 24h
  blocks_iter_ttl: 5m
  metafile_exists_ttl: 2h
  metafile_doesnt_exist_ttl: 15m
  metafile_content_ttl: 24h
  metafile_max_size: 16MiB

So after fixing my bucket cache config file to the following, it was resolved (note that the caching-bucket options such as chunk_subrange_size and the TTLs belong at the top level, alongside type and config, rather than inside the memcached config block):

type: MEMCACHED
config:
  addresses: [*redacted*]
  max_idle_connections: 750
  max_async_concurrency: 100
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 750
  max_get_multi_batch_size: 1000
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 16MiB

To be honest, a few improvements could be made IMO:

I'm a bit tired, but I will make issues/PRs tomorrow or something ;)

wiardvanrij commented 3 years ago

Created #3666, #3667 and #3668 :)

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

ichasco-heytrade commented 3 years ago

I am getting the same error in the store gateway and in query-frontend.

Thanos version: 0.18.0

thanos-storegateway-0 storegateway level=warn ts=2021-03-12T08:34:46.059016967Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=1 firstKey=exists:01F0JQDJWYRMH85CVZS2D9ASTV/meta.json err="read tcp 10.11.10.192:54558->172.20.72.201:11211: i/o timeout"
thanos-storegateway-0 storegateway level=warn ts=2021-03-12T08:34:46.059230491Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=1 firstKey=exists:01F0JQ3XPVGWYMDDWQ32CX0B76/meta.json err="read tcp 10.11.10.192:54550->172.20.72.201:11211: i/o timeout"
thanos-storegateway-0 storegateway level=warn ts=2021-03-12T08:34:46.05942657Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=1 firstKey=exists:01F0JPZ1BTTCFZ399DGNZ8DHVH/meta.json err="read tcp 10.11.10.192:54540->172.20.72.201:11211: i/o timeout"
thanos-storegateway-0 storegateway level=warn ts=2021-03-12T08:34:46.059636467Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=1 firstKey=exists:01F0JPA1SSJ731HGKR1FCP4FKP/meta.json err="read tcp 10.11.10.192:54536->172.20.72.201:11211: i/o timeout"
thanos-storegateway-0 storegateway level=warn ts=2021-03-12T08:34:47.855608354Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=2 firstKey=content:01F0JRVZBVXZ46758GKQYYHWS2/meta.json err="read tcp 10.11.10.192:54730->172.20.72.201:11211: i/o timeout"

StoreGateway config:

  extraFlags: 
    - --log.format=logfmt
    - --index-cache-size=500MB
    - --chunk-pool-size=1GB
    - --store.grpc.series-max-concurrency=20
    - --sync-block-duration=3m
    - --block-sync-concurrency=20
    - |-
      --index-cache.config="config":
        "addresses":
        - "memcached.monitoring.svc.cluster.local:11211"
        "max_idle_connections": 750
        "max_async_concurrency": 100
        "max_async_buffer_size": 10000000
        "max_item_size": "16MiB"
        "max_get_multi_concurrency": 750
        "max_get_multi_batch_size": 1000
        "timeout": "10s"
      "type": "memcached"
    - |-
      --store.caching-bucket.config="config":
        "addresses":
        - "memcached.monitoring.svc.cluster.local:11211"
        "max_idle_connections": 750
        "max_async_concurrency": 100
        "max_async_buffer_size": 10000000
        "max_item_size": "16MiB"
        "max_get_multi_concurrency": 750
        "max_get_multi_batch_size": 1000
      "type": "memcached"
      "chunk_subrange_size": 16000
      "max_chunks_get_range_requests": 3
      "chunk_object_attrs_ttl": "24h"
      "chunk_subrange_ttl": "24h"
      "blocks_iter_ttl": "5m"
      "metafile_exists_ttl": "2h"
      "metafile_doesnt_exist_ttl": "15m"
      "metafile_content_ttl": "24h"
      "metafile_max_size": "16MiB"

Thanks!

MrYueQ commented 3 years ago

Hi, I am getting the same error.

error log

{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:51390->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CZDVM6GTEW87W89YW4JDEW:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":4125,"ts":"2021-04-05T09:15:15.625479696Z"}
{"caller":"memcached_client.go:382","err":"read tcp 10.1.38.29:41902->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CHPCYEEKE1TDV4GT526A27:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16751,"ts":"2021-04-05T09:15:15.63488144Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:51376->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2BX37C68PA5C9WH9DJBX1PC:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":4036,"ts":"2021-04-05T09:15:15.635151769Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:41920->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CRJ46EPJPAJS9EYWEYKR03:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16705,"ts":"2021-04-05T09:15:15.635206581Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:51396->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2C3YYEE3MY9ZY93HG4QEP06:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16707,"ts":"2021-04-05T09:15:15.635329868Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:47862->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2DD59YDBWQV1QFNYZKWH745:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16649,"ts":"2021-04-05T09:15:15.659835826Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:49368->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2DTWRERTMH5F622XX5904YY:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16698,"ts":"2021-04-05T09:15:15.685193363Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:47844->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2E1RFPASEG1CYTZK6M4V3BY:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16717,"ts":"2021-04-05T09:15:15.685274897Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:47868->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2CZDVEB4E6W15BMRM0NMYYJ:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16692,"ts":"2021-04-05T09:15:15.718721764Z"}
{"caller":"memcached_client.go:382","err":"write tcp 10.1.38.29:41866->10.1.38.48:11211: i/o timeout","firstKey":"P:01F2E1RFPASEG1CYTZK6M4V3BY:p51fuquhpV774_dMrnY-ViH2EEbR1sEwJSSilUV7FjE","level":"warn","msg":"failed to fetch items from memcached","numKeys":16717,"ts":"2021-04-05T09:15:15.719999645Z"}

thanos and system version

thanos, version 0.19.0 (branch: HEAD, revision: ba6c5c4726ff52807c7383c68f2159b1af7980bb)
  build user:       root@b7176da2c64e
  build date:       20210331-12:23:43
  go version:       go1.15.10
  platform:         linux/amd64

uname -ra
Linux hostname 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

systemd

[Unit]
Description=thanos store  service
After=network.target

[Service]
LimitNOFILE=65535
Type=simple
Restart=always
RestartSec=1

ExecStart=/data/app/thanos/bin/thanos store --data-dir=/data/thanos \
                --log.format=json \
                --index-cache-size=512MB  \
                --objstore.config-file=/data/app/thanos/conf.d/object.json \
                --sync-block-duration=1m \
                --ignore-deletion-marks-delay=124h  \
                --store.enable-index-header-lazy-reader \
                --min-time=-1w \
                --max-time=-20h \
                --index-cache.config-file=/data/app/thanos/conf.d/index.json \
                --store.caching-bucket.config-file=/data/app/thanos/conf.d/cache.json

[Install]
WantedBy=multi-user.target

index-cache-config

type: MEMCACHED
config:
  addresses:
    - "ip:port"
  chunk_subrange_size: 16000
  max_chunks_get_range_requests: 3
  chunk_object_attrs_ttl: 24h
  chunk_subrange_ttl: 24h
  blocks_iter_ttl: 5m
  metafile_exists_ttl: 2h
  metafile_doesnt_exist_ttl: 15m
  metafile_content_ttl: 24h
  metafile_max_size: 16MiB

store-cache-bucket

type: MEMCACHED
config:
  addresses:
    - "ip:port"
  max_idle_connections: 1000
  max_async_concurrency: 200
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 750
  max_get_multi_batch_size: 1000
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 16MiB

image

Thanks!

mhulscher commented 3 years ago

What I found is that, for us, i/o timeouts from thanos-store|thanos-query-frontend -> memcached are strongly correlated with memcached hitting its CPU limit. In other words, once memcached hits 100% CPU, we see i/o timeouts.

This, in turn, seems highly correlated with max_get_multi_batch_size. If it is set to 0, memcached chokes very quickly, causing both put-key and get-key operations to time out (i.e. fail). I have now set max_get_multi_batch_size: 1000 and this seems to work for us. Now Thanos seems bound by the CPU of the thanos-store, whose efficiency I plan to improve by adding block sharding or perhaps more CPU.
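For reference, a hedged sketch of how that single knob sits in the memcached client section (the address is a placeholder; only the batch-size setting is the point here):

type: MEMCACHED
config:
  addresses:
    - "<memcached-host>:11211"   # placeholder
  # with 0 the client does not split GetMulti calls into batches; per the
  # observation above, bounding the batch size keeps individual requests
  # small enough for memcached to handle without saturating its CPU
  max_get_multi_batch_size: 1000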

mhulscher commented 3 years ago

Also, I noticed that query results were not stored in memcached by thanos-query-frontend, not only because memcached was hitting 100% CPU but also because our max_item_size: 16MiB was too small. thanos-query-frontend was logging errors returned by memcached saying that the items being stored were too big.

We increased the max item size to 64MiB:

max_item_size: 67108864B

And we reduced the range of the queries (to make the results smaller):

- --labels.split-interval=12h
- --query-range.split-interval=12h
stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗

Antiarchitect commented 2 years ago

Are there any updates on this? I am also suffering from this problem. Memcached 1.6.12; various configs of the Thanos cache and of Memcached itself have been tested.

stats
STAT pid 1
STAT uptime 79
STAT time 1633097669
STAT version 1.6.12
STAT libevent 2.1.8-stable
STAT pointer_size 64
STAT rusage_user 9.217141
STAT rusage_system 25.720073
STAT max_connections 1024
STAT curr_connections 574
STAT total_connections 1291
STAT rejected_connections 0
STAT connection_structures 575
STAT response_obj_oom 0
STAT response_obj_count 61
STAT response_obj_bytes 147456
STAT read_buf_count 227
STAT read_buf_bytes 3719168
STAT read_buf_bytes_free 2572288
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 42291
STAT cmd_set 771773
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 30560
STAT get_misses 11731
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 2337979587
STAT bytes_written 469912325
STAT limit_maxbytes 34359738368
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 17
STAT hash_bytes 1048576
STAT hash_is_expanding 0
STAT slab_reassign_rescues 0
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 0
STAT slab_reassign_busy_items 0
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 0
STAT lru_crawler_running 0
STAT lru_crawler_starts 2
STAT lru_maintainer_juggles 120086
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT log_watchers 0
STAT unexpected_napi_ids 0
STAT round_robin_fallback 0
STAT bytes 449291309
STAT curr_items 129957
STAT total_items 771773
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 0
STAT evicted_active 0
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 127371
STAT lrutail_reflocked 1481
STAT moves_to_cold 681095
STAT moves_to_warm 5587
STAT moves_within_lru 345
STAT direct_reclaims 0
STAT lru_bumps_dropped 0
END
GiedriusS commented 2 years ago

I am currently digging into this one, so reopening. Still valid.

Antiarchitect commented 2 years ago

Some errors from my side:

level=warn ts=2021-10-01T14:36:29.347655096Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=92 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:344480000:344496000 err=EOF
level=warn ts=2021-10-01T14:36:29.347697715Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=392 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:428720000:428736000 err=EOF
level=warn ts=2021-10-01T14:36:29.348214182Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=72 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:349008000:349024000 err="memcache: unexpected line in get response: \"ERROR\\r\\n\""
level=warn ts=2021-10-01T14:36:29.348253084Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=61 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:450112000:450128000 err="memcache: unexpected line in get response: \"ERROR\\r\\n\""
level=warn ts=2021-10-01T14:36:29.348263002Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=95 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:346720000:346736000 err=EOF
level=warn ts=2021-10-01T14:36:29.348667642Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=236 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:451984000:452000000 err=EOF

There are also the good old i/o timeout and read/write broken pipe errors.

GiedriusS commented 2 years ago

https://github.com/thanos-io/thanos/pull/4742 should greatly improve the situation. @Antiarchitect, if you have time, maybe you could try running Thanos with that pull request?

Antiarchitect commented 2 years ago

@GiedriusS - that's problematic in my situation, as I use Bitnami Thanos images with a lot of magic inside. So I hope your PR gets accepted, and I'll try the new release that includes it ASAP.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

uvaisibrahim commented 2 years ago

Is there any update on this?

I am also getting the errors below while trying to query historical data.

level=warn ts=2022-03-08T21:07:40.894536137Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=6 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:342336000:342352000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899116373Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=21 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:376128000:376144000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899390159Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=101 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:364272000:364288000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899632943Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=45 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:335872000:335888000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899892628Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=78 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:339424000:339440000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.900115794Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=16 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:366992000:367008000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901373702Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=3 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:366416000:366432000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901437644Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=86 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:343088000:343104000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901620772Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=6 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:354096000:354112000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901786015Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=77 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:378976000:378992000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901964644Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=53 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:400560000:400576000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.902197572Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=37 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:383280000:383296000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.902400874Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=18 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:377664000:377680000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917716417Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=48 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:230000000:230016000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917780801Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=113 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:238448000:238464000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917982688Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=126 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:235616000:235632000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.918115958Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=51 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:225808000:225824000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.918333518Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=122 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:232976000:232992000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.947289769Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=104 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:388032000:388048000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.947434684Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=37 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:355552000:355568000 err="memcache: connect timeout to 10.98.42.174:11211"

I have two questions:

  1. Where does the bucket cache config help? Does it help to fetch historical data (which is not available in the Prometheus/Thanos sidecar pod) quickly when querying, or does it help to load the indexes into memory while starting the Thanos Store pods?
  2. How can I figure out the errors above?

My Thanos Store memcached configurations are given below. Index caching config (memcached):

type: memcached
config:
  addresses: [thanos-memcached.kts-plb-duck-plb-w2.svc.cluster.local:11211]
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 1000
  max_async_buffer_size: 0
  max_get_multi_concurrency: 0
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
  max_item_size: 0

Bucket caching config (memcached):

type: memcached
config:
  addresses: [thanos-memcached.kts-plb-duck-plb-w2.svc.cluster.local:11211]
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 1000
  max_async_buffer_size: 0
  max_item_size: 0
  max_get_multi_concurrency: 0
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h

I had set the value 0 for max_get_multi_batch_size, max_async_buffer_size, max_item_size, and max_get_multi_concurrency to allow unlimited capacity, so that the limits themselves should not cause any issue.
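For comparison, a hedged sketch of the same memcached client settings with bounded values along the lines used earlier in this thread (the address is a placeholder and the numbers are only illustrative, not a tested recommendation):

type: memcached
config:
  addresses: ["thanos-memcached.<namespace>.svc.cluster.local:11211"]  # placeholder
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 100        # bounded, as in earlier configs in this thread
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 100
  max_get_multi_batch_size: 1000    # bounded batches, per the earlier comments
  dns_provider_update_interval: 10s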

Thanks in advance

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

carlosjgp commented 1 year ago

Still valid!

robert-becker-hs commented 1 year ago

After trying all the suggestions from this post without any success, one of our engineers eventually found out that the issue was a connection problem with the Memcached nodes. We are running Thanos in EKS and are using ElastiCache Memcached with 8 nodes. We have 18 thanos-store pods running on Kubernetes nodes.

Logs

level=warn ts=2022-03-08T21:07:40.918115958Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=51 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:225808000:225824000 err="memcache: connect timeout to 1.22.1.2:11211"

nslookup and netcat

## Using nslookup to get IP from Memcache cluster
$ nslookup thanos-production.xxx.cache.amazonaws.com
...
Non-authoritative answer:
Name:   thanos-production.xxx.cache.amazonaws.com
Address: 1.22.1.2

## Trying to ping IP and port using network
$ nc -zv 1.22.1.2 11211
# nc command hangs and there is no output

Outcome

As it turns out, we were sometimes able to run netcat successfully from our Kubernetes nodes, while other attempts would just hang and could not connect to the Memcached cluster. After restarting the problematic Memcached nodes, things started working, and now all thanos-store pods are able to connect to Memcached. We don't know the root cause of the connection issue, but it's a step in the right direction. Maybe this helps some other folks!

damnever commented 1 year ago

Perhaps the network bandwidth has been exhausted; this is our situation.

percy78 commented 2 months ago

As this issue is still open, I'm taking the opportunity to describe a similar fault that maybe somebody else has come across. Last year we deployed our on-prem observability cluster for dev, stg, and prd. The three applications we're currently running are Loki for logs, Mimir for metrics, and Tempo for traces. A Grafana server and Keycloak are also part of the same cluster, backed by MariaDB in a Galera cluster, all in Kubernetes (RKE2).

As the cluster spans three different sites, we have worker nodes spread evenly between the two main sites, plus a small site to guard against split-brain. Thus we're using zone-aware replication for the Loki, Mimir, and Tempo ingesters and for Mimir's store gateway and Alertmanager.

For the backend, we're storing data in MinIO deployed in the same cluster, one MinIO cluster per site, with replication between them. MinIO is deployed only on the two main sites, using active-active replication between them.

All had been working fine up until a week ago, when Mimir in PROD suddenly started to behave badly. We first noticed this when the store gateway gave its first "bucket-index file too old" error. I had to log on to MinIO with mc and delete the bucket-index along with its previous versions, of which there were a lot, surprisingly, since I thought I had a lifecycle rule in place allowing only 5 non-current versions.

Anyhow, after deleting these we were able to query the data again, although this time with horrendous performance. We have 30 days of retention in MinIO, and although we can retrieve data, it takes forever. The logs for the store gateway pods give this:

2024-07-09T16:55:31.004840484+03:00 ts=2024-07-09T13:55:31.002044734Z caller=memcached_client.go:332 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=11 firstKey=subrange:anonymous/01J2AQ6M6338P7TVZ1CY791PKS/chunks/000001:34336000:34352000 err="memcache: connect timeout to 172.16.3.204:11211"

I've checked the memory and CPU on the worker nodes hosting MinIO; they have 4 CPUs and 16GB each, four on each site, and not one gives any indication that we're running out of memory or CPU.

I have deleted Mimir completely, deleted the cache data in the PVC for the store gateway, and deleted the MinIO deployment, all to no avail. I'm out of ideas now.

Could it be that the bucket, which is now 5+ TB, is too big for the store gateway to trundle through? That doesn't make sense to me, as S3 buckets can be petabytes in size.

robert-becker-hs commented 2 months ago

@percy78 Try restarting your Memcached nodes one at a time, or find out which node the IP in the connection timeout above belongs to. This "solved" the issue for us.