thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Memcached auto_discovery: panic: runtime error: invalid memory address or nil pointer dereference #5754

Open · r0bj opened this issue 1 year ago

r0bj commented 1 year ago

Thanos, Prometheus and Golang version used:

thanos version: v0.28.0

Object Storage Provider: GCP

What happened: Using memcached as an index cache with auto_discovery enabled causes the Thanos storegateway to panic.

What you expected to happen: Using memcached as an index cache with auto_discovery enabled works as expected.

How to reproduce it (as minimally and precisely as possible): Memcached index cache (--index-cache.config-file) for storegateway:

    type: memcached
    config:
      addresses:
      - dnssrv+_memcached._tcp.memcached-index-cache.thanos.svc.cluster.local
      dns_provider_update_interval: 10s
      max_async_buffer_size: 10000
      max_async_concurrency: 20
      max_get_multi_batch_size: 0
      max_get_multi_concurrency: 100
      max_idle_connections: 100
      max_item_size: 1MiB
      timeout: 500ms
      auto_discovery: true

I've also tried the address in a different form: memcached-index-cache:11211. The result is the same.

The address resolves to three IPs:

    # host memcached-index-cache.thanos
    memcached-index-cache.thanos.svc.cluster.local has address 10.200.221.196
    memcached-index-cache.thanos.svc.cluster.local has address 10.200.244.105
    memcached-index-cache.thanos.svc.cluster.local has address 10.200.28.253

Storegateway panics with:

    level=info ts=2022-10-03T19:46:26.06869081Z caller=factory.go:50 msg="loading bucket configuration"
    level=info ts=2022-10-03T19:46:26.069191966Z caller=caching_bucket_factory.go:76 msg="loading caching bucket configuration"
    level=info ts=2022-10-03T19:46:26.071479623Z caller=memcached.go:49 msg="created memcached cache"
    level=info ts=2022-10-03T19:46:26.071695662Z caller=factory.go:35 msg="loading index cache configuration"
    level=warn ts=2022-10-03T19:46:26.07426665Z caller=provider.go:70 msg="failed to perform auto-discovery for memcached" address=memcached-index-cache:11211
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1486717]

    goroutine 1 [running]:
    github.com/thanos-io/thanos/pkg/discovery/memcache.(*Provider).Resolve(0xc000ff4620, {0x25861f8, 0xc000ff6900}, {0xc000fe8c80, 0x1, 0x8c1cf4?})
        /app/pkg/discovery/memcache/provider.go:92 +0x2b7
    github.com/thanos-io/thanos/pkg/cacheutil.(*memcachedClient).resolveAddrs(0xc000ff8120)
        /app/pkg/cacheutil/memcached_client.go:661 +0xb0
    github.com/thanos-io/thanos/pkg/cacheutil.newMemcachedClient({0x256cf60, 0xc000f500a0}, {0x2577d70?, 0xc000fda3c0}, {0x2580800?, 0xc000fd3950}, {{0xc000fe8c80, 0x1, 0x1}, 0x1dcd6500, ...}, ...)
        /app/pkg/cacheutil/memcached_client.go:362 +0x153a
    github.com/thanos-io/thanos/pkg/cacheutil.NewMemcachedClientWithConfig({0x256cf60, 0xc000f500a0}, {0x20c590b, 0xb}, {{0xc000fe8c80, 0x1, 0x1}, 0x1dcd6500, 0x64, 0x14, ...}, ...)
        /app/pkg/cacheutil/memcached_client.go:250 +0x2a5
    github.com/thanos-io/thanos/pkg/cacheutil.NewMemcachedClient({0x256cf60, 0xc000f500a0}, {0x20c590b, 0xb}, {0xc000fb6a00?, 0x5e4401?, 0xc000fb6600?}, {0x257ea00, 0xc000f50fa0})
        /app/pkg/cacheutil/memcached_client.go:230 +0x168
    github.com/thanos-io/thanos/pkg/store/cache.NewIndexCache({0x256cf60, 0xc000f500a0}, {0xc000f461a0, 0x189, 0x1a0}, {0x257ea00, 0xc000f50fa0})
        /app/pkg/store/cache/factory.go:52 +0x33a
    main.runStore(_, {_, _}, _, {_, _}, {_, _, _}, {0xc0009cd1f0, ...}, ...)
        /app/cmd/thanos/store.go:289 +0x912
    main.registerStore.func1(0x1e56c60?, {0x256cf60, 0xc000f500a0}, 0x6?, {0x257e9d0, 0x37f6778}, 0xc000e43cf0?, 0x47?)
        /app/cmd/thanos/store.go:195 +0x288
    main.main()
        /app/cmd/thanos/main.go:130 +0x1126
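The trace shows the warning from provider.go:70 immediately followed by a nil pointer dereference at provider.go:92 inside `(*Provider).Resolve`. One plausible reading, not verified against the v0.28.0 source, is that `Resolve` logs the failed lookup and then keeps using a lookup result that is nil. The Go sketch below reproduces that pattern in isolation; `clusterConfig`, `discover`, `resolveUnsafe`, and `resolveSafe` are all hypothetical stand-ins, not Thanos's actual types or functions.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterConfig is a hypothetical stand-in for an auto-discovery result.
type clusterConfig struct {
	nodes []string
}

// discover is a hypothetical lookup that fails the way the log suggests:
// it returns a nil config together with an error.
func discover(address string) (*clusterConfig, error) {
	return nil, errors.New("auto-discovery failed for " + address)
}

// resolveUnsafe logs the failure but carries on and reads config.nodes,
// which dereferences a nil pointer and panics.
func resolveUnsafe(address string) []string {
	config, err := discover(address)
	if err != nil {
		fmt.Println("warn: failed to perform auto-discovery for memcached, address =", address)
	}
	return config.nodes
}

// resolveSafe returns the error instead of touching the nil config.
func resolveSafe(address string) ([]string, error) {
	config, err := discover(address)
	if err != nil {
		return nil, fmt.Errorf("resolve %s: %w", address, err)
	}
	return config.nodes, nil
}

// didPanic reports whether f panicked, recovering so the caller keeps running.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	if _, err := resolveSafe("memcached-index-cache:11211"); err != nil {
		fmt.Println("safe:", err)
	}
	fmt.Println("unsafe panicked:", didPanic(func() { resolveUnsafe("memcached-index-cache:11211") }))
}
```

Under this reading, the lookup failure itself would still need explaining, but returning or skipping on the error, as `resolveSafe` does, would turn the crash into an ordinary startup error.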
verejoel commented 1 year ago

Same problem here. Interestingly, the same config works for the query-frontend results cache.

giskou commented 3 weeks ago

This is still an issue with v0.35.1. Is there a workaround available?