thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Troubleshoot connecting Thanos QueryFrontend to AWS ElastiCache Redis with TLS #6732

Open vaillani opened 1 year ago

vaillani commented 1 year ago

Thanos QueryFrontend fails to connect to an AWS ElastiCache Redis instance with TLS enabled.

Thanos, Prometheus and Golang version used:

Thanos v0.32.2 with AWS ElastiCache Redis 6.2.6, with encryption in transit enabled

This is the configuration used:

--query-range.response-cache-config=
    config:
        addr: XXX.cache.amazonaws.com:6379
        tls_enabled: true

    type: "redis"

Object Storage Provider: AWS

What happened:

When I tried to connect to an AWS ElastiCache Redis cluster with TLS encryption in transit, I got a connection error: context deadline exceeded.

I think the cause is missing root certificates: when I used an Alpine image and installed the root certificates, which include Amazon_Root_CA, the following command connected successfully:

redis-cli -h XXX.cache.amazonaws.com -p 6379 --tls

I tried to add those certificates with an initContainer but I got the same connection issue.
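One way to test the root-certificate theory independently of Thanos is a small Go program that performs only the TLS handshake against the Redis endpoint. This is a minimal sketch, not part of Thanos: the function names `newTLSConfig` and `checkHandshake`, the 5-second timeout, and the command-line layout are assumptions made for illustration.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// newTLSConfig (hypothetical helper) trusts the system roots plus any extra
// PEM-encoded CA certificates passed in extraCAPEM, which may be nil.
func newTLSConfig(serverName string, extraCAPEM []byte) (*tls.Config, error) {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		// Minimal images may ship without a system bundle; start empty.
		pool = x509.NewCertPool()
	}
	if len(extraCAPEM) > 0 && !pool.AppendCertsFromPEM(extraCAPEM) {
		return nil, errors.New("no valid certificates found in extra CA PEM")
	}
	return &tls.Config{
		RootCAs:    pool,
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}, nil
}

// checkHandshake (hypothetical helper) dials addr and completes a TLS
// handshake; the timeout covers both the TCP connect and the handshake.
func checkHandshake(addr string, cfg *tls.Config, timeout time.Duration) error {
	dialer := &net.Dialer{Timeout: timeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, cfg)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: tlscheck host:port [extra-ca.pem]")
		return
	}
	addr := os.Args[1]
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		fmt.Println("bad address:", err)
		os.Exit(1)
	}
	var extraCA []byte
	if len(os.Args) > 2 {
		if extraCA, err = os.ReadFile(os.Args[2]); err != nil {
			fmt.Println("reading CA file:", err)
			os.Exit(1)
		}
	}
	cfg, err := newTLSConfig(host, extraCA)
	if err != nil {
		fmt.Println("building TLS config:", err)
		os.Exit(1)
	}
	if err := checkHandshake(addr, cfg, 5*time.Second); err != nil {
		fmt.Println("TLS handshake failed:", err)
		os.Exit(1)
	}
	fmt.Println("TLS handshake succeeded")
}
```

If the handshake succeeds from inside the same container image that Thanos runs in, the certificates are probably not the problem. A certificate error here would instead support the missing-root-CA explanation.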

What you expected to happen:

Thanos QueryFrontend connects successfully to the ElastiCache Redis cluster with TLS.

Full logs to relevant components:

ts=2023-09-18T16:59:05.422890922Z caller=main.go:135 level=error err="creating redis client: context deadline exceeded\ngithub.com/thanos-io/thanos/internal/cortex/chunk/cache.New\n\t/app/internal/cortex/chunk/cache/cache.go:108\ngithub.com/thanos-io/thanos/internal/cortex/querier/queryrange.NewResultsCacheMiddleware\n\t/app/internal/cortex/querier/queryrange/results_cache.go:187\ngithub.com/thanos-io/thanos/pkg/queryfrontend.newQueryRangeTripperware\n\t/app/pkg/queryfrontend/roundtrip.go:199\ngithub.com/thanos-io/thanos/pkg/queryfrontend.NewTripperware\n\t/app/pkg/queryfrontend/roundtrip.go:58\nmain.runQueryFrontend\n\t/app/cmd/thanos/query_frontend.go:254\nmain.registerQueryFrontend.func1\n\t/app/cmd/thanos/query_frontend.go:160\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\ncreate results cache middleware\ngithub.com/thanos-io/thanos/pkg/queryfrontend.newQueryRangeTripperware\n\t/app/pkg/queryfrontend/roundtrip.go:211\ngithub.com/thanos-io/thanos/pkg/queryfrontend.NewTripperware\n\t/app/pkg/queryfrontend/roundtrip.go:58\nmain.runQueryFrontend\n\t/app/cmd/thanos/query_frontend.go:254\nmain.registerQueryFrontend.func1\n\t/app/cmd/thanos/query_frontend.go:160\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nsetup tripperwares\nmain.runQueryFrontend\n\t/app/cmd/thanos/query_frontend.go:256\nmain.registerQueryFrontend.func1\n\t/app/cmd/thanos/query_frontend.go:160\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\npreparing query-frontend command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"
douglascamata commented 1 year ago

The Redis cache configuration accepts a CA file; see the official docs: https://thanos.io/tip/components/store.md/#redis-index-cache

vaillani commented 1 year ago

Thanks for the answer. I have already tried this: the ca-certificates package contains 4 Amazon Root certificates, so I merged them into a single file and added its path to tls_config:

--query-range.response-cache-config=
    config:
          addr: XXX.cache.amazonaws.com:6379
          tls_enabled: true
          tls_config:
              ca_file: /etc/ssl/certs/ca-cert-Amazon_Root_CA.pem
    type: "redis"

I got the same issue: creating redis client: context deadline exceeded
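Before ruling out the CA file, it may help to confirm that the merged bundle actually parses and contains the expected certificates. The following is a minimal sketch; the helper name `subjectsInBundle` is hypothetical, and the file path is taken from the command line rather than from any Thanos setting.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// subjectsInBundle (hypothetical helper) walks a PEM bundle and returns the
// subject common name of every CERTIFICATE block it can parse.
func subjectsInBundle(bundle []byte) ([]string, error) {
	var subjects []string
	rest := bundle
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break // no more PEM blocks in the input
		}
		if block.Type != "CERTIFICATE" {
			continue // skip keys or other PEM material
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return subjects, fmt.Errorf("parsing certificate: %w", err)
		}
		subjects = append(subjects, cert.Subject.CommonName)
	}
	return subjects, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: cabundle path/to/bundle.pem")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("reading bundle:", err)
		os.Exit(1)
	}
	subjects, err := subjectsInBundle(data)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Printf("%d certificate(s) found\n", len(subjects))
	for _, s := range subjects {
		fmt.Println(" -", s)
	}
}
```

A bundle that lists all four Amazon roots here is well formed, which would point the investigation away from the certificate file and toward how the TLS settings are consumed.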

douglascamata commented 1 year ago

You can use the insecure option to skip the cert check. 👀

Otherwise unfortunately I can't help anymore, I don't have this kind of setup.

vaillani commented 1 year ago

I also tried skipping certificate verification with insecure_skip_verify: true, but it doesn't seem to have any impact on my issue.

douglascamata commented 1 year ago

Did you already try without in-transit encryption? This seems weird... it's like a timeout somewhere.

vaillani commented 1 year ago

Yes, for the moment we use ElastiCache Redis without TLS encryption, and it works well.

mhamzahkhan commented 1 year ago

I'm also experiencing this issue. I have redis configured with TLS, but query-frontend cannot connect to it.

I dug into the code a bit, and from what I can tell I don't think TLS has been implemented yet? https://github.com/thanos-io/thanos/blob/main/pkg/queryfrontend/config.go#L162

The NewCacheConfig parser for Redis doesn't seem to pass any TLS options over to the cortex cache config, so it doesn't get enabled.

As a quick test, I made the following change:

diff --git a/pkg/queryfrontend/config.go b/pkg/queryfrontend/config.go
index a5655199..80e7f3f0 100644
--- a/pkg/queryfrontend/config.go
+++ b/pkg/queryfrontend/config.go
@@ -166,6 +166,8 @@ func NewCacheConfig(logger log.Logger, confContentYaml []byte) (*cortexcache.Con
                                Expiration: config.Expiration,
                                DB:         config.Redis.DB,
                                Password:   flagext.Secret{Value: config.Redis.Password},
+                               EnableTLS:  true,
+                               InsecureSkipVerify:  true,
                        },
                        Background: cortexcache.BackgroundConfig{
                                WriteBackBuffer:     config.Redis.MaxSetMultiConcurrency * config.Redis.SetMultiBatchSize,

After recompiling, query-frontend is able to connect to my Redis using TLS.

Unfortunately my Golang skills are quite limited, so I don't know how to fix this properly.
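A fix along these lines would presumably copy the TLS settings from the parsed YAML configuration rather than hard-coding them to true. The sketch below illustrates the idea with trimmed-down stand-in types; it is not the actual Thanos or Cortex code. Only the field names `EnableTLS` and `InsecureSkipVerify` come from the diff above; `redisClientConfig`, `tlsConfig`, `cortexRedisConfig`, `Endpoint`, and `toCortexRedisConfig` are hypothetical names used for illustration.

```go
package main

import "fmt"

// tlsConfig is a hypothetical stand-in for the tls_config YAML section.
type tlsConfig struct {
	InsecureSkipVerify bool
}

// redisClientConfig is a hypothetical stand-in for the user-facing Redis
// configuration (addr, tls_enabled, tls_config).
type redisClientConfig struct {
	Addr       string
	TLSEnabled bool
	TLSConfig  tlsConfig
}

// cortexRedisConfig is a hypothetical stand-in for the cortex cache Redis
// config; EnableTLS and InsecureSkipVerify are the fields set in the diff.
type cortexRedisConfig struct {
	Endpoint           string
	EnableTLS          bool
	InsecureSkipVerify bool
}

// toCortexRedisConfig forwards the TLS settings instead of dropping them
// (the reported behavior) or hard-coding them (the quick test above).
func toCortexRedisConfig(c redisClientConfig) cortexRedisConfig {
	return cortexRedisConfig{
		Endpoint:           c.Addr,
		EnableTLS:          c.TLSEnabled,
		InsecureSkipVerify: c.TLSConfig.InsecureSkipVerify,
	}
}

func main() {
	in := redisClientConfig{
		Addr:       "XXX.cache.amazonaws.com:6379",
		TLSEnabled: true,
	}
	fmt.Printf("%+v\n", toCortexRedisConfig(in))
}
```

With this mapping, tls_enabled: true in the user configuration would enable TLS on the cache client, and certificate verification would stay on unless insecure_skip_verify is set explicitly.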

gnomeria commented 10 months ago

Would this then be categorized as a bug, since insecure_skip_verify and tls_enabled are not passed down during conversion to the cortexcache.RedisConfig?

kaiohenricunha commented 1 month ago

Would this then be categorized as a bug, since insecure_skip_verify and tls_enabled are not passed down during conversion to the cortexcache.RedisConfig?

I guess so. Is there any workaround? I'm having the same issue with the Thanos Store index cache when applying this configuration:

indexCacheConfig: |
  addr: master.raas-thanos-storegateway.xxx.xxx.cache.amazonaws.com:6379  # Redis address and port
  db: 0                # Redis database (default: 0)
  dial_timeout: 20s    # Dial timeout (default: 5s)
  read_timeout: 20s    # Read timeout (default: 3s)
  write_timeout: 20s   # Write timeout (default: 3s)

  # Concurrency and batch settings
  max_get_multi_concurrency: 50  # Max concurrent GET operations (default: 50)
  get_multi_batch_size: 50       # Batch size for GET operations (default: 50)
  max_set_multi_concurrency: 50  # Max concurrent SET operations (default: 50)
  set_multi_batch_size: 50       # Batch size for SET operations (default: 50)

  # Cache and async settings
  cache_size: 128MB            # Cache size (default: 128MB)
  max_async_buffer_size: 5000  # Async buffer size (default: 5000)
  max_async_concurrency: 20    # Async concurrency (default: 20)

  # Circuit breaker settings
  set_async_circuit_breaker_config:
    enabled: true               # Circuit breaker enabled (default: true)
    half_open_max_requests: 10  # Max requests during half-open state (default: 10)
    open_duration: 5s           # Circuit breaker open duration (default: 5s)
    min_requests: 50            # Min requests before tracking failures (default: 50)
    consecutive_failures: 5     # Consecutive failures to open circuit breaker (default: 5)
    failure_percent: 0.05       # Failure percentage to open circuit breaker (default: 0.05 or 5%)

  # Caching specific items (empty by default)
  enabled_items: [""]  # Default: empty (all items cached by default). Possible values: Postings, Series, ExpandedPostings.

  # Time-to-live (TTL) for cached items
  ttl: 30m  # TTL for cached index items (default: 30 minutes)

I use the Bitnami Helm chart.