thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
Apache License 2.0

Troubleshoot connecting Thanos QueryFrontend to AWS ElastiCache Redis with TLS #6732

Open vaillani opened 1 year ago

vaillani commented 1 year ago

Thanos QueryFrontend fails to connect to an AWS ElastiCache Redis cluster with TLS enabled.

Thanos, Prometheus and Golang version used:

Thanos v0.32.2 with AWS ElastiCache Redis 6.2.6, encryption in transit enabled

This is the configuration used:

        tls_enabled: true

    type: "redis"

Object Storage Provider: AWS

What happened:

When I tried to connect to an AWS ElastiCache Redis cluster with TLS in transit, I got a connection error: context deadline exceeded.

I think the cause is missing root certificates, because when I used an Alpine image and installed the root certificates, which include Amazon_Root_CA, the connection worked:

redis-cli -p 6379 --tls

I tried to add those certificates with an initContainer but I got the same connection issue.
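
To test the root-certificate theory independently of Thanos, the trust setup can be reproduced in a few lines of Go. This is only a sketch: `newTLSConfig` and the `REDIS_CA_FILE` environment variable are hypothetical names introduced here, not part of Thanos. It relies on the standard library behaviour that a nil `RootCAs` makes Go verify against the host's root CA set, which is one way an image without a CA bundle could behave differently from Alpine with the root certificates installed.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// newTLSConfig builds a client-side TLS configuration.
// With an empty caPEM, RootCAs stays nil and Go verifies against the host's
// root CA set; otherwise only the certificates in caPEM are trusted.
func newTLSConfig(caPEM []byte, serverName string) (*tls.Config, error) {
	cfg := &tls.Config{
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}
	if len(caPEM) == 0 {
		return cfg, nil
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("CA bundle contains no usable PEM certificates")
	}
	cfg.RootCAs = pool
	return cfg, nil
}

func main() {
	var caPEM []byte
	// REDIS_CA_FILE is a hypothetical variable used only by this sketch.
	if path := os.Getenv("REDIS_CA_FILE"); path != "" {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Println("reading CA file:", err)
			return
		}
		caPEM = data
	}
	cfg, err := newTLSConfig(caPEM, "")
	if err != nil {
		fmt.Println("building TLS config:", err)
		return
	}
	fmt.Println("using custom root pool:", cfg.RootCAs != nil)
}
```

If the bundle mounted by the initContainer is rejected here, the problem is the file itself; if it is accepted, the certificates are probably not what blocks the connection.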

What you expected to happen:

Thanos QueryFrontend connects successfully to the ElastiCache Redis cluster with TLS.

Full logs to relevant components:

ts=2023-09-18T16:59:05.422890922Z caller=main.go:135 level=error err="creating redis client: context deadline exceeded\\n\t/app/internal/cortex/chunk/cache/cache.go:108\\n\t/app/internal/cortex/querier/queryrange/results_cache.go:187\\n\t/app/pkg/queryfrontend/roundtrip.go:199\\n\t/app/pkg/queryfrontend/roundtrip.go:58\nmain.runQueryFrontend\n\t/app/cmd/thanos/query_frontend.go:254\nmain.registerQueryFrontend.func1\n\t/app/cmd/thanos/query_frontend.go:160\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\ncreate results cache middleware\\n\t/app/pkg/queryfrontend/roundtrip.go:211\\n\t/app/pkg/queryfrontend/roundtrip.go:58\nmain.runQueryFrontend\n\t/app/cmd/thanos/query_frontend.go:254\nmain.registerQueryFrontend.func1\n\t/app/cmd/thanos/query_frontend.go:160\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nsetup tripperwares\nmain.runQueryFrontend\n\t/app/cmd/thanos/query_frontend.go:256\nmain.registerQueryFrontend.func1\n\t/app/cmd/thanos/query_frontend.go:160\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\npreparing query-frontend command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"

douglascamata commented 1 year ago

The Redis cache configuration accepts a CA file as an option; see the official docs.

vaillani commented 1 year ago

Thanks for the answer. I have already tried this: the ca-certificates package contains 4 Amazon Root certificates, so I merged them into a single file and added its path to tls_config:

          tls_enabled: true
              ca_file: /etc/ssl/certs/ca-cert-Amazon_Root_CA.pem
    type: "redis"

I got the same issue: creating redis client: context deadline exceeded
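
A merged bundle like this can be sanity-checked before pointing Thanos at it. The sketch below is an assumption about how one might do that, not anything Thanos provides: the hypothetical helper `parseBundle` walks the PEM file with the Go standard library and parses each certificate, so a bad concatenation (for example a missing newline between two certificates) shows up as an error or as a lower count than the expected 4.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// parseBundle returns every certificate found in a PEM bundle.
// Non-certificate PEM blocks are skipped; a malformed certificate is an error.
func parseBundle(bundle []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	rest := bundle
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			// No further PEM blocks in the input.
			return certs, nil
		}
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, fmt.Errorf("certificate %d: %w", len(certs)+1, err)
		}
		certs = append(certs, cert)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkbundle <path-to-pem-bundle>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("reading bundle:", err)
		return
	}
	certs, err := parseBundle(data)
	if err != nil {
		fmt.Println("parsing bundle:", err)
		return
	}
	fmt.Println("certificates found:", len(certs))
	for _, c := range certs {
		fmt.Printf("  %s (expires %s)\n", c.Subject.CommonName, c.NotAfter.Format("2006-01-02"))
	}
}
```

Run against the file from the configuration above, this would be expected to list the four Amazon Root certificates if the merge went well.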

douglascamata commented 1 year ago

You can use the insecure option to skip the cert check. 👀

Otherwise, unfortunately, I can't help any further; I don't have this kind of setup.

vaillani commented 1 year ago

I also tried skipping verification with the option insecure_skip_verify: true; it doesn't seem to have any impact on my issue.

douglascamata commented 1 year ago

Did you already try without in-transit encryption? This seems weird... it looks like a timeout somewhere.

vaillani commented 1 year ago

Yes, for the moment we use ElastiCache Redis without TLS encryption and it works well.
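
Since the plaintext setup works and the TLS one times out, it can help to separate "the endpoint is unreachable" from "the TLS handshake never completes". The following is a minimal sketch using only the Go standard library; `probeTCP`, `probeTLS`, `probeTimeout` and the default address are names and values assumed for illustration. One possible reading of a timeout, stated as an assumption rather than a diagnosis, is that a client speaking plaintext to a TLS-only endpoint may stall or be reset instead of failing with a certificate error.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"os"
	"time"
)

// probeTimeout bounds each attempt so a stalled handshake fails fast.
const probeTimeout = 2 * time.Second

// probeTCP reports whether a plain TCP connection to addr can be opened.
func probeTCP(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, probeTimeout)
	if err != nil {
		return fmt.Errorf("tcp dial %s: %w", addr, err)
	}
	return conn.Close()
}

// probeTLS reports whether a TLS handshake with addr completes.
// A nil cfg means default settings, including verification against the
// host's root CA set.
func probeTLS(addr string, cfg *tls.Config) error {
	dialer := &net.Dialer{Timeout: probeTimeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, cfg)
	if err != nil {
		return fmt.Errorf("tls handshake %s: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	// Hypothetical default; pass the real host:port as the first argument.
	addr := "127.0.0.1:6379"
	if len(os.Args) > 1 {
		addr = os.Args[1]
	}
	if err := probeTCP(addr); err != nil {
		fmt.Println("TCP:", err)
		return
	}
	fmt.Println("TCP: reachable")
	if err := probeTLS(addr, nil); err != nil {
		fmt.Println("TLS:", err)
		return
	}
	fmt.Println("TLS: handshake completed")
}
```

If both probes succeed from inside the query-frontend pod, network access and certificates are fine, and the remaining suspect is whether the client enables TLS at all.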

mhamzahkhan commented 11 months ago

I'm also experiencing this issue. I have redis configured with TLS, but query-frontend cannot connect to it.

I dug into the code a bit, and from what I can tell I don't think TLS has been implemented yet?

The NewCacheConfig parser for Redis doesn't seem to pass any TLS options over to the cortex cache config, so it doesn't get enabled.

As a quick test, I made the following change:

diff --git a/pkg/queryfrontend/config.go b/pkg/queryfrontend/config.go
index a5655199..80e7f3f0 100644
--- a/pkg/queryfrontend/config.go
+++ b/pkg/queryfrontend/config.go
@@ -166,6 +166,8 @@ func NewCacheConfig(logger log.Logger, confContentYaml []byte) (*cortexcache.Con
                                Expiration: config.Expiration,
                                DB:         config.Redis.DB,
                                Password:   flagext.Secret{Value: config.Redis.Password},
+                               EnableTLS:  true,
+                               InsecureSkipVerify:  true,
                        Background: cortexcache.BackgroundConfig{
                                WriteBackBuffer:     config.Redis.MaxSetMultiConcurrency * config.Redis.SetMultiBatchSize,

After recompiling, query-frontend is able to connect to my Redis using TLS.

Unfortunately my Golang skills are quite limited, so I don't know how to fix this properly.
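
A less hard-coded version of the diff above would copy the TLS settings from the parsed configuration instead of forcing them to true. The sketch below only illustrates that mapping and is not the actual Thanos code: both struct types are hypothetical stand-ins, the `TLSEnabled` and `Addr` field names on the input side and `Endpoint` on the output side are assumptions, and only `EnableTLS` and `InsecureSkipVerify` are taken from the diff.

```go
package main

import "fmt"

// thanosRedisConfig is a hypothetical stand-in for the Redis section that
// query-frontend parses from YAML (tls_enabled, tls_config.insecure_skip_verify).
type thanosRedisConfig struct {
	Addr               string
	TLSEnabled         bool
	InsecureSkipVerify bool
}

// cortexRedisConfig is a hypothetical stand-in for cortexcache.RedisConfig.
// Only EnableTLS and InsecureSkipVerify are known from the diff.
type cortexRedisConfig struct {
	Endpoint           string
	EnableTLS          bool
	InsecureSkipVerify bool
}

// toCortexRedisConfig copies the TLS options instead of hard-coding them,
// so a configuration without TLS keeps connecting in plaintext.
func toCortexRedisConfig(in thanosRedisConfig) cortexRedisConfig {
	return cortexRedisConfig{
		Endpoint:           in.Addr,
		EnableTLS:          in.TLSEnabled,
		InsecureSkipVerify: in.InsecureSkipVerify,
	}
}

func main() {
	in := thanosRedisConfig{
		Addr:       "redis.example.internal:6379", // placeholder address
		TLSEnabled: true,
	}
	fmt.Printf("%+v\n", toCortexRedisConfig(in))
}
```

This does not cover ca_file: whether the Cortex cache config can accept a custom CA is not shown in the diff, so that part is left open here.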

gnomeria commented 8 months ago

Would this then be categorized as a bug, since insecure_skip_verify and tls_enabled are not passed down during the conversion to cortexcache.RedisConfig?

kaiohenricunha commented 3 days ago

Would this then be categorized as a bug, since insecure_skip_verify and tls_enabled are not passed down during the conversion to cortexcache.RedisConfig?

I guess so. Any workaround? I'm having the same issue with the Thanos Store index cache when applying this configuration:

indexCacheConfig: |
  addr:        # Redis address and port
  db: 0                                     # Redis database (default: 0)
  dial_timeout: 20s                # Dial timeout (default: 5s)
  read_timeout: 20s                # Read timeout (default: 3s)
  write_timeout: 20s              # Write timeout (default: 3s)

  # Concurrency and batch settings
  max_get_multi_concurrency: 50  # Max concurrent GET operations (default: 50)
  get_multi_batch_size: 50      # Batch size for GET operations (default: 50)
  max_set_multi_concurrency: 50  # Max concurrent SET operations (default: 50)
  set_multi_batch_size: 50      # Batch size for SET operations (default: 50)

  # Cache and async settings
  cache_size: 128MB                 # Cache size (default: 128MB)
  max_async_buffer_size: 5000 # Async buffer size (default: 5000)
  max_async_concurrency: 20  # Async concurrency (default: 20)

  # Circuit breaker settings
    enabled: true      # Circuit breaker enabled (default: true)
    half_open_max_requests: 10  # Max requests during half-open state (default: 10)
    open_duration: 5s    # Circuit breaker open duration (default: 5s)
    min_requests: 50      # Min requests before tracking failures (default: 50)
    consecutive_failures: 5      # Consecutive failures to open circuit breaker (default: 5)
    failure_percent: 0.05 # Failure percentage to open circuit breaker (default: 0.05 or 5%)

  # Caching specific items (empty by default)
  enabled_items: [""]            # Default: empty (all items cached by default). Possible values: Postings, Series, ExpandedPostings.

  # Time-to-live (TTL) for cached items
  ttl: 30m                                 # TTL for cached index items (default: 30 minutes)

I use the Bitnami Helm chart.