redis / go-redis

Redis Go client
https://redis.uptrace.dev
BSD 2-Clause "Simplified" License

go-redis/v9: Incr and Ping commands constantly return EOF after about a day of successful periodic calls (one every few seconds) #2939

Open chitturs opened 3 months ago

chitturs commented 3 months ago


Client: Go, using go-redis v9. Server: Azure Redis Cache.

The client reads messages from Azure Service Bus and then updates the Redis cache using Incr(). This repeats indefinitely, with a call every few seconds. Everything works for about a day; after that, Incr() returns EOF and never succeeds again. At that point, Ping() also returns EOF.

Expected Behavior

No failures in Incr() or Ping().

Current Behavior

Incr() and Ping() fail with EOF.

Possible Solution

It is unclear what the problem is; the suspicion is a broken connection. Because calls are made every few seconds, the connection should never sit idle long enough for the server's idle timeout (10 minutes) to kick in.

Steps to Reproduce

import (
	"context"
	"crypto/tls"
	"errors"
	"fmt"
	"strings"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
	"github.com/redis/go-redis/v9"
)

// newRedisClient connects to Azure Redis Cache over TLS, authenticating with
// the managed identity's object ID and an access token as the password.
// The caller passes *getRes.Properties.SSLPort as sslPort.
func newRedisClient(redisHostName string, sslPort int32, msiObjectId, token string) *redis.Client {
	op := &redis.Options{
		Addr:      fmt.Sprintf("%s:%d", redisHostName, sslPort),
		Username:  msiObjectId,
		Password:  token,
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	return redis.NewClient(op)
}

// processMessages increments one counter per received Service Bus message.
func processMessages(ctx context.Context, redisclient *redis.Client, messages []*azservicebus.ReceivedMessage) {
	for _, message := range messages {
		// The message body is the cache key, possibly wrapped in double quotes.
		redisCacheKey := strings.Trim(string(message.Body), `"`)

		result, err := redisclient.Incr(ctx, redisCacheKey).Result()
		if err != nil {
			if errors.Is(err, context.Canceled) {
				fmt.Println("context was canceled while incrementing cache, return")
				return
			}
			// After about a day, this is where EOF starts appearing.
			fmt.Printf("redis cache increment failed for key %s, err %v\n", redisCacheKey, err)

			// Check if the connection is still alive, https://redis.io/commands/ping/.
			pong, err := redisclient.Ping(ctx).Result()
			fmt.Printf("redis cache ping response %s, err %v, timeouts %d\n",
				pong, err, redisclient.PoolStats().Timeouts)
			continue
		}

		fmt.Println("Key:", redisCacheKey, "Result:", result)
	}
}

Context (Environment)

My app running on AKS is partially broken. The Redis cache stores statistics about other activities in the cluster, and those statistics stop updating after about a day.

chriscasola commented 3 weeks ago

We are seeing similar behavior. Our service has a status check that calls redisClient.Set(ctx, "status-key", 1, 0), and eventually that call starts returning EOF. Once the EOF appears it also affects our other Redis calls, but things seem to recover on their own without restarting the service.

I'm not sure how to debug this, but I also suspect a connection in the pool that is in a bad state.

Should it be possible for a connection in the pool to be unusable?

Update: we run this service in two environments, one with Redis 7.0.11 and one with Redis 6.2.11. The problem only happens in the Redis 7.0.11 environment.