redis / ioredis

🚀 A robust, performance-focused, and full-featured Redis client for Node.js.

ERR MOVED with ioredis 5.3.1 when connected to ElastiCache Serverless #1865

Open · jdwit opened this issue 8 months ago

jdwit commented 8 months ago

Occasionally the following error is reported:

{
    "command": {
        "args": [
            "<key_redacted>",
            "<value_redacted>",
            "EX",
            "3600"
        ],
        "name": "set"
    },
    "level": "error",
    "message": "ERR MOVED 7572 <redacted>-aud8py.serverless.euw1.cache.amazonaws.com:6379",
    "stack": "ReplyError: ERR MOVED 7572 <redacted>-aud8py.serverless.euw1.cache.amazonaws.com:6379\n    at parseError (/app/node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:179:12)\n    at parseType (/app/node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:302:14)",
    "timestamp": "2024-03-06 08:33:58"
}

When the app container is restarted (and a new Redis.Cluster client instance is created) the problem does not occur anymore.

The client is created as follows:

import Redis from "ioredis"

// connectionString comes from app configuration
const url = new URL(connectionString)
const client = new Redis.Cluster(
  [
    {
      host: url.hostname,
      port: parseInt(url.port, 10)
    }
  ],
  {
    slotsRefreshTimeout: 5000,
    retryDelayOnMoved: 1000,
    maxRedirections: 64,
    // options required when connecting to ElastiCache:
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: { tls: {} }
  }
)

I added the retryDelayOnMoved setting after this was first reported, but it does not seem to have any effect.

Could this be an issue with the ioredis client? The message ERR MOVED 7572 <host>:<port> seems completely valid; see the Moved Redirection section of the Redis Cluster docs.
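That said, re-reading the spec: a standard redirection error is literally MOVED <slot> <host>:<port>, with no ERR prefix, while the message above starts with ERR MOVED. If the client recognizes redirections by the first token of the error message, the ERR variant would not be treated as a redirection at all. A hypothetical helper, just to illustrate the shape of the reply:

// Hypothetical helper, only to illustrate the reply shape. It accepts both
// the standard "MOVED <slot> <host>:<port>" form and the "ERR MOVED ..."
// variant logged above.
function parseMoved(message) {
  const match = /(?:^|\s)MOVED (\d+) (.+):(\d+)$/.exec(message)
  if (!match) return null
  return { slot: Number(match[1]), host: match[2], port: Number(match[3]) }
}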

Could this be related to this note: AWS ElastiCache Clusters with TLS?

Any help is welcome. This issue is hard for us to debug since we are not able to reproduce it.

adamshugar commented 8 months ago

I'm having the same problem with ioredis 5.3.2:

I believe it's because every 11 days or so, ElastiCache Serverless closes the underlying connection (resulting in ECONNRESET) and swaps in a new underlying node available at the same <host>:<port>. But ioredis assumes this can't happen, hence retryStrategy is always null in cluster mode (source). Note that I already tried retryStrategy and reconnectOnError, both to no avail. A stopgap sketch is at the end of this comment.

I ran with debug logs enabled and got this:

# GET and SET working fine before this point

2024-03-13T11:48:01.363Z ioredis:connection error: Error: read ECONNRESET
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}
2024-03-13T11:48:01.368Z ioredis:redis status[<REDACTED>:6379]: ready -> close
2024-03-13T11:48:01.368Z ioredis:connection skip reconnecting because `retryStrategy` is not a function
2024-03-13T11:48:01.368Z ioredis:redis status[<REDACTED>:6379]: close -> end
2024-03-13T11:48:01.369Z ioredis:cluster:connectionPool Remove <REDACTED>.serverless.use2.cache.amazonaws.com:6379 from the pool
2024-03-13T11:48:01.369Z ioredis:cluster:subscriber subscriber has left, selecting a new one...
2024-03-13T11:48:01.370Z ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(subscriber))]: wait -> close
2024-03-13T11:48:01.370Z ioredis:connection skip reconnecting since the connection is manually closed.
2024-03-13T11:48:01.370Z ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(subscriber))]: close -> end
2024-03-13T11:48:01.370Z ioredis:cluster:subscriber selected a subscriber <REDACTED>.serverless.use2.cache.amazonaws.com:6380
2024-03-13T11:48:01.371Z ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6380 (ioredis-cluster(subscriber))]: wait -> wait

# GET returning null, and SET throwing error past this point

Errors after this point look like:

{
  "name": "ReplyError",
  "message": "ERR MOVED 7351 <REDACTED>.serverless.use2.cache.amazonaws.com:6379",
  "stack": [
    "parseError (node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:179:12)",
    "parseType (node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:302:14)"
  ]
}

Due to writes now being routed to the remaining node on port 6380 (the :6379 node was removed from the pool and never re-added):

2024-03-14T05:14:47.495Z ioredis:redis write command[<REDACTED>:6380]: 0 -> set([ ... ])

Minimal Reproduction

import http from "http";
import net from "net";
import { parse } from "url";
import { getRedisClient } from "your-redis-client-constructor-here";

const hostname = "127.0.0.1";
const port = 4020;

// Needs to be Redis.Cluster
const redisClient = getRedisClient();

const econnresetAllNodes = (clusterClient) => {
  const allNodes = Object.values(clusterClient.connectionPool.nodes.all);
  for (const node of allNodes) {
    if (!node.stream) continue;
    // Even though it's named "stream", it's actually a NodeJS socket
    node.stream.destroy();
  }
};

const server = http.createServer((req, res) => {
  const url = parse(req.url, true);

  if (req.method === "PUT" && url.pathname === "/connreset") {
    req.on("data", (chunk) => {});
    req.on("end", () => {
      econnresetAllNodes(redisClient);
      res.statusCode = 200;
      res.setHeader("Content-Type", "text/plain");
      res.end("Connection reset successfully\n");
    });
  } else {
    res.statusCode = 404;
    res.end("Not Found\n");
  }
});

server.listen(port, hostname, () => {
  console.log(`Server running at http://${hostname}:${port}/`);
});

Then, after starting the server with DEBUG=ioredis:* node index.js, run curl -X PUT localhost:4020/connreset. The same issue results:

ioredis:redis status[<REDACTED>:6379]: ready -> close +3s
ioredis:connection skip reconnecting because `retryStrategy` is not a function +3s
ioredis:redis status[<REDACTED>:6379]: close -> end +1ms
ioredis:cluster:connectionPool Remove <REDACTED>.serverless.use2.cache.amazonaws.com:6379 from the pool +3s
ioredis:cluster:subscriber subscriber has left, selecting a new one... +3s
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(subscriber))]: wait -> close +0ms
ioredis:connection skip reconnecting since the connection is manually closed. +1ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(subscriber))]: close -> end +0ms
ioredis:cluster:subscriber selected a subscriber <REDACTED>.serverless.use2.cache.amazonaws.com:6380 +1ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6380 (ioredis-cluster(subscriber))]: wait -> wait +1ms

Initial Connection Logs (if helpful)

ioredis:cluster status: [empty] -> connecting +0ms
ioredis:cluster resolved hostname <REDACTED>.serverless.use2.cache.amazonaws.com to IP <REDACTED>.serverless.use2.cache.amazonaws.com +3ms
ioredis:cluster:connectionPool Reset with [
ioredis:cluster:connectionPool   {
ioredis:cluster:connectionPool     port: 6379,
ioredis:cluster:connectionPool     host: '<REDACTED>.serverless.use2.cache.amazonaws.com',
ioredis:cluster:connectionPool     tls: {},
ioredis:cluster:connectionPool     reconnectOnError: [Function: reconnectOnError]
ioredis:cluster:connectionPool   }
ioredis:cluster:connectionPool ] +0ms
ioredis:cluster:connectionPool Connecting to <REDACTED>.serverless.use2.cache.amazonaws.com:6379 as master +3ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379]: wait -> wait +0ms
ioredis:cluster getting slot cache from <REDACTED>.serverless.use2.cache.amazonaws.com:6379 +7ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(refresher))]: wait -> wait +1ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(refresher))]: wait -> connecting +1ms
ioredis:redis queue command[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(refresher))]: 0 -> cluster([ 'SLOTS' ]) +1ms
ioredis:cluster:subscriber selected a subscriber <REDACTED>.serverless.use2.cache.amazonaws.com:6379 +0ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379 (ioredis-cluster(subscriber))]: wait -> wait +1ms
ioredis:cluster:subscriber started +0ms
ioredis:redis status[<REDACTED>:6379 (ioredis-cluster(refresher))]: connecting -> connect +62ms
ioredis:redis write command[<REDACTED>:6379 (ioredis-cluster(refresher))]: 0 -> auth([ 'default', '' ]) +2ms
ioredis:redis status[<REDACTED>:6379 (ioredis-cluster(refresher))]: connect -> ready +1ms
ioredis:connection set the connection name [ioredis-cluster(refresher)] +0ms
ioredis:redis write command[<REDACTED>:6379 (ioredis-cluster(refresher))]: 0 -> client([ 'setname', 'ioredis-cluster(refresher)' ]) +0ms
ioredis:connection send 1 commands in offline queue +0ms
ioredis:redis write command[<REDACTED>:6379 (ioredis-cluster(refresher))]: 0 -> cluster([ 'SLOTS' ]) +1ms
ioredis:cluster cluster slots result count: 1 +86ms
ioredis:cluster cluster slots result [0]: slots 0~16383 served by [
'<REDACTED>.serverless.use2.cache.amazonaws.com:6379',
'<REDACTED>.serverless.use2.cache.amazonaws.com:6380'
] +0ms
ioredis:cluster:connectionPool Reset with [
ioredis:cluster:connectionPool   {
ioredis:cluster:connectionPool     host: '<REDACTED>.serverless.use2.cache.amazonaws.com',
ioredis:cluster:connectionPool     port: 6379,
ioredis:cluster:connectionPool     readOnly: false
ioredis:cluster:connectionPool   },
ioredis:cluster:connectionPool   {
ioredis:cluster:connectionPool     host: '<REDACTED>.serverless.use2.cache.amazonaws.com',
ioredis:cluster:connectionPool     port: 6380,
ioredis:cluster:connectionPool     readOnly: true
ioredis:cluster:connectionPool   }
ioredis:cluster:connectionPool ] +103ms
ioredis:cluster:connectionPool Connecting to <REDACTED>.serverless.use2.cache.amazonaws.com:6380 as slave +1ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6380]: wait -> wait +33ms
ioredis:cluster status: connecting -> connect +17ms
ioredis:redis status[<REDACTED>.serverless.use2.cache.amazonaws.com:6379]: wait -> connecting +2ms
ioredis:redis queue command[<REDACTED>.serverless.use2.cache.amazonaws.com:6379]: 0 -> cluster([ 'INFO' ]) +0ms
ioredis:redis status[<REDACTED>:6379 (ioredis-cluster(refresher))]: ready -> close +3ms
ioredis:connection skip reconnecting since the connection is manually closed. +39ms
ioredis:redis status[<REDACTED>:6379 (ioredis-cluster(refresher))]: close -> end +0ms
ioredis:redis status[<REDACTED>:6379]: connecting -> connect +14ms
ioredis:redis write command[<REDACTED>:6379]: 0 -> auth([ 'default', '' ]) +1ms
ioredis:redis write command[<REDACTED>:6379]: 0 -> info([]) +0ms
ioredis:redis status[<REDACTED>:6379]: connect -> ready +3ms
ioredis:connection send 1 commands in offline queue +19ms
ioredis:redis write command[<REDACTED>:6379]: 0 -> cluster([ 'INFO' ]) +1ms
ioredis:cluster status: connect -> ready +25ms
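Before giving up, one stopgap I sketched (all names here, like createSelfHealingCluster and onSwap, are made up; this papers over rather than fixes the root cause): watch for nodes leaving the pool, and rebuild the whole client once no masters remain, so the slot cache and connection pool start fresh.

import Redis from "ioredis";

// Sketch of a workaround, not a fix: ioredis's Cluster emits "-node" when a
// node is dropped from the connection pool. When the last master is gone (the
// failure mode in the logs above, where writes then land on the replica and
// bounce with MOVED), tear the client down and build a new one.
function createSelfHealingCluster(startupNodes, options, onSwap) {
  const build = () => {
    const client = new Redis.Cluster(startupNodes, options);
    client.on("-node", () => {
      if (client.nodes("master").length > 0) return;
      client.quit().catch(() => client.disconnect());
      onSwap(build()); // callers must switch to the new instance
    });
    return client;
  };
  return build();
}

It's blunt: every command has to go through whatever instance onSwap last delivered, and it does nothing about the ERR MOVED replies themselves.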
adamshugar commented 8 months ago

On further investigation, we decided to move to node-redis; it seems like this project is now in maintenance mode.

Ethan-99 commented 7 months ago

@adamshugar Thanks for the comment; I'm having a similar problem and came to the same conclusion. Would you be willing to post your working node-redis client config for ElastiCache Serverless in this thread?
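For reference while waiting, a minimal node-redis (v4) cluster setup for a TLS ElastiCache Serverless endpoint might look like this, going by the node-redis docs (a sketch, not adamshugar's actual config; <endpoint> is a placeholder):

import { createCluster } from "redis";

// "defaults" apply to every node the client discovers, so TLS has to be set
// there as well as in the seed URL.
const client = createCluster({
  rootNodes: [{ url: "rediss://<endpoint>:6379" }],
  defaults: {
    socket: { tls: true },
  },
  useReplicas: false,
});

client.on("error", (err) => console.error("Redis Cluster error", err));

await client.connect();
await client.set("key", "value", { EX: 3600 });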

sirrodgepodge commented 4 months ago

I'm using BullMQ, which has unfortunately decided not to move: https://github.com/taskforcesh/bullmq/issues/2490

Wondering if anyone has solved this one...
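Partial answer to my own question: BullMQ does accept a pre-built ioredis instance as its connection, so a client-level workaround (like adamshugar's rebuild sketch above) can at least be wired in; it does nothing about the underlying MOVED problem. A rough sketch, with <endpoint> as a placeholder:

import { Queue } from "bullmq";
import Redis from "ioredis";

// BullMQ requires maxRetriesPerRequest: null for workers; setting it on any
// shared connection avoids surprises.
const connection = new Redis.Cluster(
  [{ host: "<endpoint>", port: 6379 }],
  {
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: { tls: {}, maxRetriesPerRequest: null },
  }
);

const queue = new Queue("jobs", { connection });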