nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.92k stars 1.41k forks source link

JetStream KV gets consumer stuck in the cluster server #5352

Closed ThalesValentim closed 6 months ago

ThalesValentim commented 6 months ago

Observed behavior

The KV consumer subscription remains active after the client reconnects to another available cluster node.

Expected behavior

Consumers should be unsubscribed if the client has reconnected to another available cluster node.

Server and client version

Host environment

Steps to reproduce

Considering that nats-server has 3 cluster servers running

  1. Run an app to watch for JetStream KV updates with a cluster mode setup, Node.js example:
const nc = await nats.connect({
    servers: ['localhost:4111','localhost:4222','localhost:4333'],
  });
  const sc = nats.StringCodec();
  const js = nc.jetstream();
  const kv = await js.views.kv("testing");
  const jsm = await nc.jetstreamManager();
  const stream = await jsm.streams.get('KV_testing');
  const { config } = await stream.info();

  await jsm.streams.update(config.name, {
    ...config,
    num_replicas: 3
  });

  const watch = await kv.watch();
  (async () => {
    for await (const e of watch) {
      // do something with the change
      console.log(
        `watch: ${e.key}: ${e.operation} ${e.value ? sc.decode(e.value) : ""}`,
      );
    }
  })();

  kv.put('hello.word', sc.encode("hi"));
  1. Check which cluster has been chosen: nats consumer report KV_testing
  2. Force container restart/node fail situation, so the client will connect to another available cluster node
  3. Check the consumer report again: nats consumer report KV_testing

Watcher connects to a new cluster node but the old subscription remains active in the previous cluster node:

Screenshot 2024-04-25 at 10 54 13
ripienaar commented 6 months ago

Can you show nats consumer info on the one that you think is an orphan?

ThalesValentim commented 6 months ago

Can you show nats consumer info on the one that you think is an orphan?

Please, follow the picture:

Screenshot 2024-04-25 at 12 39 06
ripienaar commented 6 months ago

In this case the server believes someone is subscribed to that subject (_INBOX.SRP....)

You can check with:

$ nats server req subscriptions --detail --filter-subject _INBOX.xxx

Putting the correct subject there, this should confirm if someone is subscribed and who. As long as there is a subscriber it wont be deleted

ThalesValentim commented 6 months ago

In this case the server believes someone is subscribed to that subject (_INBOX.SRP....)

You can check with:

$ nats server req subscriptions --detail --filter-subject _INBOX.xxx

Putting the correct subject there, this should confirm if someone is subscribed and who. As long as there is a subscriber it wont be deleted

I set up an account to run the command, so the delivery subject is now _INBOX.B8Z5D8WILGEI7LAIK3KQZT

nats consumer report KV_testing

Screenshot 2024-04-25 at 13 32 35

nats consumer info KV_testing

Screenshot 2024-04-25 at 13 32 56

The command output showed that _INBOX.B8Z5D8WILGEI7LAIK3KQZT is subscribed to nats2:

{
  "server": {
    "name": "nats2",
    "host": "0.0.0.0",
    "id": "NAC23VTKLWNAIIWTXRVW57IBJQNVETGVAJ7MV35JHKPBWXGJQCVZI7DU",
    "cluster": "nats",
    "ver": "2.10.14",
    "jetstream": true,
    "flags": 3,
    "seq": 93,
    "time": "2024-04-25T10:26:02.902416555Z"
  },
  "data": {
    "server_id": "NAC23VTKLWNAIIWTXRVW57IBJQNVETGVAJ7MV35JHKPBWXGJQCVZI7DU",
    "now": "2024-04-25T10:26:02.901771971Z",
    "num_subscriptions": 219,
    "num_cache": 38,
    "num_inserts": 324,
    "num_removes": 105,
    "num_matches": 488,
    "cache_hit_rate": 0.4959016393442623,
    "max_fanout": 3,
    "avg_fanout": 1.6842105263157894,
    "total": 1,
    "offset": 0,
    "limit": 2048,
    "subscriptions_list": [
      {
        "account": "$G",
        "subject": "_INBOX.B8Z5D8WILGEI7LAIK3KQZT",
        "sid": "2",
        "msgs": 95,
        "cid": 26
      }
    ]
  }
}
ripienaar commented 6 months ago

OK, so you can see the cid: 26 there, use nats server report connections to figure out where that is and then check on the client shown if its connected you should have IP and port. Use netstat to confirm.

ThalesValentim commented 6 months ago

OK, so you can see the cid: 26 there, use nats server report connections to figure out where that is and then check on the client shown if its connected you should have IP and port. Use netstat to confirm.

cid: 26 refers to the client app that is expected to have one subscription to the KV when reconnects to an available node.

Screenshot 2024-04-25 at 14 36 06

I opened the issue 688 recently, I'm wondering if could it be still a client issue?

ripienaar commented 6 months ago

Well when you reconnect quickly the old ephemeral subscription will still be there (5s seconds) so when the reconnect happens and the client detect the ephemeral is still valid it will continue to use it. This is desired behavior.

Why is that a problem, what problems does it cause?

ThalesValentim commented 6 months ago

Well when you reconnect quickly the old ephemeral subscription will still be there (5s seconds) so when the reconnect happens and the client detect the ephemeral is still valid it will continue to use it. This is desired behavior.

Why is that a problem, what problems does it cause?

The old subscription remains active forever, not only 5 seconds. That doesn't seem correct. I would expect the server to report only 1 active subscription per KV watcher.

nats consumer report KV_testing

Screenshot 2024-04-25 at 15 07 59

The KV watch client consumer B8Z5D8WILGEI7LAIK3KRR9 was created in the node nats3 because nats1 wasn't available to reconnect to it. Shouldn't B8Z5D8WILGEI7LAIK3KQSY be gone as the client might never consume it again?

ripienaar commented 6 months ago

If you have a old subscription WITHOUT interest on it - nothing is subscribed and you cant find it using the nats methods shown. Then yes it should be gone after 5 seconds, if not that is a server bug. So far we have not seen one like that though?

What you see might be a client bug though - keeping interest when it shouldnt or maybe error handling is bad and so a new sub is made without cleanup of the old one so interest is kept.