nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.92k stars 1.41k forks source link

Ephemeral consumer not receiving message when connected to a certain consumer leader #4421

Closed debajyoti-truefoundry closed 4 months ago

debajyoti-truefoundry commented 1 year ago

What version were you using?

nats-server: v2.9.8 Clients

What environment was the server running in?

Image: nats:2.9.8-alpine OS: Linux (amd64) Helm Values:

  nats:
    logging:
      debug: false
    advertise: false
    jetstream:
      enabled: true
      memStorage:
        size: 1Gi
        enabled: true
      fileStorage:
        size: 10Gi
        enabled: true
        accessModes:
          - ReadWriteOnce
        storageDirectory: /data
    resources:
      limits:
        cpu: 400m
        memory: 512Mi
      requests:
        cpu: 200m
        memory: 256Mi
  cluster:
    enabled: true
    replicas: 3
    noAdvertise: true
  enabled: true
  websocket:
    port: 443
    noTLS: true
    enabled: true
    sameOrigin: false
    allowedOrigins: []

Is this defect reproducible?

Unfortunately issue is not reproducible. I have described my observations on the issue below.

I have a stream with the following info,

{
  "config": {
    "name": "tfy-agent",
    "subjects": [
      "tfy-agent.\u003e"
    ],
    "retention": "limits",
    "max_consumers": -1,
    "max_msgs_per_subject": 1,
    "max_msgs": -1,
    "max_bytes": -1,
    "max_age": 7200000000000,
    "max_msg_size": -1,
    "storage": "file",
    "discard": "old",
    "num_replicas": 3,
    "duplicate_window": 120000000000,
    "sealed": false,
    "deny_delete": false,
    "deny_purge": false,
    "allow_rollup_hdrs": true,
    "allow_direct": false,
    "mirror_direct": false
  },
  "created": "2022-12-06T13:48:48.237779254Z",
  "state": {
    "messages": 2059,
    "bytes": 2507895,
    "first_seq": 453880922,
    "first_ts": "2023-08-23T01:40:23.903614866Z",
    "last_seq": 454064480,
    "last_ts": "2023-08-23T02:23:53.787500686Z",
    "num_deleted": 181500,
    "num_subjects": 2059,
    "consumer_count": 872
  },
  "cluster": {
    "name": "nats",
    "leader": "truefoundry-nats-2",
    "replicas": [
      {
        "name": "truefoundry-nats-0",
        "current": true,
        "active": 38268842
      },
      {
        "name": "truefoundry-nats-1",
        "current": true,
        "active": 37891442
      }
    ]
  }
}

I create an ephemeral consumer,

{"config":{"deliver_policy":"all","ack_policy":"explicit","ack_wait":5000000000,"replay_policy":"instant","name":"","max_ack_pending":1,"deliver_subject":"_INBOX.ZCMDIWKCK7HMG8PDQEAH96","filter_subject":"tfy-agent.foo.bar"},"stream_name":"tfy-agent"}

And get back this response.

{"type":"io.nats.jetstream.api.v1.consumer_create_response","stream_name":"tfy-agent","name":"SOxTA5HB","created":"2023-08-22T09:26:01.882169339Z","config":{"deliver_policy":"all","ack_policy":"explicit","ack_wait":5000000000,"max_deliver":-1,"filter_subject":"tfy-agent.foo.bar","replay_policy":"instant","max_ack_pending":1,"deliver_subject":"_INBOX.ZCMDIWKCK7HMG8PDQEAH96","inactive_threshold":5000000000,"num_replicas":0},"delivered":{"consumer_seq":0,"stream_seq":450100821},"ack_floor":{"consumer_seq":0,"stream_seq":0},"num_ack_pending":0,"num_redelivered":0,"num_waiting":0,"num_pending":0,"cluster":{"name":"nats","leader":"truefoundry-nats-0"},"push_bound":true}

Even though I have a message with the subject tfy-agent.foo.bar on the stream I do not get back any message.

Upon checking the consumer info, this is what I can see,

Cluster Information:

                Name: nats
              Leader: truefoundry-nats-0

State:

   Last Delivered Message: Consumer sequence: 0 Stream sequence: 450,622,760
     Acknowledgment floor: Consumer sequence: 0 Stream sequence: 0
         Outstanding Acks: 0 out of maximum 1
     Redelivered Messages: 0
     Unprocessed Messages: 0
          Active Interest: Active

Upon further investigation, I observed that whenever the leader for the ephemeral consumer is truefoundry-nats-0 pod, I do not get any message. I used the code snippet below to verify this,

async function natsIssue(): Promise<void> {
  const textEncoder = new TextEncoder();
  const subjectsToSubscribe = [
    `tfy-agent.foo.bar`,
    `tfy-agent.foo.baz`,
    `tfy-agent.foo.abc`,
  ];
  const natsConnection = await connect({
    servers: '',
    authenticator: jwtAuthenticator(
      ``,
      textEncoder.encode(``),
    ),
  });

  const ackPending = new Map<string, number>([
    ['truefoundry-nats-0', 0],
    ['truefoundry-nats-1', 0],
    ['truefoundry-nats-2', 0],
  ]);
  const leaderAssignment = new Map<string, number>([
    ['truefoundry-nats-0', 0],
    ['truefoundry-nats-1', 0],
    ['truefoundry-nats-2', 0],
  ]);

  const jetStreamClient = natsConnection.jetstream();

  let counter = 0;
  while (counter < 100) {
    const updates = subjectsToSubscribe.map(async (subject: string) => {
      const opts = consumerOpts();
      opts.ackExplicit();
      opts.ackWait(20 * 1000);
      opts.maxAckPending(1);
      opts.deliverAll();
      opts.deliverTo(createInbox());

      const sub = await jetStreamClient.subscribe(subject, opts);
      const consumerInfo = await sub.consumerInfo();
      const leader = consumerInfo.cluster.leader;
      const numAckPending = consumerInfo.num_ack_pending;
      ackPending.set(leader, ackPending.get(leader) + numAckPending);
      leaderAssignment.set(leader, leaderAssignment.get(leader) + 1);
      sub.drain();
    });
    await Promise.all(updates);
    console.log('ackPending', ackPending);
    console.log('leaderAssignment', leaderAssignment);
    console.log('======');
    counter++;
  }
  process.exit(0);
}

I get the output below in the end,

ackPending Map(3) {
  'truefoundry-nats-0' => 0,
  'truefoundry-nats-1' => 91,
  'truefoundry-nats-2' => 110
}
leaderAssignment Map(3) {
  'truefoundry-nats-0' => 99,
  'truefoundry-nats-1' => 91,
  'truefoundry-nats-2' => 110
}

This issue was resolved after truefoundry-nats-0 restarted.

Given the capability you are leveraging, describe your expectation?

I should be getting messages regardless of the ephemeral consumer leader.

Given the expectation, what is the defect you are observing?

I do not get messages if the ephemeral consumer leader is the truefoundry-nats-0 pod.

derekcollison commented 1 year ago

You might consider upgrading your server to the latest, which is 2.9.21 and we are about to release 2.9.22.

debajyoti-truefoundry commented 1 year ago

You might consider upgrading your server to the latest, which is 2.9.21 and we are about to release 2.9.22.

Thanks. Can I just update the nats.image.tag field in my helm chart? Or is there an upgrade procedure? I am currently using chart version 0.19.1.

ripienaar commented 1 year ago

Suggest you read release notes, at least https://github.com/nats-io/nats-server/releases/tag/v2.0.0 and also 2.1.0 ones.

We cannot tell you what will happen with such a large jump in versions, you have to test this in dev or staging.

debajyoti-truefoundry commented 4 months ago

I am closing this issue as the upgrade was smooth and solved by issue.