nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

NATS Message Consumption Issue After Pod (NATS cluster) Restart in OpenShift #6025

Open mohamedsaleem18 opened 1 month ago

mohamedsaleem18 commented 1 month ago

Observed behavior

The NATS Alpine image (NATS 2.10.19) with JetStream enabled is deployed as a three-node cluster in a Red Hat OpenShift environment. A headless service is exposed so that applications deployed in the same OpenShift cluster can connect.

Whenever the NATS cluster pods are restarted with a rolling update, the connected application can still publish messages successfully but can no longer consume them. The client application has to be restarted to resolve this. Can you please provide a resolution for this problem?

The application client (Java) uses the following seed URLs to connect to the NATS cluster for publishing and subscribing to messages: nats://nats-0.nats-headless.ws-nats:4222,nats://nats-1.nats-headless.ws-nats:4222,nats://nats-2.nats-headless.ws-nats:4222

The NATS servers in the cluster use the following URLs in nats-server.config to form the cluster: nats://nats-0.nats-headless.ws-nats:6222,nats://nats-1.nats-headless.ws-nats:6222,nats://nats-2.nats-headless.ws-nats:6222
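For anyone comparing the two URL lists above, here is a minimal sketch (plain JDK, no NATS client library) that splits a comma-separated seed list and checks that each entry is a well-formed `nats://host:port` URL. The class name `SeedUrls` and the method `parse` are hypothetical helpers written for illustration; they are not part of the NATS Java client.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any NATS library.
final class SeedUrls {
    /** Splits a comma-separated seed list and validates scheme, host, and port of each entry. */
    static List<URI> parse(String csv) {
        List<URI> out = new ArrayList<>();
        for (String part : csv.split(",")) {
            String s = part.trim();
            if (s.isEmpty()) {
                continue; // tolerate stray commas
            }
            URI u = URI.create(s); // throws IllegalArgumentException on malformed input
            if (!"nats".equals(u.getScheme()) || u.getHost() == null || u.getPort() == -1) {
                throw new IllegalArgumentException("not a nats://host:port URL: " + s);
            }
            out.add(u);
        }
        return out;
    }
}
```

Calling `SeedUrls.parse(...)` on the client list should yield three entries on port 4222, and on the cluster list three entries on port 6222, which makes it easy to confirm that the application is pointed at the client port and not the cluster port.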

Expected behavior

  1. The application should maintain its connection to the NATS cluster without needing to restart, even when pods are restarted or updated.

  2. The application should be able to publish messages to the NATS cluster successfully during and after the rolling updates of the pods.

  3. The application should be able to consume messages from the NATS cluster without interruption, receiving any messages that were published while it was connected.

  4. If a connection is lost due to a pod restart, the client should automatically attempt to reconnect to the NATS server.
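To make expectation 4 concrete, here is a rough sketch of the reconnect behavior described above: cycle through the seed servers until one accepts a connection, pausing between passes. This is a generic illustration under stated assumptions, not how the NATS Java client implements reconnection. `ReconnectSketch`, `reconnect`, `tryConnect`, `maxRounds`, and `waitMillis` are all hypothetical names, and the connection attempt is abstracted as a `Predicate` so the sketch runs without a server.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration; the real client library has its own reconnect logic.
final class ReconnectSketch {
    /**
     * Tries each server in order, repeating the whole list up to maxRounds times.
     * Returns the first server for which tryConnect reports success, or empty if none did.
     */
    static Optional<String> reconnect(List<String> servers, Predicate<String> tryConnect,
                                      int maxRounds, long waitMillis) {
        for (int round = 0; round < maxRounds; round++) {
            for (String server : servers) {
                if (tryConnect.test(server)) {
                    return Optional.of(server);
                }
            }
            if (round < maxRounds - 1) {
                try {
                    Thread.sleep(waitMillis); // pause before the next pass over the list
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

Note that this only covers re-establishing the connection. It does not model recovery of subscriptions or outstanding pull requests, which is where the symptom in this report shows up (publishing works after the restart, consuming does not).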

Server and client version

Server: NATS Alpine image (NATS 2.10.19). Client: NATS Java client.

Host environment

Red Hat OpenShift (on-premise)

Steps to reproduce

No response

neilalexander commented 1 month ago

Please can you provide the output of `nats stream info` and `nats consumer info` for the assets in question?

mohamedsaleem18 commented 1 month ago

Stream info

`? Select a Stream lpn
Information for Stream lpn created 2024-10-08 23:18:23

          Subjects: oe.lpn.dt, lpnDt, receiving.eb.lpn-d-req, eb.receiving.lpn-res-dtls, oe.wcl.palln, oe.wbl.grp-compe, oe.wsbl.grp-cnt, wsbl.oe.stas.tring, wsbl.oe.stas.palln, wsbl.receiving.stas.dck-dor, wsbl.oe.grp-frce-clse, wsbl.srting.stas.trng, wsbl.srting.stas.palln, wsbl.autoputaway.stas.mild, wsbl.oe.dtn-req, oe.wsbl.mild, oe.wcsbl.container-empty, oe.wsbl.vcant
          Replicas: 3
           Storage: File

Options:

         Retention: Limits
   Acknowledgments: true
    Discard Policy: Old
  Duplicate Window: 2m0s
        Direct Get: true
 Allows Msg Delete: true
      Allows Purge: true
    Allows Rollups: true

Limits:

    Maximum Messages: 100,000
 Maximum Per Subject: 100,000
       Maximum Bytes: 64 MiB
         Maximum Age: 3d0h0m0s
Maximum Message Size: 98 KiB
   Maximum Consumers: unlimited

Cluster Information:

              Name: nats
            Leader: nats-1
           Replica: nats-0, current, seen 516ms ago
           Replica: nats-2, current, seen 526ms ago

State:

          Messages: 61
             Bytes: 35 KiB
    First Sequence: 83 @ 2024-10-20 20:54:20
     Last Sequence: 143 @ 2024-10-21 11:12:39
  Active Consumers: 14
Number of Subjects: 5`

mohamedsaleem18 commented 1 month ago

Consumer info

`Information for Consumer lpn > regEnCntrStasHdler created 2024-10-08T23:38:49-05:00

Configuration:

                Name: regEnCntrStasHdler
           Pull Mode: true
      Filter Subject: wsbl.receiving.stas.dck-dor
      Deliver Policy: All
          Ack Policy: Explicit
            Ack Wait: 30.00s
       Replay Policy: Instant
     Max Ack Pending: 1,000
   Max Waiting Pulls: 512

Cluster Information:

                Name: nats
              Leader: nats-1
             Replica: nats-0, current, seen 824ms ago
             Replica: nats-2, current, seen 829ms ago

State:

  Last Delivered Message: Consumer sequence: 27 Stream sequence: 133 Last delivery: 1h56m11s ago
    Acknowledgment Floor: Consumer sequence: 27 Stream sequence: 133 Last Ack: 1h56m11s ago
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 0 of maximum 512`

mohamedsaleem18 commented 4 weeks ago

Can you please provide a resolution for this issue?

sourabhaggrawal commented 4 weeks ago

I have also faced this issue with a single-replica pod (non-clustered) and a WQ stream with no message limits and no TTL. The consumer just stopped receiving messages, and I had to restart the consumer app, after which it started consuming messages again. nats-server 2.10.11