nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Messages not delivered for sourced stream with Interest policy when destination cluster was down for some time #5031

Open vazhem opened 9 months ago

vazhem commented 9 months ago

Observed behavior

Configuration: streamA(source) -> streamB

Configured a supercluster with JetStream enabled. streamA is hosted on clusterA (3 nodes), streamB is hosted on clusterB (3 nodes). streamA is a source for streamB. streamB has consumers with DeliverAllPolicy, AckExplicitPolicy, MaxDeliver=-1, MaxAckPending=1, InactiveThreshold=-1.
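
For reference, a minimal sketch of this topology using the nats.go client. The URLs, the subject events.>, and the durable name "workers" are illustrative and not taken from the report; InactiveThreshold=-1 is kept as stated above.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to each cluster (URLs illustrative).
	ncA, err := nats.Connect("nats://clusterA-host:4222")
	if err != nil {
		log.Fatal(err)
	}
	jsA, _ := ncA.JetStream()

	// Origin stream on clusterA with Interest retention.
	if _, err := jsA.AddStream(&nats.StreamConfig{
		Name:      "streamA",
		Subjects:  []string{"events.>"},
		Retention: nats.InterestPolicy,
		Replicas:  3,
	}); err != nil {
		log.Fatal(err)
	}

	ncB, err := nats.Connect("nats://clusterB-host:4222")
	if err != nil {
		log.Fatal(err)
	}
	jsB, _ := ncB.JetStream()

	// Sourced stream on clusterB that pulls from streamA.
	if _, err := jsB.AddStream(&nats.StreamConfig{
		Name:     "streamB",
		Sources:  []*nats.StreamSource{{Name: "streamA"}},
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}

	// Consumer settings as reported.
	if _, err := jsB.AddConsumer("streamB", &nats.ConsumerConfig{
		Durable:           "workers",
		DeliverPolicy:     nats.DeliverAllPolicy,
		AckPolicy:         nats.AckExplicitPolicy,
		MaxDeliver:        -1,
		MaxAckPending:     1,
		InactiveThreshold: -1,
	}); err != nil {
		log.Fatal(err)
	}
}
```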

Scenario:

  1. shutdown clusterB
  2. write messages to clusterA (the messages are dropped rather than retained until streamB comes back online)
  3. start clusterB
  4. no messages are delivered to streamB

Messages written to clusterA while clusterB is offline are never delivered to streamB after clusterB comes back online.
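
A rough sketch of steps 2 and 4, continuing from the configuration sketch above (same imports; jsA and jsB are the two JetStream contexts). The subject and message count are illustrative:

```go
// publishDuringOutage performs step 2: write to the origin stream while
// clusterB is shut down. With Interest retention and the source consumer
// gone, streamA ends up holding nothing (per the report).
func publishDuringOutage(jsA nats.JetStreamContext) {
	for i := 0; i < 10; i++ {
		if _, err := jsA.Publish("events.test", []byte("payload")); err != nil {
			log.Fatal(err)
		}
	}
	infoA, err := jsA.StreamInfo("streamA")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("streamA msgs after publish:", infoA.State.Msgs)
}

// checkAfterRestart performs step 4: once clusterB is back, streamB's
// last sequence shows the outage messages were never sourced.
func checkAfterRestart(jsB nats.JetStreamContext) {
	infoB, err := jsB.StreamInfo("streamB")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("streamB last seq:", infoB.State.LastSeq)
}
```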

Additional notes:

Expected behavior

Messages should be retained on the source stream streamA while clusterB is offline, and then delivered to streamB when clusterB comes back online.

Server and client version

embedded nats server v2.10.9

Host environment

No response

Steps to reproduce

No response

vazhem commented 8 months ago

Any comments or plans to fix this issue?

derekcollison commented 8 months ago

If the upstream is interest-based, the consumer that is pulling for the downstream stream will go away on an extended disconnect, and the messages will be removed.

We plan on taking a look at possibly improving this in 2.11 or 2.12.

Depending on your exact use case, there are possible workarounds.

vazhem commented 8 months ago

Our use case is to ensure message delivery to a remote cluster even in the case of extended downtime of that cluster. The assumption is that if clusterB is down and cannot consume messages, they should not be removed from the stream on clusterA other than by the stream's size or count limits. Currently we have to use the Limits retention policy as a workaround, but its drawback is that space for old, already-consumed messages is always used, even when clusterB is healthy.
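
The interim workaround described here, sketched for completeness (same imports as the earlier sketch, plus time). Retention cannot be changed on an existing stream, so the choice has to be made when streamA is first created; the limit values are illustrative:

```go
// declareWithLimits creates streamA with Limits retention instead of
// Interest: messages survive a clusterB outage up to the configured
// age/size caps, at the cost of keeping already-sourced messages
// around even while clusterB is healthy.
func declareWithLimits(jsA nats.JetStreamContext) error {
	_, err := jsA.AddStream(&nats.StreamConfig{
		Name:      "streamA",
		Subjects:  []string{"events.>"},
		Retention: nats.LimitsPolicy,
		MaxAge:    24 * time.Hour,         // must cover the longest expected outage
		MaxBytes:  8 * 1024 * 1024 * 1024, // 8 GiB cap, illustrative
		Replicas:  3,
	})
	return err
}
```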

derekcollison commented 8 months ago

Yes, understood. Again, we plan on improving this for sure.

One workaround may be to register a durable consumer with AckAll on the origin stream, driven from the remote cluster B side. It periodically wakes up, checks the stream info of the remote downstream stream, consumes from the origin stream (this can be headers-only to avoid shipping all message traffic twice), and then acks the sequence that matches the last sequence of the remote stream. A sketch follows below.
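
A minimal sketch of this workaround, assuming the nats.go client (same imports as the configuration sketch above, plus time). The durable name "trimmer", the batch size, and the wait time are illustrative; run the function periodically, e.g. from a ticker:

```go
// trimOrigin acks (and thereby lets the interest-based origin stream
// delete) everything that streamB has already received. jsA must be
// able to reach the origin stream, jsB the sourced stream.
func trimOrigin(jsA, jsB nats.JetStreamContext) error {
	// How far has the downstream, sourced stream caught up?
	infoB, err := jsB.StreamInfo("streamB")
	if err != nil {
		return err
	}
	caughtUp := infoB.State.LastSeq

	// Durable AckAll pull consumer on the origin stream. Headers-only
	// avoids shipping every payload over the wire a second time.
	// Note: do not call Unsubscribe -- in nats.go that deletes the durable.
	sub, err := jsA.PullSubscribe("", "trimmer",
		nats.BindStream("streamA"),
		nats.AckAll(),
		nats.HeadersOnly(),
	)
	if err != nil {
		return err
	}

	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(2*time.Second))
		if err == nats.ErrTimeout {
			return nil // nothing pending to trim
		}
		if err != nil {
			return err
		}

		reachedEnd := false
		var last *nats.Msg
		for _, m := range msgs {
			meta, err := m.Metadata()
			if err != nil {
				return err
			}
			if meta.Sequence.Stream > caughtUp {
				reachedEnd = true // past what streamB holds; stop trimming
				break
			}
			last = m
		}
		if last != nil {
			// With AckAll, acking this one message acknowledges everything
			// up to and including it, so the origin stream can drop them.
			if err := last.Ack(); err != nil {
				return err
			}
		}
		// Fetched-but-unacked messages are simply redelivered on a later run.
		if reachedEnd || len(msgs) < 100 {
			return nil
		}
	}
}
```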

This will work well until we offer a fix to the issue in the server.