nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Messages not delivered for sourced stream with Interest policy when destination cluster was down for some time #5031

Open vazhem opened 9 months ago

vazhem commented 9 months ago

Observed behavior

Configuration: streamA(source) -> streamB

Configured a supercluster with JetStream enabled. streamA is hosted on clusterA (3 nodes), streamB is hosted on clusterB (3 nodes). streamA is a source for streamB. streamB has consumers with DeliverAllPolicy, AckExplicitPolicy, MaxDeliver=-1, MaxAckPending=1, InactiveThreshold=-1.
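
For reference, a minimal sketch of this topology using the nats.go client. The URLs, the subject events.>, and the durable name "workers" are illustrative and not taken from the report; InactiveThreshold=-1 is kept as stated above.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to each cluster (URLs illustrative).
	ncA, err := nats.Connect("nats://clusterA-host:4222")
	if err != nil {
		log.Fatal(err)
	}
	jsA, _ := ncA.JetStream()

	// Origin stream on clusterA with Interest retention.
	if _, err := jsA.AddStream(&nats.StreamConfig{
		Name:      "streamA",
		Subjects:  []string{"events.>"},
		Retention: nats.InterestPolicy,
		Replicas:  3,
	}); err != nil {
		log.Fatal(err)
	}

	ncB, err := nats.Connect("nats://clusterB-host:4222")
	if err != nil {
		log.Fatal(err)
	}
	jsB, _ := ncB.JetStream()

	// Sourced stream on clusterB that pulls from streamA.
	if _, err := jsB.AddStream(&nats.StreamConfig{
		Name:     "streamB",
		Sources:  []*nats.StreamSource{{Name: "streamA"}},
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}

	// Consumer settings as reported.
	if _, err := jsB.AddConsumer("streamB", &nats.ConsumerConfig{
		Durable:           "workers",
		DeliverPolicy:     nats.DeliverAllPolicy,
		AckPolicy:         nats.AckExplicitPolicy,
		MaxDeliver:        -1,
		MaxAckPending:     1,
		InactiveThreshold: -1,
	}); err != nil {
		log.Fatal(err)
	}
}
```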

Scenario:

  1. shutdown clusterB
  2. write messages to clusterA (the messages are dropped rather than retained until streamB comes back online)
  3. start clusterB
  4. no messages are delivered to streamB

Messages written to clusterA while clusterB is offline are never delivered to streamB after clusterB comes back online.
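
A rough sketch of steps 2 and 4, continuing from the configuration sketch above (same imports; jsA and jsB are the two JetStream contexts). The subject and message count are illustrative:

```go
// publishDuringOutage performs step 2: write to the origin stream while
// clusterB is shut down. With Interest retention and the source consumer
// gone, streamA ends up holding nothing (per the report).
func publishDuringOutage(jsA nats.JetStreamContext) {
	for i := 0; i < 10; i++ {
		if _, err := jsA.Publish("events.test", []byte("payload")); err != nil {
			log.Fatal(err)
		}
	}
	infoA, err := jsA.StreamInfo("streamA")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("streamA msgs after publish:", infoA.State.Msgs)
}

// checkAfterRestart performs step 4: once clusterB is back, streamB's
// last sequence shows the outage messages were never sourced.
func checkAfterRestart(jsB nats.JetStreamContext) {
	infoB, err := jsB.StreamInfo("streamB")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("streamB last seq:", infoB.State.LastSeq)
}
```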

Additional notes:

Expected behavior

Messages should be retained on the source stream streamA while clusterB is offline, and then delivered to streamB when clusterB comes back online.

Server and client version

embedded nats server v2.10.9

Host environment

No response

Steps to reproduce

No response

vazhem commented 8 months ago

Any comments or plans to fix this issue?

derekcollison commented 8 months ago

If the upstream is interest-based, the consumer that is pulling for the downstream stream will go away on an extended disconnect, and the messages will be removed.

We plan on taking a look at possibly improving this in 2.11 or 2.12.

Depending on your exact use case, there are possible workarounds.

vazhem commented 8 months ago

Our use case is to ensure message delivery to a remote cluster even in the case of extended downtime of that cluster. The assumption is that if clusterB is down and cannot consume messages, they should not be removed from the stream on clusterA other than by the stream's size or count limits. Currently we have to use the Limits retention policy as a workaround, but its drawback is that space for old, already-consumed messages is always used, even when clusterB is healthy.
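
The interim workaround described here, sketched for completeness (same imports as the earlier sketch, plus time). Retention cannot be changed on an existing stream, so the choice has to be made when streamA is first created; the limit values are illustrative:

```go
// declareWithLimits creates streamA with Limits retention instead of
// Interest: messages survive a clusterB outage up to the configured
// age/size caps, at the cost of keeping already-sourced messages
// around even while clusterB is healthy.
func declareWithLimits(jsA nats.JetStreamContext) error {
	_, err := jsA.AddStream(&nats.StreamConfig{
		Name:      "streamA",
		Subjects:  []string{"events.>"},
		Retention: nats.LimitsPolicy,
		MaxAge:    24 * time.Hour,         // must cover the longest expected outage
		MaxBytes:  8 * 1024 * 1024 * 1024, // 8 GiB cap, illustrative
		Replicas:  3,
	})
	return err
}
```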

derekcollison commented 8 months ago

Yes, understood. Again, we plan on improving this for sure.

One workaround may be to register a durable consumer with AckAll on the origin stream, driven from the remote cluster B side. It periodically wakes up, checks the stream info of the remote downstream stream, consumes from the origin stream (this can be headers-only to avoid shipping all message traffic twice), and then acks the sequence that matches the last sequence of the remote stream. A sketch follows below.
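
A minimal sketch of this workaround, assuming the nats.go client (same imports as the configuration sketch above, plus time). The durable name "trimmer", the batch size, and the wait time are illustrative; run the function periodically, e.g. from a ticker:

```go
// trimOrigin acks (and thereby lets the interest-based origin stream
// delete) everything that streamB has already received. jsA must be
// able to reach the origin stream, jsB the sourced stream.
func trimOrigin(jsA, jsB nats.JetStreamContext) error {
	// How far has the downstream, sourced stream caught up?
	infoB, err := jsB.StreamInfo("streamB")
	if err != nil {
		return err
	}
	caughtUp := infoB.State.LastSeq

	// Durable AckAll pull consumer on the origin stream. Headers-only
	// avoids shipping every payload over the wire a second time.
	// Note: do not call Unsubscribe -- in nats.go that deletes the durable.
	sub, err := jsA.PullSubscribe("", "trimmer",
		nats.BindStream("streamA"),
		nats.AckAll(),
		nats.HeadersOnly(),
	)
	if err != nil {
		return err
	}

	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(2*time.Second))
		if err == nats.ErrTimeout {
			return nil // nothing pending to trim
		}
		if err != nil {
			return err
		}

		reachedEnd := false
		var last *nats.Msg
		for _, m := range msgs {
			meta, err := m.Metadata()
			if err != nil {
				return err
			}
			if meta.Sequence.Stream > caughtUp {
				reachedEnd = true // past what streamB holds; stop trimming
				break
			}
			last = m
		}
		if last != nil {
			// With AckAll, acking this one message acknowledges everything
			// up to and including it, so the origin stream can drop them.
			if err := last.Ack(); err != nil {
				return err
			}
		}
		// Fetched-but-unacked messages are simply redelivered on a later run.
		if reachedEnd || len(msgs) < 100 {
			return nil
		}
	}
}
```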

This will work well until we offer a fix to the issue in the server.