nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Messages getting dropped during network outage when leaf node stream with work queue policy is aggregated on remote stream #3207

Open bjurr opened 2 years ago

bjurr commented 2 years ago

Defect

When a leaf node has a stream configured with the work queue retention policy and this stream is aggregated into a remote stream, messages may be silently dropped if a network outage occurs between the leaf node and the cluster (this is especially noticeable over Wi-Fi).

Versions of nats-server and affected client libraries used:

OS/Container environment:

Server:

Client/Leaf node:

Steps or code to reproduce the issue:

Testing this is a somewhat lengthy manual process.

docker run --rm -ti -v $(pwd)/nats.config:/etc/nats/nats-server.conf --name=nats-server -p 4222:4222 -p 6222:6222 -p 8222:8222 -p 7422:7422 nats -m 8222 -c /etc/nats/nats-server.conf

On a separate computer, launch a leaf node with JetStream enabled using the following sample configuration (save as nats-leaf.config):

listen: "0.0.0.0:4222"
server_name: leaf-server1
jetstream {
    domain = leaf1
}
leafnodes {
    remotes = [
        {
            url: "nats-leaf://<ip-of-remote-server>:7422"
        },
    ]
}

docker run --rm -ti -v $(pwd)/nats-leaf.config:/etc/nats/nats-server.conf --name=nats-leaf1 -p 4222:4222 -p 6222:6222 -p 8222:8222 nats -m 8222 -c /etc/nats/nats-server.conf
nats --server=127.0.0.1 stream add --config=leaf-stream.json
nats --server=<ip-of-remote-server> stream add --config=remote-stream.json

Expected result:

Messages should be persisted on the leaf node stream when the remote stream is not reachable.

Actual result:

Messages destined for the leaf node stream are dropped until the leaf node realizes it is disconnected from the cluster. Only at that point do messages start being persisted in the stream.

derekcollison commented 2 years ago

Currently we use R1 ephemerals which default to an idle cleanup of 5s IIRC.

My thought here would be to allow this to be configurable for downstream streams when configuring a mirror or source.

PKuchibhatla commented 1 year ago

Has this been resolved?

derekcollison commented 1 year ago

No. Extended outages of the leafnode connection could still cause issues, since messages may be deleted locally if the stream uses an interest-based retention policy.

kuriboww commented 8 months ago

This appears to be an issue I am facing as well. Please investigate with network latency and jitter.

leandrofars commented 1 month ago

Facing the same issue here

leandrofars commented 1 month ago

Is there a way to keep messages in the leaf node's stream even when the connection to the remote nats-server is lost?

jnmoyne commented 1 month ago

> Is there a way to keep messages in the leaf node's stream even when the connection to the remote nats-server is lost?

Yes: use a limit (rather than work queue or interest) stream to source from, e.g. set a 'max age' limit that is as long as the longest network outage that you want to be able to recover from.
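As a rough sketch of that suggestion (not the configuration from the original report, whose leaf-stream.json and remote-stream.json were not posted), the two stream definitions might look like the following. The stream names LEAF_EVENTS and AGGREGATE, the subject events.>, and the 24-hour max age are assumptions made up for illustration; only the leaf1 domain comes from the leaf node config above. max_age is given in nanoseconds, and the remote stream reaches the leaf's JetStream domain through the $JS.leaf1.API prefix.

```shell
# Hypothetical leaf stream: limits retention with a 24h max age, so messages
# stay on the leaf until they expire, whether or not the remote has read them.
cat > leaf-stream.json <<'EOF'
{
  "name": "LEAF_EVENTS",
  "subjects": ["events.>"],
  "retention": "limits",
  "storage": "file",
  "max_age": 86400000000000,
  "max_msgs": -1,
  "max_bytes": -1,
  "discard": "old",
  "num_replicas": 1
}
EOF

# Hypothetical remote stream: sources from the leaf stream across the
# "leaf1" JetStream domain declared in the leaf node config.
cat > remote-stream.json <<'EOF'
{
  "name": "AGGREGATE",
  "retention": "limits",
  "storage": "file",
  "num_replicas": 1,
  "sources": [
    {
      "name": "LEAF_EVENTS",
      "external": { "api": "$JS.leaf1.API" }
    }
  ]
}
EOF
```

These files can then be loaded with the same `nats ... stream add --config=leaf-stream.json` and `nats ... stream add --config=remote-stream.json` commands shown in the reproduction steps. The max age should be at least as long as the longest outage you want to recover from; the trade-off is that the leaf keeps messages until they age out rather than removing them once consumed.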