nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Consumer Update Delivered Last Sequence Behavior [v2.10.17] #5844

Open casix opened 3 months ago

casix commented 3 months ago

Observed behavior

Consumers increase their delivered last sequence even though the stream they are attached to hasn't received any new messages.

Expected behavior

Each consumer only increases its own delivered last sequence when it reads a message from its stream.

Server and client version

In the cloud:

In the leafnodes:

Host environment

Cloud Environment:

IoT Environment:

Stream and Consumer Setup: Each IoT device has the following NATS resources:

Specific Example (for IoT device “Alice”):

Steps to reproduce

Set up an environment as described, and send a message to down_alice with a client connected directly to the cloud. Then, check the delivered last sequence in all consumers.

Side effects

If we have Alice and Bob IoT devices and we send 10 messages to Alice, both Bob's and Alice's consumers end up with their delivered last sequence set to 10. If I now send a message to Bob, the message is lost. Both streams, '__down_bob' and 'down__' (the local one), increase their own last sequence, but Bob's consumer state does not change.

If I continue sending messages to Bob, the behavior remains the same until a message's sequence number reaches 11. At that point, because it is greater than the consumer's delivered last sequence, the consumer reads it. However, Alice's consumer also increases its own delivered last sequence.

I suspect the messages are lost because the 'down' queue is a work queue: since the consumer's delivered last sequence is higher than the message's sequence number, the consumer assumes the message has already been read.

The stream with the highest sequence number will work (the messages will be read by the consumer), while the other consumers will discard all messages until their stream has the highest sequence.
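To make the failure mode concrete, here is a minimal toy simulation of the behavior described above. It is not the nats-server implementation; the class, method names, and skip rule are my own sketch of the hypothesis in this report, and the numbers (10 messages to Alice) follow the example given.

```python
# Toy model of the suspected bug: an ACK from one domain leaks into the
# other domain's consumer (same stream/consumer names, same ACK subject)
# and bumps its delivered last sequence, so later messages are skipped.

class WorkQueueConsumer:
    """A toy pull consumer that discards messages at or below its
    delivered last sequence, as hypothesized above."""

    def __init__(self, name: str):
        self.name = name
        self.delivered_last_seq = 0

    def on_message(self, seq: int) -> bool:
        """Return True if the message is delivered, False if discarded."""
        if seq <= self.delivered_last_seq:
            return False  # assumed already read -> message silently lost
        self.delivered_last_seq = seq
        return True

    def on_foreign_ack(self, seq: int) -> None:
        """The bug: an ACK from the *other* domain also bumps this consumer."""
        self.delivered_last_seq = max(self.delivered_last_seq, seq)


alice = WorkQueueConsumer("alice")
bob = WorkQueueConsumer("bob")

# Send 10 messages to Alice; each ACK leaks to Bob's consumer as well.
for seq in range(1, 11):
    alice.on_message(seq)
    bob.on_foreign_ack(seq)

assert bob.delivered_last_seq == 10  # Bob never saw a single message

# Messages 1..10 sent to Bob are now silently discarded...
assert not bob.on_message(1)

# ...until Bob's stream sequence finally exceeds 10.
assert bob.on_message(11)
```

Under this model, only the domain whose stream has the highest sequence delivers messages, which matches the observation above.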

If you need more information, feel free to ask.

Thanks!

casix commented 3 months ago

I set the two leaf node servers to trace mode and analyzed the logs.

What I observed is this:

(I tried to clean up the logs: I deleted lines I believe are not relevant and left a few I'm unsure about. I hope I didn't remove anything important.)

I don't have the expertise to fully analyze this, but maybe it helps.

What seems most suspicious to me are these messages:

[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["+TERM"]

This ack message is received by both servers, but my understanding is that only the server where the consumer received the message should get it. However, I could be wrong...

casix commented 3 months ago

I ran a test using different consumer names on each leaf node. I also had to change how we consume messages, from our Python program to a CLI command. Now, when I send a message to the stream with the highest sequence, the logs are:

Another important point: we are fairly sure this behavior appeared when we changed the consumers from push to pull.

ripienaar commented 2 months ago

What's happening here appears to be 2 domains with identical stream and consumer names.

The ack messages like $JS.ACK.STREAM.CONSUMER.1.2.2.1725279000278321015.0 are not domain scoped and so the leaf nodes with the same stream and consumer name both get the acks.

Because we don't track which outstanding ACKs are valid and simply accept any ACK, these stray ACKs cause actual acknowledgements to happen.
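To illustrate the collision, here is a small parser for the ACK reply subject seen in the trace above. The field names are my reading of the tokens in the example (stream, consumer, delivery count, stream sequence, consumer sequence, timestamp, pending), not an official reference:

```python
def parse_ack_subject(subject: str) -> dict:
    """Split a $JS.ACK reply subject into its tokens.

    Assumed layout (from the example in this thread):
    $JS.ACK.<stream>.<consumer>.<delivered>.<stream_seq>.<consumer_seq>.<ts>.<pending>
    """
    tokens = subject.split(".")
    if tokens[:2] != ["$JS", "ACK"] or len(tokens) != 9:
        raise ValueError(f"unexpected ACK subject: {subject}")
    return {
        "stream": tokens[2],
        "consumer": tokens[3],
        "delivered": int(tokens[4]),
        "stream_seq": int(tokens[5]),
        "consumer_seq": int(tokens[6]),
        "timestamp": int(tokens[7]),
        "pending": int(tokens[8]),
    }


ack = parse_ack_subject("$JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0")
# Note: no domain token anywhere. Two leaf-node domains that both have a
# stream "down" with a consumer "down_consumer" therefore produce (and
# match) the exact same ACK subject, so each side also consumes the
# other's acknowledgements.
```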

ripienaar commented 2 months ago

Best option today is to make unique stream names in each domain.
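One way to apply this workaround is to fold the domain name into the stream (or consumer) name so the ACK subjects can no longer collide. The naming scheme and the domain names below are illustrative, not taken from the report or from any official convention:

```python
def scoped_name(base: str, domain: str) -> str:
    """Make a stream/consumer name unique per JetStream domain so that
    the resulting $JS.ACK subjects differ between domains.
    Illustrative naming scheme only."""
    return f"{base}_{domain}"


# With per-domain names, each domain's ACK subject is distinct:
for domain in ("leaf-alice", "leaf-bob"):
    stream = scoped_name("down", domain)
    consumer = scoped_name("down_consumer", domain)
    print(f"$JS.ACK.{stream}.{consumer}.<delivered>.<sseq>.<cseq>.<ts>.<pending>")
```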

ripienaar commented 2 months ago

We do have a domain aware ACK format, but at least in my setup and yours that's not used - asking around to find out how that works.

casix commented 2 months ago

Best option today is to make unique stream names in each domain.

Is making unique consumers valid too, or is it better to make unique streams?

ripienaar commented 2 months ago

Unique consumer would also avoid it yeah