Open casix opened 3 months ago
I set two the leaf nodes servers to trace mode and analyzed the logs.
What I observed is this:
(I tried to clean up the logs. I've left some lines that I think aren't important, but I'm not sure. I deleted lines that I believe are not relevant (I hope I didn't delete any important ones))
When I send a message to the stream with the highest sequence:
Logs from the server that receives the message:
[8924] 2024/08/29 11:11:14.101905 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- [LMSG deliver.cloud.9f4849d5-45ff-403f-a955-3e4042685f72.M.PfgeANUG $JS.ACK.dev-rc_devices_down_9f4849d5-45ff-403f-a955-3e4042685f72.i3UZezSD.1.4064.4061.1724922674167386096.0@dev-rc.devices.down.9f4849d5-45ff-403f-a955-3e4042685f72 4]
[8924] 2024/08/29 11:11:14.101905 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- MSG_PAYLOAD: ["hola"]
[8924] 2024/08/29 11:11:14.102555 [TRC] 127.0.0.1:51140 - cid:36 - ->> [MSG dev-rc.devices.down.9f4849d5-45ff-403f-a955-3e4042685f72 3 $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 4]
[8924] 2024/08/29 11:11:14.104112 [TRC] 127.0.0.1:51140 - cid:36 - <<- [PUB $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
[8924] 2024/08/29 11:11:14.104656 [TRC] 127.0.0.1:51140 - cid:36 - <<- MSG_PAYLOAD: ["+TERM"]
[8924] 2024/08/29 11:11:14.104656 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
[8924] 2024/08/29 11:11:14.104656 [TRC] 127.0.0.1:51140 - cid:36 - <<- [PUB $JS.API.CONSUMER.MSG.NEXT.down.down_consumer _INBOX.Wpoi6ePfT40yDp1cirdnpL 35]
[8924] 2024/08/29 11:11:14.104656 [TRC] 127.0.0.1:51140 - cid:36 - <<- MSG_PAYLOAD: ["{\"batch\": 1, \"expires\": 4999900000}"]
[8924] 2024/08/29 11:11:14.104656 [DBG] 54.171.110.45:443 - lid_ws:27 - Not permitted to deliver to "$JS.API.CONSUMER.MSG.NEXT.down.down_consumer"
Logs from the other server:
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["+TERM"]
When I send a message to the stream that does not have the highest sequence:
Logs from the server that receives the message:
[4672] 2024/08/29 11:12:04.711266 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG deliver.cloud.9a333ee0-b3d3-4393-bf13-320a63dd1b50.M.PXFU256b $JS.ACK.dev-rc_devices_down_9a333ee0-b3d3-4393-bf13-320a63dd1b50.LD5rWYj2.1.4064.4064.1724922724697142996.0@dev-rc.devices.down.9a333ee0-b3d3-4393-bf13-320a63dd1b50 4]
[4672] 2024/08/29 11:12:04.711266 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["hola"]
[4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.zX2vOagH 75 75]
[4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.EhQlL4ZA 75 75]
[4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.WjICBx4N 75 75]
[4672] 2024/08/29 11:12:04.810368 [TRC] 52.30.103.187:443 - lid_ws:21 - ->> [HMSG deliver.9a333ee0-b3d3-4393-bf13-320a63dd1b50.cloud.S.YsTdnk8j 75 75]
[4672] 2024/08/29 11:12:04.856976 [TRC] 127.0.0.1:51085 - cid:26 - ->> [HMSG _INBOX.prq3VgHHPEiLyRURkcw7i1 2 81 81]
[4672] 2024/08/29 11:12:04.918818 [TRC] 127.0.0.1:51085 - cid:26 - <<- [PUB $JS.API.CONSUMER.MSG.NEXT.sessions.events_sessions_consumer _INBOX.prq3VgHHPEiLyRURkcw7i1 35]
[4672] 2024/08/29 11:12:04.918818 [TRC] 127.0.0.1:51085 - cid:26 - <<- MSG_PAYLOAD: ["{\"batch\": 1, \"expires\": 4999900000}"]
[4672] 2024/08/29 11:12:04.918818 [DBG] 52.30.103.187:443 - lid_ws:21 - Not permitted to deliver to "$JS.API.CONSUMER.MSG.NEXT.sessions.events_sessions_consumer"
[4672] 2024/08/29 11:12:05.711545 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [HMSG deliver.cloud.9a333ee0-b3d3-4393-bf13-320a63dd1b50.M.PXFU256b 81 81]
[4672] 2024/08/29 11:12:05.711636 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["NATS/1.0 100 Idle Heartbeat\r\nNats-Last-Consumer: 4064\r\nNats-Last-Stream: 4064\r\n\r\n"]
Logs from the other server:
[8924] 2024/08/29 11:12:04.130224 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- [HMSG deliver.cloud.9f4849d5-45ff-403f-a955-3e4042685f72.M.PfgeANUG 81 81]
[8924] 2024/08/29 11:12:04.130224 [TRC] 54.171.110.45:443 - lid_ws:27 - <<- MSG_PAYLOAD: ["NATS/1.0 100 Idle Heartbeat\r\nNats-Last-Consumer: 4061\r\nNats-Last-Stream: 4064\r\n\r\n"]
[8924] 2024/08/29 11:12:04.130389 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [HMSG deliver.9f4849d5-45ff-403f-a955-3e4042685f72.cloud.S.uGuKHIiG 75 75]
[8924] 2024/08/29 11:12:04.782781 [TRC] 127.0.0.1:51140 - cid:36 - ->> [HMSG _INBOX.Wpoi6ePfT40yDp1cirdnpL 3 81 81]
[8924] 2024/08/29 11:12:04.845309 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [HMSG deliver.9f4849d5-45ff-403f-a955-3e4042685f72.cloud.S.qRFCuY6K 75 75]
[8924] 2024/08/29 11:12:04.845309 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [HMSG deliver.9f4849d5-45ff-403f-a955-3e4042685f72.cloud.S.Yh26ypG9 75 75]
[8924] 2024/08/29 11:12:04.845309 [TRC] 127.0.0.1:51140 - cid:36 - <<- [PUB $JS.API.CONSUMER.MSG.NEXT.down.down_consumer _INBOX.Wpoi6ePfT40yDp1cirdnpL 35]
[8924] 2024/08/29 11:12:04.845309 [TRC] 127.0.0.1:51140 - cid:36 - <<- MSG_PAYLOAD: ["{\"batch\": 1, \"expires\": 4999900000}"]
[8924] 2024/08/29 11:12:04.845309 [DBG] 54.171.110.45:443 - lid_ws:27 - Not permitted to deliver to "$JS.API.CONSUMER.MSG.NEXT.down.down_consumer"
I don't have the expertise to fully analyze this, but maybe it helps.
What seems most suspicious to me is this message:
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- [LMSG $JS.ACK.down.down_consumer.1.4064.5.1724922674167386096.0 5]
[4672] 2024/08/29 11:11:14.228083 [TRC] 52.30.103.187:443 - lid_ws:21 - <<- MSG_PAYLOAD: ["+TERM"]
This ack message is received by both servers, but I understand that only the server where the consumer receives the message should receive it. However, I could be wrong...
I made a test using different consumer names on each leaf node. I also had to change the way we consume messages from our Python program to a CLI command. Now, when I send a message to the stream with the highest sequence, the logs are:
[8924] 2024/08/29 13:07:40.204543 [TRC] 127.0.0.1:57693 - cid:50 - ->> [MSG dev-rc.devices.down.9f4849d5-45ff-403f-a955-3e4042685f72 2 $JS.ACK.down.down_consumer_9f4849d5-45ff-403f-a955-3e4042685f72.1.4069.3.1724929631952185043.0 4]
[8924] 2024/08/29 13:07:40.204543 [TRC] 127.0.0.1:57693 - cid:50 - <-> [DELSUB 2]
[8924] 2024/08/29 13:07:40.204543 [TRC] 54.171.110.45:443 - lid_ws:27 - ->> [LS- _INBOX.T3dKjp9BZd0iB8ioQ8TbLO.NKVZp7uj]
[8924] 2024/08/29 13:07:40.205055 [TRC] 127.0.0.1:57693 - cid:50 - <<- [PUB $JS.ACK.down.down_consumer_9f4849d5-45ff-403f-a955-3e4042685f72.1.4069.3.1724929631952185043.0 0]
[8924] 2024/08/29 13:07:40.205055 [TRC] 127.0.0.1:57693 - cid:50 - <<- MSG_PAYLOAD: [""]
[8924] 2024/08/29 13:07:40.205055 [TRC] 127.0.0.1:57693 - cid:50 - <<- [PING]
Another important point is that we are mostly sure this behavior appeared when we changed the consumers from push to pull.
What's happening here appears to be 2 domains with identical stream and consumer names.
The ack messages like $JS.ACK.STREAM.CONSUMER.1.2.2.1725279000278321015.0
are not domain scoped and so the leaf nodes with the same stream and consumer name both get the acks.
Because we don't track what are valid outstanding ACKs and just accept any ACK this causes actual ACKs to happen.
Best option today is to make unique stream names in each domain.
We do have a domain aware ACK format, but at least in my setup and yours that's not used - asking around to find out how that works.
Best option today is to make unique stream names in each domain.
Make unique consumers is valid too? or is better to make unique streams?
Unique consumer would also avoid it yeah
Observed behavior
Consumers increase the delivered last sequence, even though the stream they are connected to hasn't received any messages
Expected behavior
Each consumer only increases its own delivered last sequence when it reads a message from its stream.
Server and client version
In the cloud:
In the leafnodes:
Host environment
Cloud Environment:
IoT Environment:
Stream and Consumer Setup: Each IoT device has the following NATS resources:
Specific Example (for IoT device “Alice”):
Steps to reproduce
Set up an environment as described, and send a message to down_alice with a client connected directly to the cloud. Then, check the delivered last sequence in all consumers.
Side effects
If we have Alice and Bob IoT devices, and we send 10 messages to Alice, both Bob's and Alice's consumers will have their delivered last sequence set to 10. If I now send a message to Bob, the message is lost. Both streams, '__down_bob' and 'down__' (the local one), increase their own last sequence, but Bob's consumer does not change anything.
If I continue sending messages to Bob, the behavior remains the same until the sequence number of a message is 11. At that point, because it is greater than the consumer's delivered last sequence, the consumer reads it. However, Alice's consumer also increases its own last delivered sequence.
I suspect that the messages lost is because the 'down' queue is a workqueue, and since the consumer's last delivered sequence is higher than the sequence number of the message, the consumer assumes the message has already been read.
The stream with the highest sequence number will work (the messages will be read by the consumer), while the other consumers will discard all messages until their stream has the highest sequence.
If you need more information, feel free to ask.
Thanks!