rubencosta opened 1 month ago
Please show the nats commands with --trace set for the ones that fail, thank you.
Thanks @ripienaar, I have updated the description although there were no extra logs.
Hmm, ok, that's a pity - essential to figure it out. OK, I'll try to reproduce and add some debug info. Will move to the CLI repo for now.
@ripienaar I should add that we first observed this in our app code using the Go client with a cluster deployed in Kubernetes. I used the nats CLI for the simplest reproduction case I could manage.
Sample Go code would be good also - nats uses its own client library so it might be a bit weird if I have bugs hehe... but I'll look into it anyway.
Currently the meta layer requires all peers to be online to place assets. That information is updated asynchronously, so it could be that the meta layer had not yet processed the offline state when it allowed the placement.
We have had some requests from customers for peer selection with just quorum semantics.
@derekcollison Not sure if I understand what you mean, but maybe this helps.
Here we can see that after stopping node c, the state is correctly reported in the RAFT Meta Group Information while the issue still reproduces.
╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├────────┬──────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬───────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├────────┼──────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼───────────┤
│ a │ test-cluster │ 1 │ 0 │ 1 │ 24 B │ 24 B │ 0 B │ 0 │ 0 │
│ b* │ test-cluster │ 1 │ 0 │ 1 │ 24 B │ 24 B │ 0 B │ 8 │ 1 / 12.5% │
├────────┼──────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼───────────┤
│ │ │ 2 │ 0 │ 2 │ 48 B │ 48 B │ 0 B │ 8 │ 1 │
╰────────┴──────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴───────────╯
╭───────────────────────────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├─────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Connection Name │ ID │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ a │ GR5IGR3G │ │ true │ true │ 97ms │ 0 │
│ b │ 0ZkM0vRC │ yes │ true │ true │ 0s │ 0 │
│ c │ k1i3jIye │ │ false │ false │ 19.10s │ 1 │
╰─────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯
❯ nats subscribe test --all --trace
19:09:22 Subscribing to JetStream Stream holding messages with subject test starting with the first message received
<<< Reply Subject: $JS.ACK.test.yEK7pGyN.1.1.1.1717693739474051672.0
[#1] Received JetStream message: stream: test seq 1 / subject: test / time: 2024-06-06T19:08:59+02:00
test
^C⏎
❯ nats subscribe test --all --trace
19:09:26 Subscribing to JetStream Stream holding messages with subject test starting with the first message received
nats: error: context deadline exceeded
Is the stream an interest or workqueue stream?
It's trying to create a consumer, and I would assume it's asking for R3 and can't see all peers online.
Not sure if the nats CLI would accept it here, but try --replicas 1.
If interest or WQ, it requires the same replica count. We are adding a viewer that bypasses consumers and uses direct gets. @ripienaar has additional information and timing on that.
Only nats s view will use direct get; for sub it's always a subscription. So I think Derek has nailed the reason for this.
Thanks @ripienaar and @derekcollison, it seems like we were still using the deprecated subscribe API, and in the process of trying to reproduce the bug I ended up using nats sub, which also suffers from the same issue. I can confirm that using an ordered consumer works 100% of the time.
Observed behavior
The subscription fails with a timeout on roughly 33% of tries. This is unexpected because the cluster should still be functional after a single node failure. Given the failure rate, and the fact that removing the server from the cluster fixes the issue, I assume NATS Server is routing the request internally to any of the cluster nodes, even a node reported as offline.
Expected behavior
I expect the subscription to work 100% of the time, since the cluster should still be functional after a single node failure.
Server and client version
nats-server: v2.10.16
nats: v0.1.4
Host environment
No response
Steps to reproduce
1. Create a 3-node cluster with JetStream enabled
2. Create a test stream
3. Publish a test message
4. Create a subscription and observe that it always works
5. Stop any of the servers, create a subscription again, and observe a timeout