Open cvoica opened 1 year ago
The consumer info shows a lot of outstanding acks: Outstanding Acks: 25,000 out of maximum 25,000
and the behaviour looks like a redelivery timeout kicking in while being blocked on outstanding acks.
How are you acking the messages? Can you show some snippets?
This is how I tested
nats con sub --ack md-test cvopush > a
The 25000 Outstanding Acks limit was set by me to be similar to my pull consumer. I'm able to process messages at the point where we see 25k outstanding acks. It means that the instant replay is working fine: messages are pushed to the consumer and acked.
In my app I fetch 5000, use AckAll, and ack the last of the 5000. The problem is this: the NATS server delivers 2x 5000, then I see a timeout on Fetch (default 5s, but I tried many other values), and after some seconds (20s, 30s), during which messages arrive only sporadically if at all, I start getting messages normally. It is also worth mentioning that while the consumer is not getting data (after the initial 2x 5000) I cannot run nats con info on that consumer; I always get a timeout.
Please have a look at this particular snippet. My understanding is the following:
The question that interests me is how to find out what the system does in the meantime. Why does it need so long between 2143 and 2145 if I tell it to deliver Instant and it's a push consumer?
2151.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002142.2143.1689967560702828680.1598645"
2152.json:Modify: 2023-07-23 18:50:30.225262431 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKvR"
2153.json:Modify: 2023-07-23 18:50:35.217285152 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKvR"
2154.json:Modify: 2023-07-23 18:50:35.225285189 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKwt"
2155.json:Modify: 2023-07-23 18:50:35.225285189 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKwt"
2156.json:Modify: 2023-07-23 18:50:40.225307947 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKyL"
2157.json:Modify: 2023-07-23 18:50:40.257308092 +0200 "$JS.ACK.md-test.cvopush.1.686002143.2144.1689967560702829747.1598644"
2158.json:Modify: 2023-07-23 18:50:45.217330668 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKyL"
2159.json:Modify: 2023-07-23 18:50:45.225330704 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKzn"
2160.json:Modify: 2023-07-23 18:50:45.225330704 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKzn"
2161.json:Modify: 2023-07-23 18:50:49.825351641 +0200 "$JS.ACK.md-test.cvopush.1.686002144.2145.1689967560702830614.1598643"
2162.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002145.2146.1689967560702831707.1598642"
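To make the ACK lines in this log easier to read, here is a minimal Python sketch that splits a $JS.ACK reply subject into named fields. It assumes the nine-token layout seen in the log ($JS.ACK, stream, consumer, delivery count, stream sequence, consumer sequence, timestamp in nanoseconds, pending count) and rejects anything else. The names AckSubject and parse_ack_subject are hypothetical helpers, not part of any NATS client.

```python
from typing import NamedTuple


class AckSubject(NamedTuple):
    stream: str
    consumer: str
    delivered: int      # how many times this message has been delivered
    stream_seq: int     # sequence of the message in the stream
    consumer_seq: int   # sequence of the delivery as seen by the consumer
    timestamp_ns: int   # timestamp token, nanoseconds since the Unix epoch
    pending: int        # messages still pending for this consumer


def parse_ack_subject(subject: str) -> AckSubject:
    """Split a nine-token $JS.ACK subject (assumed layout) into fields."""
    tokens = subject.split(".")
    if len(tokens) != 9 or tokens[:2] != ["$JS", "ACK"]:
        raise ValueError(f"not a nine-token $JS.ACK subject: {subject!r}")
    stream, consumer = tokens[2], tokens[3]
    delivered, stream_seq, consumer_seq, ts, pending = (int(t) for t in tokens[4:])
    return AckSubject(stream, consumer, delivered, stream_seq, consumer_seq, ts, pending)


# First ACK line of the log above (file 2151.json).
ack = parse_ack_subject(
    "$JS.ACK.md-test.cvopush.1.686002142.2143.1689967560702828680.1598645"
)
print(ack.stream_seq, ack.consumer_seq, ack.pending)
```

Read this way, the pair 686002142.2143 in the log is stream sequence 686002142 delivered as consumer sequence 2143.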
I have a question related to the Waiting Pulls: 10 of maximum 20 below. This output was taken immediately before the data starts flowing normally. In all earlier executions, nats con info fails with context deadline exceeded.
I do a Fetch of 100 (the first 2 fetches are always OK) and get a timeout after 5s (this timeout occurs several times in total until data starts flowing, hence the 10 waiting). Then I handle the error (just logging, as it might be that the stream really did not get any messages in the past 5s). Then I attempt the next Fetch. With this pattern I expect to see only one Fetch active, and this is indeed the case when messages are flowing OK. As you can see in the consumer info, shortly before the messages start flowing we see 10 pulls in the waiting list.
I investigated the nats-server consumer.go code but I'm slow as I'm still learning. I'm trying to understand why in my case the o.waiting list still keeps the Fetch request even if it times out. Perhaps this helps to narrow down the investigation.
$ nats con info md-test sGd5PWmI
Information for Consumer md-test > sGd5PWmI created 2023-07-25T10:15:58+02:00
Configuration:
Pull Mode: true
Filter Subject: md-test
Deliver Policy: From Sequence 1100000500
Ack Policy: All
Ack Wait: 30s
Replay Policy: Instant
Max Ack Pending: 300
Max Waiting Pulls: 20
Cluster Information:
Name: env-sim
Leader: nats-a
State:
Last Delivered Message: Consumer sequence: 800 Stream sequence: 1,100,001,299 Last delivery: 0.01s ago
Acknowledgment floor: Consumer sequence: 500 Stream sequence: 1,100,000,999 Last Ack: 0.01s ago
Outstanding Acks: 300 out of maximum 300
Redelivered Messages: 0
Unprocessed Messages: 50,182,503
Waiting Pulls: 10 of maximum 20
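The numbers in this output are consistent with each other, which can be checked with a little arithmetic. The sketch below assumes that with Ack Policy All and no gaps, the outstanding acks equal the last delivered consumer sequence minus the ack floor consumer sequence, and that delivery is suspended once that count reaches Max Ack Pending. ConsumerState is a hypothetical helper that only holds the values copied from the output.

```python
from dataclasses import dataclass


@dataclass
class ConsumerState:
    delivered_consumer_seq: int   # "Last Delivered Message: Consumer sequence"
    ack_floor_consumer_seq: int   # "Acknowledgment floor: Consumer sequence"
    max_ack_pending: int          # "Max Ack Pending"

    @property
    def outstanding(self) -> int:
        # With AckAll, everything up to the floor is acked,
        # so whatever was delivered after the floor is still outstanding.
        return self.delivered_consumer_seq - self.ack_floor_consumer_seq

    @property
    def at_limit(self) -> bool:
        # At the limit, no further messages are delivered until an ack arrives.
        return self.outstanding >= self.max_ack_pending


# Values taken from the consumer info above.
state = ConsumerState(
    delivered_consumer_seq=800,
    ack_floor_consumer_seq=500,
    max_ack_pending=300,
)
print(state.outstanding, state.at_limit)
```

The stream sequences give the same difference (1,100,001,299 minus 1,100,000,999), so at the moment this output was taken the consumer was sitting exactly at its Max Ack Pending limit.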
@Jarema I currently suspect that the rotation of the stream is making this behavior visible. I configured a stream with 4h rotation and it has only 5GB. The issue is easily reproducible like this, at least in our NATS cluster. The Fetch operations that work are followed by a "stall" period of a couple of seconds, depending on how many full rotations the stream has had. I created that stream on Monday and it was having small stalls after 8h, but they grew to almost 3s by Wednesday.
During these 3s even "nats con info" does not work, it takes longer. Perhaps you have some suggestion about how to approach this investigation to confirm (or not) this?
What version of NATS server are you running?
Version: 2.9.19. I installed 2.9.20 and it's the same behavior.
During my test I ran strace on the nats-server and saw that it is waiting on a mutex during the period when I do not get data from it.
16:47:57.228729 futex(0xc000142148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
16:48:01.106358 epoll_pwait(3, [], 128, 0, NULL, 151887781672611) = 0
I do not have time now to continue my analysis, but I will come back to this issue once I have more time.
@Jarema I had time to check this issue again. After more tests I noticed that it works fine if I use AckNone or AckExplicit. My question now is whether AckAll is supposed to work with a pull consumer?
I'm having a problem in my cluster with several streams when I recover (by seq or time): the consumers get some messages, then I see timeouts (5s default) a couple of times before data starts flowing as usual.
NATS pushed 2143 messages to the consumer, they were acked, and then the nats CLI asked for CONSUMER.MSG.NEXT several times, with timeouts:
The requests are at 18:50:30.225262431, 18:50:35.217285152 and 18:50:40.225307947. Then we see the ack for 686002143 at 18:50:40.257308092, which is message 2144 as the consumer sees it. Then there is a timeout again at 18:50:45.225330704. The request at 18:50:45.225330704 then managed to get some data, as we see at 18:50:49.825351641, when the ACKs start to flow again with 686002144.2145.
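The length of each stall can be read off the log by subtracting consecutive ACK timestamps. Here is a minimal Python sketch of that calculation, using the three ACK times from the log (files 2151.json, 2157.json and 2161.json). The helper names parse_ts and gaps are hypothetical, and the sketch assumes all timestamps fall on the same day.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse 'HH:MM:SS.nnnnnnnnn' as printed by stat.

    datetime keeps microseconds only, so the fraction is trimmed to 6 digits.
    """
    head, frac = ts.split(".")
    return datetime.strptime(f"{head}.{frac[:6]}", "%H:%M:%S.%f")


def gaps(entries):
    """Return (previous label, next label, seconds between them) for each pair."""
    result = []
    for (label_a, ts_a), (label_b, ts_b) in zip(entries, entries[1:]):
        seconds = (parse_ts(ts_b) - parse_ts(ts_a)).total_seconds()
        result.append((label_a, label_b, seconds))
    return result


# ACK lines from the log, labelled by the consumer sequence in the subject.
acks = [
    ("ack 2143", "18:50:29.441258863"),
    ("ack 2144", "18:50:40.257308092"),
    ("ack 2145", "18:50:49.825351641"),
]

for prev, nxt, seconds in gaps(acks):
    print(f"{prev} -> {nxt}: {seconds:.3f}s")
```

The same function can be fed the CONSUMER.MSG.NEXT lines to confirm that the requests are spaced by the 5s Fetch timeout.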
Please let me know how I could investigate further what exactly is going on during those 5s and why it starts working after a while.
This output is based on nats con sub --trace --dump=. followed by extracting the modification time of each file and some fields:
for f in $(ls | sort -n); do m=$(stat $f | grep Modify); echo $f":"$m" "$(cat $f| jq '.Subject '+' .Reply') | tee -a Modify.Subject.Reply.log; done
The stream
I have this issue in my own application (Pull consumer) but here is a test with nats cli and a push consumer