nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Fetch time out with DeliverByStartSequence and ReplayInstant consumer #4334

Open cvoica opened 1 year ago

cvoica commented 1 year ago

I'm having a problem in my cluster with several streams when I recover (by seq or time): the consumers get some messages, then I see timeouts (5s default) a couple of times before data starts flowing as usual.

NATS pushed 2143 messages to the consumer; they were acked, and then the nats CLI asked for CONSUMER.MSG.NEXT several times, with timeouts.

Timeouts occur at 18:50:30.225262431, 18:50:35.217285152 and 18:50:40.225307947. Then we see the ack for 686,002,143 at 18:50:40.257308092, which is the 2144th message the consumer sees. Then again a timeout at 18:50:45.225330704. The request at 18:50:45.225330704 managed to get some data, as the ACKs start to flow again at 18:50:49.825351641 with 686002144.2145.

Please let me know how I could investigate further what exactly is going on during those 5s and why it starts working after a while.

This output is based on nats con sub --trace --dump=. and on extracting the modification time of each dumped file plus some fields:

for f in $(ls | sort -n); do m=$(stat $f | grep Modify); echo $f":"$m" "$(cat $f| jq '.Subject '+' .Reply') | tee -a Modify.Subject.Reply.log; done

2143.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002134.2135.1689967560702816418.1598653"
2144.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002135.2136.1689967560702820314.1598652"
2145.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002136.2137.1689967560702821410.1598651"
2146.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002137.2138.1689967560702822541.1598650"
2147.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002138.2139.1689967560702823569.1598649"
2148.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002139.2140.1689967560702824541.1598648"
2149.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002140.2141.1689967560702826550.1598647"
2150.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002141.2142.1689967560702827817.1598646"
2151.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002142.2143.1689967560702828680.1598645"
2152.json:Modify: 2023-07-23 18:50:30.225262431 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKvR"
2153.json:Modify: 2023-07-23 18:50:35.217285152 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKvR"
2154.json:Modify: 2023-07-23 18:50:35.225285189 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKwt"
2155.json:Modify: 2023-07-23 18:50:35.225285189 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKwt"
2156.json:Modify: 2023-07-23 18:50:40.225307947 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKyL"
2157.json:Modify: 2023-07-23 18:50:40.257308092 +0200 "$JS.ACK.md-test.cvopush.1.686002143.2144.1689967560702829747.1598644"
2158.json:Modify: 2023-07-23 18:50:45.217330668 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKyL"
2159.json:Modify: 2023-07-23 18:50:45.225330704 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKzn"
2160.json:Modify: 2023-07-23 18:50:45.225330704 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKzn"
2161.json:Modify: 2023-07-23 18:50:49.825351641 +0200 "$JS.ACK.md-test.cvopush.1.686002144.2145.1689967560702830614.1598643"
2162.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002145.2146.1689967560702831707.1598642"
2163.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002146.2147.1689967560702832729.1598641"
2164.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002147.2148.1689967560702833575.1598640"
2165.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002148.2149.1689967560702834523.1598639"
2166.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002149.2150.1689967560702835397.1598638"
2167.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002150.2151.1689967560703843123.1598637"
2168.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002151.2152.1689967560703846937.1598636"
2169.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002152.2153.1689967560703850316.1598635"
2170.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002153.2154.1689967560703851563.1598634"

The stream

$ nats stream info md-test
Information for Stream md-test created 2023-07-05T14:03:16+02:00

Configuration:

             Subjects: md-test
     Acknowledgements: true
            Retention: File - Limits
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: 100 GiB
          Maximum Age: 2d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: env-sim
               Leader: nats-c
              Replica: nats-a, current, seen 0.96s ago
              Replica: nats-b, current, seen 0.96s ago

State:

             Messages: 6,757,750
                Bytes: 2.1 GiB
             FirstSeq: 680,843,038 @ 2023-07-21T16:59:07 UTC
              LastSeq: 687,600,787 @ 2023-07-21T20:27:12 UTC
     Active Consumers: 2

I have this issue in my own application (pull consumer), but here is a test with the nats CLI and a push consumer.

$ nats con info md-test cvopush
Information for Consumer md-test > cvopush created 2023-07-23T18:50:01+02:00

Configuration:

        Durable Name: cvopush
    Delivery Subject: cvopush
      Filter Subject: md-test
      Deliver Policy: From Sequence 686000000
          Ack Policy: All
            Ack Wait: 30s
       Replay Policy: Instant
     Max Ack Pending: 25,000
        Flow Control: false

Cluster Information:

                Name: env-sim
              Leader: nats-c
             Replica: nats-a, current, seen 0.45s ago
             Replica: nats-b, current, seen 0.45s ago

State:

   Last Delivered Message: Consumer sequence: 53,139 Stream sequence: 686,053,138 Last delivery: 8m58s ago
     Acknowledgment floor: Consumer sequence: 28,139 Stream sequence: 686,028,138 Last Ack: 8m58s ago
         Outstanding Acks: 25,000 out of maximum 25,000
     Redelivered Messages: 0
     Unprocessed Messages: 1,547,649
          Active Interest: No interest

Jarema commented 1 year ago

The consumer info shows a lot of outstanding acks (Outstanding Acks: 25,000 out of maximum 25,000), and the behaviour looks like a redelivery timeout kicking in while being blocked on outstanding acks.

How are you acking the messages? Can you show some snippets?

cvoica commented 1 year ago

This is how I tested

nats con sub --ack md-test cvopush > a

The 25,000 Outstanding Acks limit is set by me to be similar to my pull consumer. I'm able to process messages at the point where we see 25k outstanding acks. It means the instant replay is working fine: messages are pushed to the consumer and acked.

In my app I fetch 5000, use AckAll, and ack the last of the 5000 (a minimal sketch of this pattern follows below). The problem is this: the NATS server delivers 2x 5000, then I see a timeout on Fetch (default 5s, but I played with many other values), and after some seconds (20s, 30s), during which I sometimes see something other than messages, I start getting messages normally again. What is also worth mentioning is that during the time the consumer is not getting data (after the initial 2x 5000) I cannot run nats con info on that consumer; I always get a timeout.
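Roughly, a minimal sketch of that setup and ack pattern (using the nats.go client; the connection URL and durable name are placeholders, not my actual code, and the options mirror what I described above):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; the real cluster details differ.
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Pull consumer roughly matching the description: AckAll,
	// DeliverByStartSequence, ReplayInstant, MaxAckPending 25,000.
	sub, err := js.PullSubscribe("md-test", "cvopull",
		nats.AckAll(),                 // acking one message acknowledges all preceding ones
		nats.StartSequence(686000000), // DeliverByStartSequence
		nats.ReplayInstant(),
		nats.MaxAckPending(25000),
	)
	if err != nil {
		log.Fatal(err)
	}

	msgs, err := sub.Fetch(5000, nats.MaxWait(5*time.Second))
	if err != nil {
		log.Printf("fetch error: %v", err)
		return
	}
	if len(msgs) > 0 {
		// With AckPolicy All, acking only the last message of the batch
		// acknowledges the whole batch of 5000.
		if err := msgs[len(msgs)-1].Ack(); err != nil {
			log.Printf("ack error: %v", err)
		}
	}
}
```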

Please have a look at this particular snippet. My understanding is the following:

The question that interests me is how to find out what the system does in the meantime. Why does it need so long between 2143 and 2145 if I tell it to deliver Instant and it's a push consumer?

2151.json:Modify: 2023-07-23 18:50:29.441258863 +0200 "$JS.ACK.md-test.cvopush.1.686002142.2143.1689967560702828680.1598645"
2152.json:Modify: 2023-07-23 18:50:30.225262431 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKvR"
2153.json:Modify: 2023-07-23 18:50:35.217285152 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKvR"
2154.json:Modify: 2023-07-23 18:50:35.225285189 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKwt"
2155.json:Modify: 2023-07-23 18:50:35.225285189 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKwt"
2156.json:Modify: 2023-07-23 18:50:40.225307947 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKyL"
2157.json:Modify: 2023-07-23 18:50:40.257308092 +0200 "$JS.ACK.md-test.cvopush.1.686002143.2144.1689967560702829747.1598644"
2158.json:Modify: 2023-07-23 18:50:45.217330668 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKyL"
2159.json:Modify: 2023-07-23 18:50:45.225330704 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKzn"
2160.json:Modify: 2023-07-23 18:50:45.225330704 +0200 "$JS.API.CONSUMER.MSG.NEXT.md-test.8PpJELfd_INBOX.kvg7TUNG17nQI1ZXqvXAOD.kvg7TUNG17nQI1ZXqvXKzn"
2161.json:Modify: 2023-07-23 18:50:49.825351641 +0200 "$JS.ACK.md-test.cvopush.1.686002144.2145.1689967560702830614.1598643"
2162.json:Modify: 2023-07-23 18:50:49.833351678 +0200 "$JS.ACK.md-test.cvopush.1.686002145.2146.1689967560702831707.1598642"
cvoica commented 1 year ago

I have a question related to the "Waiting Pulls: 10 of maximum 20" below. This output was taken immediately before the data started flowing normally. In all earlier executions, nats con info fails with context deadline exceeded.

I do a Fetch of 100 (the first 2 fetches are always OK) and get a timeout after 5s (this timeout happens several times until data starts flowing, hence the 10 waiting pulls). Then I process the error (just logging, as it might be that the stream really did not get messages in the past 5s). Then I attempt the next Fetch (see the sketch after this paragraph). With this pattern I expect to see only one Fetch active, and this is indeed the case when messages are flowing OK. As you can see in the output below, briefly before the messages start flowing we see 10 pulls in the waiting list.
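A minimal sketch of that retry loop (assuming a pull subscription created as in the earlier snippet; imports and setup omitted):

```go
// fetchLoop keeps fetching batches of 100, logs timeouts and retries, and
// (with AckPolicy All) acks only the last message of each batch.
func fetchLoop(sub *nats.Subscription) {
	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
		if err != nil {
			// A timeout may simply mean the stream had no new messages in
			// the last 5s, so just log it and issue the next Fetch.
			if errors.Is(err, nats.ErrTimeout) {
				log.Printf("fetch timed out: %v", err)
				continue
			}
			log.Printf("fetch failed: %v", err)
			return
		}
		if len(msgs) > 0 {
			if err := msgs[len(msgs)-1].Ack(); err != nil {
				log.Printf("ack failed: %v", err)
			}
		}
	}
}
```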

I investigated the nats-server consumer.go code, but I'm slow as I'm still learning. I'm trying to understand why, in my case, the o.waiting list still keeps the Fetch request even after it times out. Perhaps this helps to narrow down the investigation.

$ nats con info md-test sGd5PWmI
Information for Consumer md-test > sGd5PWmI created 2023-07-25T10:15:58+02:00

Configuration:

           Pull Mode: true
      Filter Subject: md-test
      Deliver Policy: From Sequence 1100000500
          Ack Policy: All
            Ack Wait: 30s
       Replay Policy: Instant
     Max Ack Pending: 300
   Max Waiting Pulls: 20

Cluster Information:

                Name: env-sim
              Leader: nats-a

State:

   Last Delivered Message: Consumer sequence: 800 Stream sequence: 1,100,001,299 Last delivery: 0.01s ago
     Acknowledgment floor: Consumer sequence: 500 Stream sequence: 1,100,000,999 Last Ack: 0.01s ago
         Outstanding Acks: 300 out of maximum 300
     Redelivered Messages: 0
     Unprocessed Messages: 50,182,503
            Waiting Pulls: 10 of maximum 20

cvoica commented 1 year ago

@Jarema I'm currently suspecting that the rotation of the stream is making this behavior visible. I configured a stream with 4h rotation and only 5 GB. The issue is easily reproducible like this, at least in our NATS cluster. The Fetch operations that work are followed by a "stall" period of a couple of seconds, depending on how many full rotations it has had. I created that stream on Monday; it had small stalls after 8h, but they grew to almost 3s by Wednesday.

During these 3s even "nats con info" does not work; it takes longer. Perhaps you have some suggestion on how to approach this investigation to confirm (or rule out) this?

Jarema commented 1 year ago

What version of NATS server are you running?

cvoica commented 1 year ago

Version: 2.9.19. I installed 2.9.20 and it's the same behavior.

cvoica commented 1 year ago

During my test I strace the nats-server and see that it is waiting for a mutex during the period when I do not get data from it.

16:47:57.228729 futex(0xc000142148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
16:48:01.106358 epoll_pwait(3, [], 128, 0, NULL, 151887781672611) = 0

I do not have time now to continue my analysis, but I will come back to this issue once I have more time.

cvoica commented 1 year ago

@Jarema I had time to check this issue again. Running more tests, I noticed that it works fine if I use AckNone or AckExplicit. The question I have now is whether AckAll is supposed to work with a pull consumer. (A sketch of the two ack patterns I compared is below.)
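For clarity, the two ack patterns I compared look roughly like this (same pull-subscription setup as in the earlier snippets, only the consumer's AckPolicy differs; setup and error handling omitted):

```go
// With AckPolicy All: acking the last message of a batch acknowledges
// everything delivered before it.
msgs, _ := sub.Fetch(5000, nats.MaxWait(5*time.Second))
if len(msgs) > 0 {
	_ = msgs[len(msgs)-1].Ack()
}

// With AckPolicy Explicit: every fetched message is acked individually.
msgs, _ = sub.Fetch(5000, nats.MaxWait(5*time.Second))
for _, m := range msgs {
	_ = m.Ack()
}
```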