nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.87k stars 1.4k forks source link

Reading messages as consumer sometimes slows down suddenly. #5659

Open yuzhou-nj opened 3 months ago

yuzhou-nj commented 3 months ago

Observed behavior

Hi, I have developed a "consumer program", which reads messages from stream in pull mode. A maximum of 200 messages can be read at a time, and then sent to nginx through http+post messages. It is found that the number of messages read sometimes decreases and can be recovered automatically after a period of time. This problem starts to occur at nats-server v2.10.16.

There is no abnormal log in nat-server.log. When the number of read messages decreases, I can see through the nats stream state XXX that there are messages being deleted.

Thank you for your help.

Expected behavior

The consumer should read messages stably and continuously.

Server and client version

./nats-server -v

nats-server: v2.10.16

Host environment

CentOS 8

Steps to reproduce

stream info:

# nats s info user_cache
Information for Stream user_cache created 2023-06-21 17:05:11

              Subjects: user_cache.>
              Replicas: 3
               Storage: File

Options:

             Retention: Interest
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: 20,000,000
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: 7d0h0m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats_cluster
                Leader: nats0
               Replica: nats1, current, seen 891ms ago
               Replica: nats2, current, seen 891ms ago

State:

              Messages: 5,000,000
                 Bytes: 5.0 GiB
        First Sequence: 60,277,152 @ 2024-07-16 19:06:53 UTC
         Last Sequence: 65,277,151 @ 2024-07-16 19:08:18 UTC
      Active Consumers: 1
    Number of Subjects: 1

consumer info:

# nats c report user_cache
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                     Consumer report for user_cache with 1 consumers                                         │
├───────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬──────────────────┬────────────┬──────────────────────┤
│ Consumer  │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed      │ Ack Floor  │ Cluster              │
├───────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼──────────────────┼────────────┼──────────────────────┤
│ natssync2 │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 5,000,000 / 100% │ 60,277,151 │ nats0, nats1*, nats2 │
╰───────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴──────────────────┴────────────┴──────────────────────╯

In nats-server v2.10.14 and earlier versions, the number of messages consumed per second is stable.

2024/07/16 19:05:32.201262 natssync2 user_cache: LogStat totalsucc=4706832 totalfail=0 succ=12588 fail=0 opertkey_num=0
2024/07/16 19:05:33.201295 natssync2 user_cache: LogStat totalsucc=4719800 totalfail=0 succ=12968 fail=0 opertkey_num=0
2024/07/16 19:05:34.201825 natssync2 user_cache: LogStat totalsucc=4732200 totalfail=0 succ=12400 fail=0 opertkey_num=0
2024/07/16 19:05:35.202032 natssync2 user_cache: LogStat totalsucc=4744800 totalfail=0 succ=12600 fail=0 opertkey_num=0
2024/07/16 19:05:36.202148 natssync2 user_cache: LogStat totalsucc=4757400 totalfail=0 succ=12600 fail=0 opertkey_num=0
2024/07/16 19:05:37.202182 natssync2 user_cache: LogStat totalsucc=4770290 totalfail=0 succ=12890 fail=0 opertkey_num=0
2024/07/16 19:05:38.202221 natssync2 user_cache: LogStat totalsucc=4782800 totalfail=0 succ=12508 fail=0 opertkey_num=0
2024/07/16 19:05:39.205999 natssync2 user_cache: LogStat totalsucc=4797600 totalfail=0 succ=14800 fail=0 opertkey_num=0
2024/07/16 19:05:40.206041 natssync2 user_cache: LogStat totalsucc=4812277 totalfail=0 succ=14677 fail=0 opertkey_num=0
2024/07/16 19:05:41.207710 natssync2 user_cache: LogStat totalsucc=4827839 totalfail=0 succ=15559 fail=0 opertkey_num=0
2024/07/16 19:05:42.208057 natssync2 user_cache: LogStat totalsucc=4840280 totalfail=0 succ=12439 fail=0 opertkey_num=0
2024/07/16 19:05:43.208128 natssync2 user_cache: LogStat totalsucc=4853222 totalfail=0 succ=12942 fail=0 opertkey_num=0
2024/07/16 19:05:44.208224 natssync2 user_cache: LogStat totalsucc=4865400 totalfail=0 succ=12177 fail=0 opertkey_num=0
2024/07/16 19:05:45.208259 natssync2 user_cache: LogStat totalsucc=4877846 totalfail=0 succ=12446 fail=0 opertkey_num=0
2024/07/16 19:05:46.209216 natssync2 user_cache: LogStat totalsucc=4889798 totalfail=0 succ=11950 fail=0 opertkey_num=0

On nats-server v2.10.16 and 2.10.17, the number of consumed messages fluctuates. (look 19:18:50~55, succ=)

2024/07/16 19:18:43.280567 natssync2  user_cache: LogStat totalsucc=1175348 totalfail=0 succ=11148 fail=0 opertkey_num=0
2024/07/16 19:18:44.280861 natssync2  user_cache: LogStat totalsucc=1187185 totalfail=0 succ=11837 fail=0 opertkey_num=0
2024/07/16 19:18:45.280901 natssync2  user_cache: LogStat totalsucc=1199124 totalfail=0 succ=11940 fail=0 opertkey_num=0
2024/07/16 19:18:46.281127 natssync2  user_cache: LogStat totalsucc=1211669 totalfail=0 succ=12544 fail=0 opertkey_num=0
2024/07/16 19:18:47.281431 natssync2  user_cache: LogStat totalsucc=1223600 totalfail=0 succ=11929 fail=0 opertkey_num=0
2024/07/16 19:18:48.283433 natssync2  user_cache: LogStat totalsucc=1234801 totalfail=0 succ=11201 fail=0 opertkey_num=0
2024/07/16 19:18:49.283868 natssync2  user_cache: LogStat totalsucc=1239000 totalfail=0 succ=4199 fail=0 opertkey_num=0
2024/07/16 19:18:50.284069 natssync2  user_cache: LogStat totalsucc=1239200 totalfail=0 succ=200 fail=0 opertkey_num=0
2024/07/16 19:18:51.286809 natssync2  user_cache: LogStat totalsucc=1239600 totalfail=0 succ=400 fail=0 opertkey_num=0
2024/07/16 19:18:52.287828 natssync2  user_cache: LogStat totalsucc=1239800 totalfail=0 succ=200 fail=0 opertkey_num=0
2024/07/16 19:18:53.288306 natssync2  user_cache: LogStat totalsucc=1240600 totalfail=0 succ=800 fail=0 opertkey_num=0
2024/07/16 19:18:54.288766 natssync2  user_cache: LogStat totalsucc=1241200 totalfail=0 succ=600 fail=0 opertkey_num=0
2024/07/16 19:18:55.289730 natssync2  user_cache: LogStat totalsucc=1245800 totalfail=0 succ=4600 fail=0 opertkey_num=0
2024/07/16 19:18:56.289902 natssync2  user_cache: LogStat totalsucc=1257400 totalfail=0 succ=11600 fail=0 opertkey_num=0
2024/07/16 19:18:57.289951 natssync2  user_cache: LogStat totalsucc=1269230 totalfail=0 succ=11830 fail=0 opertkey_num=0
2024/07/16 19:18:58.289999 natssync2  user_cache: LogStat totalsucc=1281459 totalfail=0 succ=12227 fail=0 opertkey_num=0

Thank you very much!

neilalexander commented 3 months ago

Please post the result of nats consumer info user_cache natssync2.

yuzhou-nj commented 3 months ago

Especially when messages are published and consumed at the same time, consumers usually cannot read messages or rarely read messages.

I published messages like this:

nats bench "user_cache.default" --pub 3 --size="1024" --msgs=3000000 

The log is:

2024/07/16 20:19:42.761793 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=21.021926ms
2024/07/16 20:19:42.789626 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=20.166502ms
2024/07/16 20:19:42.814555 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=20.083135ms
2024/07/16 20:19:42.839620 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=20.444267ms
2024/07/16 20:19:42.865367 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=21.316913ms
2024/07/16 20:19:42.890803 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=21.080772ms
2024/07/16 20:19:43.117156 user_cache: LogStat totalsucc=3224355 totalfail=7380 succ=6185 fail=0 opertkey_num=0
2024/07/16 20:19:44.118286 user_cache: LogStat totalsucc=3224355 totalfail=7386 succ=0 fail=6 opertkey_num=0  <-- No message is read. In fact, there are many messages.
2024/07/16 20:19:45.118401 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=15 opertkey_num=0
2024/07/16 20:19:46.118867 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:47.119556 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:48.119782 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:49.120147 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:50.120357 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:51.120815 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:52.121587 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:53.121727 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:54.122866 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:54.786908 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=1.891032717s  <-- wait too long
2024/07/16 20:19:54.805761 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=11.227657ms
2024/07/16 20:19:55.123079 user_cache: LogStat totalsucc=3224753 totalfail=7401 succ=398 fail=0 opertkey_num=0
2024/07/16 20:19:56.124053 user_cache: LogStat totalsucc=3224753 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:57.125054 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=2 opertkey_num=0
2024/07/16 20:19:58.126125 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:59.126320 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:00.126927 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:01.127135 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:02.127650 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:03.127850 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:04.128093 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:05.129193 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:06.130280 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:07.130882 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:08.131403 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:09.131953 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:10.132468 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:11.132621 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:11.487455 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=6.674830837s  <-- wait too long
2024/07/16 20:20:11.506314 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=9.784367ms
2024/07/16 20:20:11.526660 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=13.911119ms
2024/07/16 20:20:11.545752 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=9.892692ms
yuzhou-nj commented 3 months ago

nats consumer info user_cache natssync2

# nats c info user_cache natssync2      
Information for Consumer user_cache > natssync2 created 2024-07-16T16:26:43+08:00

Configuration:

                    Name: natssync2
               Pull Mode: true
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
         Max Ack Pending: 1,000
       Max Waiting Pulls: 512

Cluster Information:

                    Name: nats_cluster
                  Leader: nats2
                 Replica: nats0, current, seen 78ms ago
                 Replica: nats1, current, seen 78ms ago

State:

  Last Delivered Message: Consumer sequence: 33,008,777 Stream sequence: 65,277,151
    Acknowledgment Floor: Consumer sequence: 33,008,777 Stream sequence: 65,277,151
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 0 of maximum 512

Thank you!

neilalexander commented 3 months ago

Is the publishing rate more stable on 2.10.18?

kozlovic commented 3 months ago

@yuzhou-nj I am not dismissing that you report a difference in consumer rate between several versions, but saying that the consumer rate is affected, especially when publishing is no surprise. You are adding extra load to the server, especially that you are using the bench tool that does not use JetStream and sends NATS messages as fast as it can.

yuzhou-nj commented 3 months ago

When there are a large number of messages in the stream, just reading and acknowledging the messages, even if no new messages are being inserted, this issue still exists. It seems that because the stream was deleting a large number of messages at that time, the client was unable to read the messages or the number of messages read decreased

yuzhou-nj commented 3 months ago

Is the publishing rate more stable on 2.10.18?

The speed of reading messages is unstable, not publish. I also tried nats-server v2.10.18, and the issue still exists; however, versions v2.10.5 to v2.10.14 do not have this problem.

jnmoyne commented 3 months ago

Note that you should not publish to JetStream with nats bench unless using the --js flag as well (otherwise you are doing Core NATS publications, which could potentially overwhelm the nats-server in charge of persisting the messages for that stream).

kozlovic commented 3 months ago

@yuzhou-nj I wonder if this issue is similar to https://github.com/nats-io/nats-server/issues/5702. I have investigated and reported my findings to rest of the eng team.

kozlovic commented 3 months ago

@yuzhou-nj You may want to re-run your tests with a build of the nats-server that contains https://github.com/nats-io/nats-server/pull/5719, which should be available in the nightly tonight (PST).