Open yuzhou-nj opened 3 months ago
Please post the result of nats consumer info user_cache natssync2.
Especially when messages are published and consumed at the same time, the consumer often reads few messages or none at all.
I published messages like this:
nats bench "user_cache.default" --pub 3 --size="1024" --msgs=3000000
The log is:
2024/07/16 20:19:42.761793 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=21.021926ms
2024/07/16 20:19:42.789626 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=20.166502ms
2024/07/16 20:19:42.814555 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=20.083135ms
2024/07/16 20:19:42.839620 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=20.444267ms
2024/07/16 20:19:42.865367 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=21.316913ms
2024/07/16 20:19:42.890803 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=21.080772ms
2024/07/16 20:19:43.117156 user_cache: LogStat totalsucc=3224355 totalfail=7380 succ=6185 fail=0 opertkey_num=0
2024/07/16 20:19:44.118286 user_cache: LogStat totalsucc=3224355 totalfail=7386 succ=0 fail=6 opertkey_num=0 <-- No message is read. In fact, there are many messages.
2024/07/16 20:19:45.118401 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=15 opertkey_num=0
2024/07/16 20:19:46.118867 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:47.119556 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:48.119782 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:49.120147 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:50.120357 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:51.120815 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:52.121587 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:53.121727 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:54.122866 user_cache: LogStat totalsucc=3224355 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:54.786908 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=1.891032717s <-- wait too long
2024/07/16 20:19:54.805761 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=11.227657ms
2024/07/16 20:19:55.123079 user_cache: LogStat totalsucc=3224753 totalfail=7401 succ=398 fail=0 opertkey_num=0
2024/07/16 20:19:56.124053 user_cache: LogStat totalsucc=3224753 totalfail=7401 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:57.125054 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=2 opertkey_num=0
2024/07/16 20:19:58.126125 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:19:59.126320 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:00.126927 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:01.127135 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:02.127650 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:03.127850 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:04.128093 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:05.129193 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:06.130280 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:07.130882 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:08.131403 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:09.131953 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:10.132468 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:11.132621 user_cache: LogStat totalsucc=3224753 totalfail=7403 succ=0 fail=0 opertkey_num=0
2024/07/16 20:20:11.487455 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=6.674830837s <-- wait too long
2024/07/16 20:20:11.506314 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=9.784367ms
2024/07/16 20:20:11.526660 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=13.911119ms
2024/07/16 20:20:11.545752 user_cache consumer=natssync2 : JsReadMessage get 200 msgs, readwait=9.892692ms
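To make stalls like the ones in this log easier to spot, the lines can be scanned mechanically. The following is a minimal sketch, assuming the two line shapes shown above (JsReadMessage lines carrying a Go-style readwait duration, and LogStat lines carrying a per-second succ= counter). The function names slow_reads and longest_stall and the 1-second threshold are illustrative choices, not part of the original program.

```python
import re

# Go-style durations as printed in the log, e.g. "21.021926ms" or "1.891032717s".
# Assumption: waits stay under one minute; compound forms such as "1m6.6s" are not
# matched by this pattern and such lines would be skipped.
_READWAIT = re.compile(r"readwait=([0-9.]+)(ns|µs|us|ms|s)\b")
_UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}

# Per-interval success counter on LogStat lines. The \b keeps "totalsucc=" from matching.
_SUCC = re.compile(r"LogStat .*?\bsucc=(\d+)")


def slow_reads(lines, threshold_s=1.0):
    """Return (timestamp, seconds) for every readwait longer than threshold_s."""
    hits = []
    for line in lines:
        m = _READWAIT.search(line)
        if not m:
            continue
        seconds = float(m.group(1)) * _UNIT_SECONDS[m.group(2)]
        if seconds > threshold_s:
            # The first two whitespace-separated fields are the date and the time.
            hits.append((" ".join(line.split()[:2]), seconds))
    return hits


def longest_stall(lines):
    """Return the longest run of consecutive LogStat lines that report succ=0."""
    longest = current = 0
    for line in lines:
        m = _SUCC.search(line)
        if not m:
            continue  # JsReadMessage and other lines do not break a run
        current = current + 1 if m.group(1) == "0" else 0
        longest = max(longest, current)
    return longest
```

Applied to the excerpt above, slow_reads should flag the two reads marked "wait too long" (about 1.89 s and 6.67 s), and longest_stall should report the 16 consecutive succ=0 intervals between 20:19:56 and 20:20:11.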
nats consumer info user_cache natssync2
# nats c info user_cache natssync2
Information for Consumer user_cache > natssync2 created 2024-07-16T16:26:43+08:00
Configuration:
Name: natssync2
Pull Mode: true
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 30.00s
Replay Policy: Instant
Max Ack Pending: 1,000
Max Waiting Pulls: 512
Cluster Information:
Name: nats_cluster
Leader: nats2
Replica: nats0, current, seen 78ms ago
Replica: nats1, current, seen 78ms ago
State:
Last Delivered Message: Consumer sequence: 33,008,777 Stream sequence: 65,277,151
Acknowledgment Floor: Consumer sequence: 33,008,777 Stream sequence: 65,277,151
Outstanding Acks: 0 out of maximum 1,000
Redelivered Messages: 0
Unprocessed Messages: 0
Waiting Pulls: 0 of maximum 512
Thank you!
Is the publishing rate more stable on 2.10.18?
@yuzhou-nj I am not dismissing your report of a difference in consumer rate between several versions, but it is no surprise that the consumer rate is affected, especially while publishing. You are adding extra load to the server, especially since the bench tool as you ran it does not use JetStream and sends NATS messages as fast as it can.
The issue persists even when no new messages are being published: when the stream holds a large number of messages, simply reading and acknowledging them is enough to trigger it. It seems that because the stream was deleting a large number of messages at the time, the client could not read messages, or read fewer of them.
Is the publishing rate more stable on 2.10.18?
It is the read rate that is unstable, not the publish rate. I also tried nats-server v2.10.18, and the issue still exists; however, versions v2.10.5 to v2.10.14 do not have this problem.
Note that you should not publish to JetStream with nats bench unless using the --js flag as well (otherwise you are doing Core NATS publications, which could potentially overwhelm the nats-server in charge of persisting the messages for that stream).
@yuzhou-nj I wonder if this issue is similar to https://github.com/nats-io/nats-server/issues/5702. I have investigated and reported my findings to the rest of the eng team.
@yuzhou-nj You may want to re-run your tests with a build of the nats-server that contains https://github.com/nats-io/nats-server/pull/5719, which should be available in the nightly tonight (PST).
Observed behavior
Hi, I have developed a consumer program that reads messages from a stream in pull mode. It reads at most 200 messages at a time and forwards them to nginx via HTTP POST. The number of messages read sometimes drops and then recovers on its own after a while. This problem first appears in nats-server v2.10.16.
There is nothing abnormal in nat-server.log. When the number of messages read drops, nats stream state XXX shows that messages are being deleted.
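For readers trying to reproduce this, the consumer program described above could be structured roughly as follows. This is only a sketch under stated assumptions: the real program is not shown in this issue, so fetch and deliver are hypothetical callables standing in for the JetStream pull request and the HTTP POST to nginx, and the log format only loosely imitates the readwait lines.

```python
import time


def consume(fetch, deliver, batch_size=200, max_batches=None, log=print):
    """Pull up to batch_size messages at a time, forward each one, and log the fetch wait.

    fetch(batch_size) -> list of messages (hypothetical stand-in for a JetStream pull)
    deliver(msg) -> bool, True on success (hypothetical stand-in for the HTTP POST)
    Returns the (succ, fail) totals once max_batches batches have been processed.
    """
    succ = fail = batches = 0
    while max_batches is None or batches < max_batches:
        start = time.monotonic()
        msgs = fetch(batch_size)
        # Time spent blocked in the fetch call: the analogue of "readwait" in the log.
        readwait = time.monotonic() - start
        log(f"get {len(msgs)} msgs, readwait={readwait:.6f}s")
        for msg in msgs:
            if deliver(msg):
                succ += 1
            else:
                fail += 1
        batches += 1
    return succ, fail
```

Timing only the fetch call, separately from delivery, helps distinguish a slow server response from a slow downstream HTTP endpoint.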
Thank you for your help.
Expected behavior
The consumer should read messages stably and continuously.
Server and client version
./nats-server -v
nats-server: v2.10.16
Host environment
CentOS 8
Steps to reproduce
stream info:
consumer info:
In nats-server v2.10.14 and earlier versions, the number of messages consumed per second is stable.
On nats-server v2.10.16 and v2.10.17, the number of messages consumed fluctuates (see the succ= values from 19:18:50 to 19:18:55).
Thank you very much!