nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

MQTT client not receiving messages #5282

Closed: fly2F closed this issue 6 months ago

fly2F commented 7 months ago

Observed behavior

I am having trouble receiving messages with my MQTT client.

Checking the durable consumer shows the Consumer sequence in the Last Delivered Message increasing, while the acknowledgment floor stays at 0.

Workarounds:

Expected behavior

The client should receive the message.

Server and client version

Server: NATS 2.10.7, single node
Client: Java org.eclipse.paho.client.mqttv3

Host environment

Host: Amazon Linux 2, kernel 5.10.184-175.749.amzn2.x86_64
Docker: 20.10.25

Steps to reproduce

  1. Connect the MQTT client to the server.
  2. Check the connection status with connz. The connection status is normal.
  3. Check the connz sub list. It contains a topic and a delivery subject starting with $MQTT.sub.
  4. Check the connz delivery subject. It is consistent with the records in $MQTT.sess and the configuration in the corresponding durable consumer.
  5. Check the out_bytes and out_msg with connz. They are increasing.
  6. Set the log level to Trace. You can see the MQTT client going online and starting the subscription, but you cannot see any PUBLISH records sent to the client.
  7. Check the durable consumer. The Consumer sequence in Delivered Message is increasing, but ack is always 0.
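
For reference, a rough command-line sketch of the checks in the steps above, assuming the default monitoring port 8222 and the nats CLI; the consumer name is a placeholder (the real one can be found via the $MQTT.sess records):

  # Steps 2-5: connection status, subscription list, and out_msgs/out_bytes counters
  curl -s 'http://localhost:8222/connz?subs=1'

  # Step 7: the durable consumer that backs the MQTT session
  nats consumer info '$MQTT_msgs' '<durable-consumer-name>'
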
derekcollison commented 7 months ago

If you could, please make sure you are running the latest server version, which currently is 2.10.12.

fly2F commented 7 months ago

If you could, please make sure you are running the latest server version, which currently is 2.10.12.

Thank you very much for your reply!

I had previously upgraded NATS to 2.10.12, and I have not been able to reproduce the issue on 2.10.12, nor on 2.10.7 after a restart.

I have not found a way to trigger the issue.

fly2F commented 7 months ago

Here are the details of the consumer I am having problems with. The Consumer sequence on this consumer looks strange.

❯ nats consumer info '$MQTT_msgs' 'HgSLhmro_UKKHlQPcxPpTI6R1YNqdx3'
Information for Consumer $MQTT_msgs > HgSLhmro_UKKHlQPcxPpTI6R1YNqdx3 created 2024-03-29T12:47:12+08:00

Configuration:

            Durable Name: HgSLhmro_UKKHlQPcxPpTI6R1YNqdx3
        Delivery Subject: $MQTT.sub.UKKHlQPcxPpTI6R1YNqdst
          Filter Subject: $MQTT.msgs.localserver.shopId.78880634
          Deliver Policy: New
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
         Max Ack Pending: 1,024
            Flow Control: false

State:

  Last Delivered Message: Consumer sequence: 541,868 Stream sequence: 563 Last delivery: 9.04s ago
    Acknowledgment Floor: Consumer sequence: 0 Stream sequence: 0
        Outstanding Acks: 42 out of maximum 1,024
    Redelivered Messages: 42
    Unprocessed Messages: 0
         Active Interest: Active
fly2F commented 7 months ago

Here is the relevant connz information. out_bytes and out_msg are increasing.

[screenshot: connz output showing increasing out_bytes and out_msgs]
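
A rough way to pull the same per-connection counters directly from the monitoring endpoint, assuming the default port 8222 and that jq is available:

  curl -s 'http://localhost:8222/connz?subs=1' \
    | jq '.connections[] | {cid, name, out_msgs, out_bytes, in_msgs}'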

fly2F commented 7 months ago

I noticed this issue https://github.com/nats-io/nats-server/issues/4736. Although I am using MQTT, my consumer's behavior is very similar to the description in this comment: https://github.com/nats-io/nats-server/issues/4736#issuecomment-1955690764. It is possible that the two issues are related.

levb commented 7 months ago

@fly2F I am going to pursue this from the MQTT angle, let @derekcollison comment on whether there may be an underlying JetStream issue. To make sure I understand this correctly, sounds like your client was first receiving (all) MQTT messages, until it stopped? (and then one of your workarounds would make the messages flow again?)

Are the messages you mention QOS1 or QOS2 MQTT messages?

fly2F commented 7 months ago

@fly2F I am going to pursue this from the MQTT angle, let @derekcollison comment on whether there may be an underlying JetStream issue. To make sure I understand this correctly, sounds like your client was first receiving (all) MQTT messages, until it stopped? (and then one of your workarounds would make the messages flow again?)

Are the messages you mention QOS1 or QOS2 MQTT messages?

Thank you for your reply.

I am using MQTT QOS 1 and CleanSession is false.

Some of our MQTT clients reported the issue almost on the same day.

They can receive messages in the morning, but not in the afternoon.

We turned on the Trace level log and observed that the server did not send PUBLISH to the MQTT client, but the out_msgs in connz increased.
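
As an illustrative stand-in for that client setup (the real client is Java Paho; mosquitto_sub is used here only for brevity, and the host, port, and client ID are placeholders, with the topic taken from the filter subject shown earlier), a QoS 1 subscriber with CleanSession disabled looks roughly like:

  # -q 1: QoS 1, -c: disable clean session (persistent session), -i: fixed client ID
  mosquitto_sub -h nats-host -p 1883 -q 1 -c -i shop-client-1 \
    -t 'localserver/shopId/78880634' -v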

fly2F commented 7 months ago

@fly2F I am going to pursue this from the MQTT angle, let @derekcollison comment on whether there may be an underlying JetStream issue. To make sure I understand this correctly, sounds like your client was first receiving (all) MQTT messages, until it stopped? (and then one of your workarounds would make the messages flow again?)

Are the messages you mention QOS1 or QOS2 MQTT messages?

I have also observed that the Consumer sequence in the Last Delivered Message of the consumer information is increasing by 16 every 30 seconds (Ack Wait).

These observations are made while the client is online.

levb commented 7 months ago

@fly2F Thank you, I am trying to reproduce the issue now. Any more MQTT-related context/estimates you can provide? How many clients are there, do they re-connect often/at all prior to the issue, traffic # of messages and size? Anything "interesting" about the subscriptions? Wildcards used?

fly2F commented 7 months ago

@levb Thank you.

I have two single-node NATS servers that experienced problems with MQTT clients not receiving messages within 2 days. These two servers are used for testing. One has about 50 clients and the other has about 30. They do not use wildcards and have very little traffic. The number of messages sent is roughly between 200 and 500. I also have three 3-node NATS clusters with more clients than these two, and they have never had this problem.

I restarted one server and upgraded the other to 2.10.12. They have run for 3-5 days and the problem has not recurred. I have not found a way to reproduce it.

Since we use fixed Client IDs, clients with the same Client ID are often kicked offline repeatedly on these two test servers. However, I specifically tested having two clients with the same Client ID preempt each other for two consecutive days, and I could not reproduce this problem.

The message size is small, about 100 bytes.
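
A rough sketch of that takeover test, again with mosquitto_sub standing in for the Java client (host, port, and client ID are placeholders): run this loop in two terminals so the shared Client ID keeps preempting the existing session:

  while true; do
    # Reconnect with the same client ID; the new connection takes over the
    # persistent session and the server kicks the previous one offline.
    timeout 10 mosquitto_sub -h nats-host -p 1883 -q 1 -c -i shared-client-id \
      -t 'localserver/shopId/78880634'
    sleep 1
  done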

levb commented 7 months ago

Since we use fixed Client IDs, clients with the same Client ID are often kicked offline repeatedly on these two test servers. However, I specifically tested having two clients with the same Client ID preempt each other for two consecutive days, and I could not reproduce this problem.

Thanks! We did fix a number of "mechanical" (reliability) issues in 2.10.12 relative to the early 2.10 releases, so please monitor on your end, and I will add a long-running test to reflect your use case.

fly2F commented 7 months ago

@levb Thanks!

The strange thing is that while the MQTT client connection cannot receive messages, out_msgs and out_bytes in connz keep increasing, yet the Trace-level log contains no PUBLISH entries for that client.

Could we record some MQTT metrics at PUBLISH time (such as in mqttEnqueuePublishMsgTo) so they can be compared with the out_msgs metric?

fly2F commented 6 months ago

I have upgraded the version and I cannot reproduce the problem.

@levb @derekcollison Thanks!