mqttjs / MQTT.js

The MQTT client for Node.js and the browser
Other
8.56k stars 1.42k forks source link

After mqtt client is reconnected, it is unable to continue publishing message #1727

Closed spencerfeng closed 5 months ago

spencerfeng commented 11 months ago

Hi there,

I am using the latest version of mqttjs (5.1.3) and this issue happens in old versions as well.

In my app, I received the offline event even when the mosquitto MQTT broker is up and running. After the mqttjs client goes through the process of offline -> closed -> reconnect -> connected. The mqttjs client can no longer publish messages again.

Is there a workaround?

app log: image

mqtt broker log: image

robertsLando commented 11 months ago

Can you provide a script that reproduces the issue?

jaketakula commented 10 months ago

Hey @robertsLando , we are having same issue in our react application in the fronend this time. the error in the chrome only shows it failed at createWebsocket with array buffer (sry I have not made screenshot for that). the only way to let it work is to close and re-open the chrome broswer in our case.

i think now reactjs and nodejs both has the same issue. Can i grab more attentions on this issue ? Thx.

Screenshot 2023-11-23 at 8 14 42 pm Screenshot 2023-11-23 at 8 14 58 pm
robertsLando commented 10 months ago

@jaketakula I need to see more details about the error and how to reproduce it. Also what version are you using?

jaketakula commented 9 months ago

@jaketakula I need to see more details about the error and how to reproduce it. Also what version are you using?

  • the mqttjs version being used is 5.1.3
robertsLando commented 9 months ago

this issue happens randomly in chrome broswer - roughly every1 or 2 weeks seen once.

you mean 1/2 weeks with the browser opened or what?

jaketakula commented 9 months ago

The browser never sends data to broker but keeps open and only receive msg from broker.

On Wed, 10 Jan 2024 at 7:02 pm, Daniel Lando @.***> wrote:

this issue happens randomly in chrome broswer - roughly every1 or 2 weeks seen once.

you mean 1/2 weeks with the browser opened or what?

— Reply to this email directly, view it on GitHub https://github.com/mqttjs/MQTT.js/issues/1727#issuecomment-1884358937, or unsubscribe https://github.com/notifications/unsubscribe-auth/BCFQK2J5BDWLWTKXWGOTA33YNZDIDAVCNFSM6AAAAAA6UXDO66VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBUGM2TQOJTG4 . You are receiving this because you were mentioned.Message ID: <mqttjs/MQTT .@.***>

robertsLando commented 8 months ago

Could be fixed by #1779 , someone could give a try to 5.3.5?

jaketakula commented 8 months ago

Thank you very much. I will bump the version now. Will update you later on.

On Wed, 24 Jan 2024 at 1:31 am, Daniel Lando @.***> wrote:

Could be fixed by #1779 https://github.com/mqttjs/MQTT.js/pull/1779 , someone could give a try to 5.3.5?

— Reply to this email directly, view it on GitHub https://github.com/mqttjs/MQTT.js/issues/1727#issuecomment-1906177784, or unsubscribe https://github.com/notifications/unsubscribe-auth/BCFQK2MTWWRAHR2OIE2T65TYP7CVXAVCNFSM6AAAAAA6UXDO66VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBWGE3TONZYGQ . You are receiving this because you were mentioned.Message ID: <mqttjs/MQTT .@.***>

robertsLando commented 8 months ago

@jaketakula Thanks! Any news?

jaketakula commented 8 months ago

It usually take 1 or 2 weeks to see the issue. So pls keep patient. Thx.

On Wed, 24 Jan 2024 at 6:26 pm, Daniel Lando @.***> wrote:

@jaketakula https://github.com/jaketakula Thanks! Any news?

— Reply to this email directly, view it on GitHub https://github.com/mqttjs/MQTT.js/issues/1727#issuecomment-1907537193, or unsubscribe https://github.com/notifications/unsubscribe-auth/BCFQK2OEEP6NCUB5RICMHEDYQCZTTAVCNFSM6AAAAAA6UXDO66VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBXGUZTOMJZGM . You are receiving this because you were mentioned.Message ID: <mqttjs/MQTT .@.***>

overflowz commented 8 months ago

any news on this? we have a similar issue so far, we did upgrade to the latest version but want to make sure before we deploy (iot devices) it won't happen again.

robertsLando commented 8 months ago

I didn't checked my own as I never faced this issue, dunno if @jaketakula has news (but I think he would have write if a bug happended). Recent changes fixed a very old bug in reconnect/keep alive that could have caused it BTW

overflowz commented 7 months ago

we deployed 10 IoT devices and slowly they're getting disconnected one by one without reconnecting. So the issue still persists.

Here's the config I'm using:

    this.#conn = mqtt.connect(url, {
      cert: CERT_PATH,
      key: KEY_PATH,
      ca: CA_PATH,
      protocolId: "MQTT",
      protocolVersion: 5,
      encoding: "binary",
      clean: false,
      clientId: 'specific-IoT-id',
      keepalive: 60,
      reconnectPeriod: 1000,
      connectTimeout: 30000,
      reschedulePings: false,
      queueQoSZero: true,
      resubscribe: true,
      manualConnect: false,
    });

I also tried to add timeout on publish (since when sending QoS 1 messages, it waits until delivery happens) but fails miserably. pseudo code:

await Promise.race([publish, timer]);
if timeout then client.reconnect();

but after I'm using reconnect() method, every publish request made throws "client disconnecting" error and can't recover from it.

EDIT: can you suggest what shall I do? workaround will also do cause we're in rush right now.

EDIT2: I forgot to run npm install :man_facepalming: I'll test again and get back as soon as I have news.

Yup, same thing.

robertsLando commented 7 months ago

@overflowz Please open a new bug issue and follow the steps in order to also attach DEBUG logs

overflowz commented 7 months ago

The problem is, it's hard to reproduce and you gotta wait for days or even weeks for it to trigger (this is for the reconnect). As for the reconnect() -- will do later today.

robertsLando commented 7 months ago

I understand that but it's hard for me to know what's going on here without more info,,, what you could do is to also patch the log function in client and print logs to a file so you don't loose them when this happens

AndreMaz commented 5 months ago

I'm not sure if it's related but I'm facing a similar problem.

Context My arch looks like this:

device -- Ethernet connection --> manager -- 5G connection --> MQTT broker
                                     |
                                     L______ 5G connection --> HTTP server

what happens is that every X hours the 5G modem goes down for a short period of time. When this happens I can see in the manager's logs that when it tries to communicate via HTTP it receives a EHOSTUNREACH error, which disappears when the 5G connection is back.

The problem is that when the connection is back the client.connected flips to true but the messages are not being sent.

My sender function looks something like this:

async function sendToBroker(topic, message) {
  if (!this.client.connected) {
    console.warn("Client is not connected, storing message");

    this.storage.store({ topic, message });

    return;
  }

  console.debug("Sending message to broker", message);
  await this.client.publishAsync(topic, message);
}

that is called like this:


async function send(ctx) {
    ctx.call("mqtt.sendToBroker", {
      topic: "topic",
      message: "message",
    }, 
    {
        timeout: 10000, // Throw an error if it takes more than 10 seconds
    });
}

In the logs, I can see Sending message to broker that is followed by a timeout error saying that sendToBroker() did not resolve in 10 seconds. I'm assuming that it gets stuck in the await this.client.publish(topic, message); line.

Here's how I create my client:

this.client = mqtt.connect(url, {
  username: token,
  cert: fs.readFileSync(certPath, "utf8"),
  rejectUnauthorized: false,
});

Question Any idea why this is happening? Is there a way to check if the connection is really up?

Might be vaguely related to: https://github.com/mqttjs/MQTT.js/issues/1825

robertsLando commented 5 months ago

@AndreMaz Could you create a full script that I can use to reproduce the issue? By checking the other issues it seems this happens only when working with tls?

AndreMaz commented 5 months ago

@robertsLando It will be difficult as it's a proprietary code but I'll try to create a repro example

By checking the other issues it seems this happens only when working with tls?

I've been using TLS since the beginning but can't say for sure if it's the source of the problem

overflowz commented 5 months ago

Can confirm, we're also using with tls, haven't tried otherwise.

robertsLando commented 5 months ago

I don't need your source code but a scripts that reproduces the issue. An easy one that connects to a broker with TLS (you can use hivemq public one https://www.hivemq.com/mqtt/public-mqtt-broker/) and then try to reproduce the disconnect and see if the problem happens. I tried last time without success and if I cannot reproduce it on my side it's hard to fix

AndreMaz commented 5 months ago

Yep yep, I know that you don't want it. I've created a simple script that mimics the data-flow but I can't reproduce the issue at home. I've been switching my laptop between WiFi, Ethernet, and 5G hotspot and so far no luck, ie, I did not see the timeout that I've mentioned.

overflowz commented 5 months ago

if it helps, since we updated to version 5.3.5, we're facing this issue less frequently, but it's still there. I believe these were the the relevant chages that could've affected it #1779

but in the latest release, there are fixes for the possible race condition again: #1848 but it's hard for us to keep updating many iot devices since we have to do a npm install where connection is very unstable.

these issues are really hard for us to test in a production environment and it also costs us a lot. is there any version that is considered stable so we can pick it for now until these issues gets fixed?

thank you!

AndreMaz commented 5 months ago

I also think that this is a regression that was introduced in the latest releases (don't know which one tho)

I'll have to check my previous releases and test them out. Will keep you guys updated

robertsLando commented 5 months ago

I'm sorry for the issues guys and I would like to help you but it's hard to guess what could be the root cause here, we should firstly try to find out an easy way to reproduce the issue somehow

overflowz commented 5 months ago

I'm sorry for the issues guys and I would like to help you but it's hard to guess what could be the root cause here, we should firstly try to find out an easy way to reproduce the issue somehow

No worries at all! We're devs too and we understand the frustration :-) Just to be clear, I didn't mean it as a "fix it asap", I do appreciate the work the maintainers are doing, truly. I was asking if there are any old version(s) I could try so I won't get pressured by the company to fix a problem that is hard to explain why it does not work sometimes xD

Regardless, I really do wish to help somehow too, but it's really hard to reproduce :( We're currently running the code on 50 devices and it might happen once a week, two weeks or even months per one device, it's really unpredictable and random.

robertsLando commented 5 months ago

I'm the only maintainer here unfortunately and I started because I use this package in almost all my projects and I wanted to help keeping it maintained (as it was almost died)

Based on the first comment of this issue seems this was happening also with older versions so I dunno, I'm sorry :(

AndreMaz commented 5 months ago

Hey @robertsLando no need to apologize. Huge kudos to you for what you're doing :muscle:

I'll have to check my previous releases and test them out.

Just checked, I went from 5.3.3 -> 5.5.2 -> 5.5.4

I'm rolling back to 5.3.3 and going to let it run for a while. Will keep you updated

robertsLando commented 5 months ago

@AndreMaz Thanks!

overflowz commented 5 months ago

Hey @robertsLando no need to apologize. Huge kudos to you for what you're doing 💪

I'll have to check my previous releases and test them out.

Just checked, I went from 5.3.3 -> 5.5.2 -> 5.5.4

I'm rolling back to 5.3.3 and going to let it run for a while. Will keep you updated

if it helps, we were using 5.3.3 previously and the issue was appearing more frequently than now (about 3-4 times a week).

robertsLando commented 5 months ago

Is there a way for you guys to override the log function in client constructor and store logs somewhere? It would be very helpful to see the difference in logs between a normal reconnect and one that causes the stop of publishing

AndreMaz commented 5 months ago

Yep, I can. I'm going to try to collect the logs

robertsLando commented 5 months ago

Better if using latest version

AndreMaz commented 5 months ago

Yep yep. No problem

overflowz commented 5 months ago

sadly not for me :( we're using remote iot devices on 3g, which also are limited on network bandwidth. is there a way to turn on logging only for connect/reconnects (or anything related to connections only)?

robertsLando commented 5 months ago

is there a way to turn on logging only for connect/reconnects (or anything related to connections only)?

Nope unfortunately

AndreMaz commented 5 months ago

Hey @robertsLando Is this enough info to locate the issue?

[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: publish :: message `%s` to topic `%s` {"ts":1714391751051,"values":{"cpuCores":8}} v3/telemetry
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: publish :: qos 0
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: MqttClient:publish: packet cmd: %s publish
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  start mqttjs_1d7221de
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: _writePacket :: packet: %O { cmd: 'publish', topic: 'v3/telemetry', payload: '{"ts":1714391751051,"values":{"cpuCores":8}}', qos: 0, retain: false, messageId: 0, dup: false }
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: emitting `packetsend`
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: writing to stream
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: writeToStream result %s false
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: handle events on `drain` once through callback.
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  end mqttjs_1d7221de
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: publish :: message `%s` to topic `%s` {"cpuCores":8} v3/attributes
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: publish :: qos 0
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: MqttClient:publish: packet cmd: %s publish
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  start mqttjs_1d7221de
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: packet: %O { cmd: 'publish', topic: 'v3/attributes', payload: '{"cpuCores":8}', qos: 0, retain: false, messageId: 0, dup: false }
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: emitting `packetsend`
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: writing to stream
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: writeToStream result %s false
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: handle events on `drain` once through callback.
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  end mqttjs_1d7221de
[2024-04-29T11:55:51.285Z] WARN  BROKER: Request 'mqtt.telemetry' timed out. { requestID: 'd0c3564e-d865-44f5-ad6e-efac7eecfd66', nodeID: 'misty-violet-6717', timeout: 10000 }
[2024-04-29T11:55:51.286Z] WARN  BROKER: Request 'mqtt.sendAttributes' timed out. { requestID: '25d2dd35-f844-4399-b87a-48d02e09e9de', nodeID: 'misty-violet-6717', timeout: 10000 }

The last 2 lines mean that my telemetry action

async function telemetry(topic, message) {
  if (!this.client.connected) {
    this.storage.store({ topic, message });
    return;
  }

  await this.client.publishAsync(topic, message);
}

did not resolve in 10sec

robertsLando commented 5 months ago

@AndreMaz I don't see the reconnection in the logs, also I see the first packet is sent correctly and then the second one nope but nothing that tells the why

@mcollina Any clue what could cause this issue? It seems to happen only with TLS

AndreMaz commented 5 months ago

@robertsLando this is interesting. The logs that I've shared previously were from 11:55:51 but the re-connection logic only appears at 12:01:22

[2024-04-29T12:01:22.745Z] DEBUG MQTT-CLIENT: streamErrorHandler :: error read ENETUNREACH
[2024-04-29T12:01:22.745Z] DEBUG MQTT-CLIENT: streamErrorHandler :: emitting error
[2024-04-29T12:01:22.746Z] ERROR MQTT-SERVICE: [mqtt] Client error. Error: read ENETUNREACH
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  errno: -101,
  code: 'ENETUNREACH',
  syscall: 'read'
}
[2024-04-29T12:01:22.748Z] DEBUG MQTT-CLIENT: (%s)stream :: on close mqttjs_1d7221de
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: _flushVolatile :: deleting volatile messages from the queue and setting their callbacks as error function
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: stream: emit close to MqttClient
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: close :: connected set to `false`
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: close :: clearing connackTimer
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: close :: destroy ping timer
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: close :: calling _setupReconnect
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: _setupReconnect :: emit `offline` state
[2024-04-29T12:01:22.750Z] INFO  MQTT-SERVICE: MQTT client is offline.
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: _setupReconnect :: set `reconnecting` to `true`
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: _setupReconnect :: setting reconnectTimer for %d ms 1000
[2024-04-29T12:01:22.750Z] INFO  MQTT-SERVICE: Disconnected from MQTT endpoint.
[2024-04-29T12:01:23.377Z] WARN  BROKER: Request 'mqtt.sendAttributes' is timed out. { requestID: '852d8692-9108-4ff1-b417-1ff969f0f43c', nodeID: ', timeout: 10000 }

and then again at 12:16:01

[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _flushVolatile :: deleting volatile messages from the queue and setting their callbacks as error function
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: stream: emit close to MqttClient
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: connected set to `false`
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: clearing connackTimer
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: destroy ping timer
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: calling _setupReconnect
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _setupReconnect :: emit `offline` state
[2024-04-29T12:16:01.374Z] INFO  THINGSBOARD: MQTT client is offline.
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _setupReconnect :: set `reconnecting` to `true`
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _setupReconnect :: setting reconnectTimer for %d ms 1000
[2024-04-29T12:16:01.374Z] INFO  THINGSBOARD: Disconnected from MQTT endpoint.
[2024-04-29T12:16:02.375Z] DEBUG MQTT-CLIENT: reconnectTimer :: reconnect triggered!
[2024-04-29T12:16:02.376Z] DEBUG MQTT-CLIENT: _reconnect: emitting reconnect to client
[2024-04-29T12:16:02.376Z] INFO  THINGSBOARD: Reconnecting to MQTT endpoint...
[2024-04-29T12:16:02.376Z] DEBUG MQTT-CLIENT: _reconnect: calling connect
[2024-04-29T12:16:02.378Z] DEBUG MQTT-CLIENT: connect :: calling method to clear reconnect
[2024-04-29T12:16:02.379Z] DEBUG MQTT-CLIENT: _clearReconnect : clearing reconnect timer
[2024-04-29T12:16:02.379Z] DEBUG MQTT-CLIENT: connect :: using streamBuilder provided to client to create stream
[2024-04-29T12:16:02.381Z] DEBUG MQTT-CLIENT: connect :: pipe stream to writable stream
[2024-04-29T12:16:02.381Z] DEBUG MQTT-CLIENT: connect: sending packet `connect`
robertsLando commented 5 months ago

this is interesting. The logs that I've shared previously were from 11:55:51 but the re-connection logic only appears at 12:01:22

It could be that the connection was already broken but not closed yet so the write to stream happened but you didn't received any response back. Are you using a MQTT packet store? What is your keepalive in options?

AndreMaz commented 5 months ago

Are you using a MQTT packet store?

No

What is your keepalive in options?

Everything is default

Here's how I create my client:

this.client = mqtt.connect(url, {
  username: token,
  cert: fs.readFileSync(certPath, "utf8"),
  rejectUnauthorized: false,
});
robertsLando commented 5 months ago

@AndreMaz I don't see anything strange on your logs. The logic is correct:

Then it would be interesting to see what happens next when packets are not sent anymore

AndreMaz commented 5 months ago

But then why at 11:55 it tried to send data, which means that connected=true, but then failed to resolve the publish promise?

robertsLando commented 5 months ago

which means that connected=true, but then failed to resolve the publish promise?

I think because the connection was not broken yet? What I don't understand is that you have a keepalive of 60s but the disconnect happens much later 🤔 I think I just have discovered a possible bug here as I'm shifting the keepalive check every time we send a packet but we should do it every time we successfully write it to stream instead.

Dunno if this could solve the issue but it's worth a try

robertsLando commented 5 months ago

Ok yes I think to have found the root cause of the problem guys 🎉 Could someone give a try to the PR #1855?

overflowz commented 5 months ago

If that's gonna fix it, I'm gonna cri

robertsLando commented 5 months ago

Waiting for your feedback so! It works on my side and it also makes sense, I wasn't able to reproduce it on my side because I was publishing a message every time I received one, in your case you keep publish independently and when I did this I noticed the keepalive timeout was never happening and that's the error here, we was shifting the keepalive check on every message sent (so even on publish) causing it to never trigger even if the connection was broken

AndreMaz commented 5 months ago

@robertsLando thank you! Just deployed. Will leave it running for a couple of days.

I'll let you know the result

robertsLando commented 5 months ago

I just pushed a little change to avoid possible race conditions, please redeploy with latest commit.

Now I will write some tests to better cover this things

AndreMaz commented 5 months ago

@robertsLando yep yep. Re-deploying it now