thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
216 stars 54 forks source link

Improve `tedge mqtt sub` error handling #1797

Open Bravo555 opened 1 year ago

Bravo555 commented 1 year ago

Describe the bug

Maybe it's not a bug per se, but while working on MQTT client authentication, when the broker refuses the connection, the tedge mqtt sub tedge/measurements command does not exit, but keeps attempting to connect and outputting ERROR: Connection refused, return code: NotAuthorized without exiting. This is different from the case when the broker is not even running, and the process exits with the following message:

Error: failed to subscribe the topic "tedge/measurements" with QoS "AtMostOnce".

Caused by:
    MQTT connection error: I/O: Connection refused (os error 111)

    Hint: Is MQTT server running?

The process also correctly exits with a failure when we attempt to publish a message instead of subscribing.

To Reproduce

  1. In /etc/tedge/mosquitto-conf/tedge-mosquitto.conf, change allow_anonymous to false
  2. sudo systemctl restart mosquitto
  3. tedge mqtt sub tedge/measurements
  4. The command should print the NotAuthorized error message every 1s.

Expected behavior I think this behaviour should be made more consistent and the process should exit first time it receives NotAuthorized message from the broker, as it is unlikely that retrying would help, and so that the command can properly report the failure if it's being called e.g. from a script. There are probably other possible errors where we should just exit and report the failure to the users, and perhaps others where exiting would be unwise, and some other strategy, like exponential backoff, should be used. While fixing the behaviour for NotAuthorized case, we could consider these other cases as well.

Bravo555 commented 1 year ago

I noticed it also affects other MQTT clients e.g. tedge-agent, tedge-mapper, etc. These clients continuously reattempt connection, causing high CPU usage of themselves, and of the broker, quickly filling up mosquitto log, and causing high RAM usage in mosquitto:

htop showing high mosquitto CPU and RAM usage

With them, there are 2 issues:

  1. As above, these clients keep attempting to reconnect very quickly
  2. They do not fail and quit when mosquitto broker is shut down, but keep being restarted by systemd, possibly once every second. This is not as fast as with the 1st case, but still suboptimal.

Fixing MQTT channel API used by these different components would fix the case 1 for all of them, but we should also consider fixing case 2 by putting some kind of restart limit on relevant systemd units, so they're not being restarted indefinitely when MQTT broker is not available.

didier-wenzek commented 1 year ago

This is somehow related to https://github.com/thin-edge/thin-edge.io/issues/1778 and must be fixed accordingly.

didier-wenzek commented 1 year ago

I agree that there is no point to keep trying on NotAuthorized and maybe some other errors. However, I do think that the agent and the mapper should not quit when the broker is down, but must keep trying to reconnect (and adding exponential back-off is indeed a good idea). Moving to systemd the reconnect strategy is not appropriate as thin-edge must not depend on systemd.

reubenmiller commented 6 months ago

@Bravo555 Can you check if this is still a problem?

Bravo555 commented 6 months ago

@Bravo555 Can you check if this is still a problem?

The first issue, of tedge sub trying to retry when it receives NotAuthorized is still there, but as it was said, in some cases we want to retry because we want tedge services to connect to MQTT broker as soon as it becomes available, and we don't want to depend too much on systemd, so it makes sense for these services to retry themselves instead of depending on systemd for being restarted. It's only a bit inconsistent for a tedge mqtt command, where a) pub exits but sub retries, and b) tedge mqtt command is usually being run as a human, so we don't need to retry. For this i think the issue is only in tedge sub command, only it would have to be adjusted, with no changes to mqtt_channel API.

The spinning of mosquitto and MQTT services, on the other hand, I could no longer reproduce, either because of I did not do it exactly the same way as I had before, or because a different version of mosquitto is used. However it is still possible to cause the spin, as evidenced by #2728 , which I think is the obvious and pressing bug and must be fixed.