thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0

mosquitto unresponsive and using 100% CPU #2692

Open reubenmiller opened 4 months ago

reubenmiller commented 4 months ago

Describe the bug

On a device, the local mosquitto MQTT broker was observed to be unresponsive, with the mosquitto process consuming 100% of the CPU time of one core.

mosquitto was reporting the following error, and after some time no more logs were being written:

OpenSSL Error[0]: error:0A000126:SSL routines::unexpected eof while reading

The local MQTT broker was non-functional which resulted in all of the thin-edge.io components failing to connect to the broker with the following network connection error:

Feb 08 13:13:11 rackfslot1 tedge-mapper[3356145]: 2024-02-08T12:13:11.071637229Z ERROR mqtt_channel::connection: MQTT connection error: Network timeout
Feb 08 13:13:16 rackfslot1 tedge-mapper[3356145]: 2024-02-08T12:13:16.07245946Z ERROR mqtt_channel::connection: MQTT connection error: Network timeout
Feb 08 13:13:21 rackfslot1 tedge-mapper[3356145]: 2024-02-08T12:13:21.073309857Z ERROR mqtt_channel::connection: MQTT connection error: Network timeout

Manually subscribing to the local MQTT broker on localhost:1883 was also met with a Network timeout error.

mosquitto could only be revived by restarting the service:

systemctl restart mosquitto

Afterwards all of the services started functioning again.

Secondary symptoms

The following were some secondary symptoms which were observed when the device was in this state.

To Reproduce

This situation has not been reproduced yet. However, there seems to be some correlation between a Cumulocity IoT update occurring and this mosquitto high CPU behaviour.

Expected behavior

Screenshots

Environment (please complete the following information):

| Property | Value |
|---|---|
| OS [incl. version] | Debian GNU/Linux 12 (bookworm) |
| Hardware [incl. revision] | Raspberry Pi 4 Model B Rev 1.5 |
| System-Architecture | Linux rackfslot1 6.1.0-rpi6-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.58-1+rpt2 (2023-10-27) aarch64 GNU/Linux |
| thin-edge.io version | tedge 0.13.2~141+g1ef77c9 |
| mosquitto version | 2.0.11 |

Additional context

log files

The following log files were collected for two devices: one where mosquitto was using 100% CPU, and another where the device resumed the c8y-bridge connection after the Cumulocity IoT update.

Mitigation strategy

One mitigation would be to use a service such as monit to detect sustained high CPU usage by the mosquitto broker and restart the broker when that happens.
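As a rough illustration of the check described above (this is not a monit configuration), the detection could be sketched in shell as follows. The function names, the 90% threshold, the 10 second interval and the strike count are assumptions chosen for illustration, not values measured on the affected devices. The sketch samples utime + stime from /proc/<pid>/stat instead of using ps, because ps reports CPU usage averaged over the lifetime of the process and would react slowly to a sudden spike.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: restart mosquitto after sustained high CPU.

# Print the CPU ticks (utime + stime, fields 14 and 15 of /proc/<pid>/stat)
# used so far by PID $1. Assumes the process name contains no spaces.
cpu_ticks() {
    awk '{ print $14 + $15 }' "/proc/$1/stat" 2>/dev/null
}

# cpu_percent BEFORE AFTER INTERVAL_S TICKS_PER_S
# Print CPU usage over the interval as an integer percentage of one core.
cpu_percent() {
    echo $(( ($2 - $1) * 100 / ($3 * $4) ))
}

# watch_mosquitto THRESHOLD_PCT INTERVAL_S MAX_STRIKES
# Restart the service once usage stays at or above the threshold for
# MAX_STRIKES consecutive samples. Needs root for systemctl restart.
watch_mosquitto() {
    threshold=$1 interval=$2 max_strikes=$3
    hz=$(getconf CLK_TCK)
    strikes=0
    while true; do
        pid=$(pgrep -o -x mosquitto) || { sleep "$interval"; continue; }
        before=$(cpu_ticks "$pid") || continue
        sleep "$interval"
        # Skip the sample if the process exited or was restarted meanwhile
        after=$(cpu_ticks "$pid") || { strikes=0; continue; }
        if [ "$(cpu_percent "$before" "$after" "$interval" "$hz")" -ge "$threshold" ]; then
            strikes=$((strikes + 1))
        else
            strikes=0
        fi
        if [ "$strikes" -ge "$max_strikes" ]; then
            systemctl restart mosquitto
            strikes=0
        fi
    done
}

# Example: watch_mosquitto 90 10 6   (90% of one core or more for about a minute)
```

Requiring several consecutive samples above the threshold avoids restarting the broker on short, legitimate bursts of load.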

reubenmiller commented 4 months ago

The current theory is that the open file limit was exhausted, which caused mosquitto to become unresponsive. However, it has been difficult to reproduce this scenario exactly.
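To check this theory on a device, the number of file descriptors held by the broker can be compared against its soft limit via /proc. This is only a sketch: the helper names are made up for illustration, and reading /proc/<pid>/fd for a process owned by another user normally requires root.

```shell
#!/bin/sh
# Hypothetical helpers to compare a process's open files against its limit.

# Print the number of file descriptors currently open by PID $1.
open_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Print the soft "Max open files" limit of PID $1 from /proc/<pid>/limits.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# fd_usage_percent OPEN LIMIT
# Print the share of the limit in use as an integer percentage.
fd_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Example (run as root):
#   pid=$(pgrep -o -x mosquitto)
#   echo "$(fd_usage_percent "$(open_fds "$pid")" "$(fd_soft_limit "$pid")")% of the open file limit in use"
```

A value close to 100% while the broker is spinning would support the exhausted file limit theory. A low value would point elsewhere.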

Here are some mosquitto tickets:

reubenmiller commented 4 months ago

There is also a theory that the high CPU behaviour might be due to an older version of mosquitto (e.g. < 2.0.18).

On Debian bookworm, mosquitto 2.0.18 (or newer) can be installed via the bookworm-backports repo (see instructions below):

  1. Edit the apt sources list

    # /etc/apt/sources.list
    # Backports
    deb http://deb.debian.org/debian bookworm-backports main contrib non-free non-free-firmware
    deb-src http://deb.debian.org/debian bookworm-backports main contrib non-free non-free-firmware

  2. Update mosquitto to the latest available version

    apt-get update
    apt-get install mosquitto -t bookworm-backports

reubenmiller commented 4 months ago

The memory used by the thin-edge.io components has levelled out (but the memory has not been released) after restarting the mosquitto service.
