thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0

unusually high open file descriptors when mosquitto bridge connection to Cumulocity IoT is not stable #2205

Closed reubenmiller closed 10 months ago

reubenmiller commented 1 year ago

Describe the bug

Potential file descriptor leak when the mosquitto bridge connection to Cumulocity IoT is being rejected by the server (with "Not Authorized").

The mosquitto bridge to Cumulocity IoT would actually connect for a brief moment (as indicated by the health topic), but would then be disconnected seconds later.

[tedge/health/mosquitto-c8y-bridge] 1
[tedge/health/mosquitto-c8y-bridge] 0
[tedge/health/mosquitto-c8y-bridge] 1
[tedge/health/mosquitto-c8y-bridge] 0

The root cause for the bridge connect/disconnect is due to Cumulocity IoT having some issues after an upgrade when the clean session is set to False.

To Reproduce

There has been no easy way to reproduce this scenario, as the situation occurred due to a Cumulocity IoT platform upgrade combined with the fact that the mosquitto bridge was not using a clean session (clean session set to False). This would result in the bridge being briefly connected (and the `)

Expected behavior

In error states where the mosquitto bridge connection is not stable, the thin-edge.io components should not cause the file descriptor limit (max open files) to be reached, as this affects the recovery of the device. For example, once the old sessions were manually deleted on the Cumulocity IoT server side, the mosquitto bridge should have been able to re-connect; however, it could not, because the max file descriptor limit had already been reached.
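For background on where this per-service limit comes from on a systemd-managed device: it can be raised with a drop-in override. A minimal sketch (the path and the value 8192 are purely illustrative, not a recommendation from this issue):

```ini
# /etc/systemd/system/mosquitto.service.d/limits.conf (hypothetical drop-in)
[Service]
LimitNOFILE=8192
```

Note that raising the limit only masks a genuine descriptor leak; a leaking process would eventually exhaust the higher limit as well.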

Screenshots

The file descriptor counts per process were recorded on a device where mosquitto had stopped functioning because it had exceeded the max open files limit (as controlled by systemd).

The fd counts and the cmdline of each process were then merged to show the unusually high file descriptor counts for mosquitto and some of the thin-edge.io components.

1023 /proc/566  /usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf
263 /proc/717   /usr/bin/c8y-log-plugin
263 /proc/691   /usr/bin/tedge-mapper c8y
262 /proc/653   /usr/bin/c8y-firmware-plugin
261 /proc/632   /usr/bin/c8y-configuration-plugin
86 /proc/1      /sbin/init
43 /proc/260    /lib/systemd/systemd-journald
24 /proc/377    /usr/libexec/bluetooth/bluetoothd
24 /proc/161387 /lib/systemd/systemd --user
22 /proc/870
21 /proc/396    /lib/systemd/systemd-logind
20 /proc/161395 /usr/bin/pulseaudio --daemonize=no --log-target=journal
19 /proc/379
19 /proc/378
18 /proc/401
16 /proc/544
15 /proc/540
14 /proc/278
13 /proc/161400
12 /proc/670    /usr/bin/tedge-agent
11 /proc/763    /usr/bin/tedge-mapper collectd
10 /proc/418
10 /proc/388
10 /proc/365
10 /proc/161406
9 /proc/358
9 /proc/161631
9 /proc/161382
8 /proc/740     /usr/bin/python3 /usr/lib/python3/dist-packages/fileparser_watchdog/parser.py
7 /proc/543
7 /proc/161417
7 /proc/161389
5 /proc/546
4 /proc/561
4 /proc/560
4 /proc/547
4 /proc/161418
4 /proc/161401
3 /proc/161635
3 /proc/161634
3 /proc/161633
3 /proc/161632

The ulimit -a shows that the limit is set to 1024 files:

$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 7783
max locked memory           (kbytes, -l) 249398
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 7783
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
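Since `ulimit -a` only reports the current shell's limits, the effective limit of an already-running process can also be read from `/proc/<pid>/limits` and compared with its current fd count. A sketch (PID 1 is just a stand-in; reading another user's `/proc/<pid>/fd` requires appropriate permissions):

```shell
# Read the "Max open files" limit of a running process and count its
# currently open descriptors under /proc/<pid>/fd.
pid=1
grep "Max open files" "/proc/$pid/limits"
echo "open fds: $(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
```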

Environment (please complete the following information):

Additional context

The following commands were used to count the open file descriptors per process, and then display the command behind each process (e.g. the /proc/N/cmdline file).

find /proc -mindepth 1 -maxdepth 3 -path "/proc/[0-9]*/fd/*" | sed 's|/fd/.\+||g' | sort | uniq -c | sort -r
find /proc -maxdepth 2 -name cmdline -exec sh -c 'printf "%s - %s\n" "$1" "$(sed -e "s/\x00/ /g" "$1")"' sh {} \;
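The two commands above can also be merged into a single pass. A sketch that prints each process's fd count next to its command line (same information, just joined):

```shell
# For every PID directory under /proc, count the entries in fd/ and print the
# count together with the NUL-separated cmdline (NULs converted to spaces).
for d in /proc/[0-9]*; do
  n=$(ls "$d/fd" 2>/dev/null | wc -l)
  cmd=$(tr '\0' ' ' < "$d/cmdline" 2>/dev/null)
  [ "$n" -gt 0 ] && printf '%5d %s %s\n' "$n" "$d" "$cmd"
done | sort -rn
```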
reubenmiller commented 1 year ago

I suspect that some interaction with the request for the Cumulocity token is causing the high file descriptor counts / exhaustion, as the high counts are mostly seen on the thin-edge.io components which require a JWT from Cumulocity.

It would be interesting to find out the following (though this is just one theory...you should not rule out other potential sources)

reubenmiller commented 1 year ago

The large number of file descriptors has not been reproducible yet. It seems the specific scenario can only be invoked by replicating this behaviour from the server side (which is technically not feasible at this time). We're trying to set up some long-term testing farms to monitor such long-term behaviour and provide more clues in the future.

However, we are confident that under normal operation there is no leakage of file descriptors, and that the problem is limited to the case where the cloud MQTT broker rejects the client connection 1-2 seconds after a successful connect. So there should be a low risk of this happening.
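For the long-term monitoring mentioned above, one lightweight approach (a sketch; the function name and scheduling are hypothetical) is to periodically log a timestamped fd count per PID and watch for monotonic growth:

```shell
# Print one timestamped fd-count sample for a given PID; schedule via cron or
# a loop, redirecting output to a log file, to build a time series.
sample_fds() {
  pid=$1
  printf '%s pid=%s fds=%s\n' "$(date -u +%FT%TZ)" "$pid" \
    "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
}

sample_fds $$   # sample the current shell as an example
```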

reubenmiller commented 10 months ago

Closing as this has not been reproducible; in general, use the following versions: