halfgaar / FlashMQ

FlashMQ is a fast light-weight MQTT broker/server, designed to take good advantage of multi-CPU environments
https://www.flashmq.org/
Open Software License 3.0

Running multiple testcases should not leak file descriptors #98

Closed quinox closed 1 month ago

quinox commented 1 month ago

Observation

Running all testcases in 1 go crashes on my system:

$ ./flashmq-tests 2>/dev/null
INIT: forkingTestForkingTestServer
RUN: forkingTestForkingTestServer
PASS: forkingTestForkingTestServer

...

INIT: testDowngradeQoSOnSubscribeQos0to0
RUN: testDowngradeQoSOnSubscribeQos0to0
PASS: testDowngradeQoSOnSubscribeQos0to0

INIT: testDowngradeQoSOnSubscribeQos1to0
RUN: testDowngradeQoSOnSubscribeQos1to0
FAIL EXCEPTION: testDowngradeQoSOnSubscribeQos1to0: Too many open files

INIT: testDowngradeQoSOnSubscribeQos1to1
fish: Job 1, './flashmq-tests 2>/dev/null' terminated by signal SIGABRT (Abort)

It always crashes on the same testcase.

The testcase itself runs fine:

$ ./flashmq-tests testDowngradeQoSOnSubscribeQos1to0 2>/dev/null
INIT: testDowngradeQoSOnSubscribeQos1to0
RUN: testDowngradeQoSOnSubscribeQos1to0
PASS: testDowngradeQoSOnSubscribeQos1to0

Tests run: 1. Passed: 1. Failed: 0 (of which 0 exceptions). Total assertions: 16.

TESTS PASSED

If I raise my open file limit using ulimit -Sn 4096 it goes much further but still can't make it to the end.
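For reference, `ulimit -Sn 4096` changes the soft `RLIMIT_NOFILE` limit of the shell, and a process can attempt the same thing for itself with `getrlimit`/`setrlimit`. The following is only a minimal sketch of that idea; the helper name `setSoftNofile` is hypothetical and this is not taken from the FlashMQ source.

```cpp
#include <algorithm>
#include <cassert>
#include <sys/resource.h>

// Hypothetical helper (not FlashMQ code): tries to set the soft limit on
// open file descriptors to `wanted`, capped at the current hard limit.
// Returns the soft limit now in effect, or 0 if either call failed.
inline rlim_t setSoftNofile(rlim_t wanted)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return 0;

    // An unprivileged process may not raise the soft limit above the hard one.
    lim.rlim_cur = std::min(wanted, lim.rlim_max);

    if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
        return 0;

    return lim.rlim_cur;
}
```

Calling `setSoftNofile(4096)` early in `main()` would be the in-process counterpart of the `ulimit` command above. A return value of 0 means the limit was left unchanged, which is worth logging rather than ignoring.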

halfgaar commented 1 month ago

Weird, neither my system nor the GitHub builders have that issue. I can even reduce the limit to 512.

Can you give some more info about your system, branch, compiler, etc, etc?

quinox commented 1 month ago

Happy to help. I can provide shell access if that makes it easier for you (I don't mind doing the legwork though).

The details:

quinox@gofu ~/p/F/F/buildtests (master)> ulimit --all -H
Maximum size of core files created (kB, -c) unlimited
Maximum size of a process’s data segment (kB, -d) unlimited
Control of maximum nice priority (-e) 0
Maximum size of files created by the shell (kB, -f) unlimited
Maximum number of pending signals (-i) 128081
Maximum size that may be locked into memory (kB, -l) 8192
Maximum resident set size (kB, -m) unlimited
Maximum number of open file descriptors (-n) 4096
Maximum bytes in POSIX message queues (kB, -q) 800
Maximum realtime scheduling priority (-r) 0
Maximum stack size (kB, -s) unlimited
Maximum amount of CPU time in seconds (seconds, -t) unlimited
Maximum number of processes available to current user (-u) 128081
Maximum amount of virtual memory available to each process (kB, -v) unlimited
Maximum contiguous realtime CPU time (-y) unlimited


---

Does it not leak files for you, or does it not crash for you?

The grep for epoll is for no special reason except it shows the leakage nicely (note my limit is 1024):

$ strace -fF ./flashmq-tests 2>&1 | grep 'epoll_create.*= [1-9][0-9]*$'
[pid 6338] epoll_create(999) = 4
[pid 6338] epoll_create(999) = 5
[pid 6340] epoll_create(999) = 9
[pid 6340] epoll_create(999) = 11
[pid 6340] epoll_create(999) = 13
[pid 6340] epoll_create(999) = 15
[pid 6340] epoll_create(999) = 17
[pid 6340] epoll_create(999) = 19
...
[pid 6338] epoll_create(999) = 1000
[pid 6338] epoll_create(999) = 1002
[pid 6338] epoll_create(999) = 1004
[pid 6338] <... epoll_create resumed>) = 1006
[pid 6338] <... epoll_create resumed>) = 1009
[pid 6338] <... epoll_create resumed>) = 1007
[pid 6338] <... epoll_create resumed>) = 1012
[pid 6338] <... epoll_create resumed>) = 1014
[pid 6338] <... epoll_create resumed>) = 1017
[pid 6338] <... epoll_create resumed>) = 1019
[pid 6338] <... epoll_create resumed>) = 1020
[pid 6338] epoll_create(999) = 39
[pid 6338] epoll_create(999) = 40
[pid 6884[2024-05-03 16:35:05.910] [DEBUG] Adding event 'keep-alive check' to the timer with an interval of 5000 d>) = 1023
fish: Process 6334, 'strace' from job 1, 'strace -fF ./flashmq-tests 2>&1…' terminated by signal SIGABRT (Abort)
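The steadily climbing descriptor numbers in the strace output could also be checked from inside the test binary, by counting open descriptors before and after each testcase. Below is a minimal sketch of such a check, assuming a POSIX system; `countOpenFds` is a hypothetical helper and not part of the FlashMQ test suite.

```cpp
#include <algorithm>
#include <cassert>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

// Hypothetical helper (not FlashMQ code): counts the descriptors currently
// open in this process by probing every number below the soft
// RLIMIT_NOFILE limit. Returns -1 if the limit cannot be read.
inline long countOpenFds()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    // Cap the scan so an "unlimited" soft limit does not loop forever.
    const rlim_t scanEnd = std::min<rlim_t>(lim.rlim_cur, 1 << 20);

    long count = 0;
    for (rlim_t fd = 0; fd < scanEnd; ++fd)
    {
        // F_GETFD fails with EBADF when fd is not an open descriptor.
        if (fcntl(static_cast<int>(fd), F_GETFD) != -1)
            ++count;
    }
    return count;
}
```

A test runner could record `countOpenFds()` before a testcase and compare it afterwards; a difference that never returns to zero points at a leak like the one shown above.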

quinox commented 1 month ago

Capturing the state using lsof -nPX in a second window, the biggest capture I made:

halfgaar commented 1 month ago

Thanks, that error from setrlimit made it clear. It's interesting that doesn't work for you.

Anyway, it was kind of an accident that I never ran into it. The setrlimit is just something FlashMQ does, so it also did so in the tests. Some epoll and eventfd file descriptors plainly lacked a close, or even a destructor to call close() in... I fixed it.
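To illustrate the kind of fix described here (a destructor that calls close()), the sketch below wraps a descriptor in a small owning class. It is an assumption about the general pattern, not the actual FlashMQ change; `FdGuard` and `fdIsOpen` are hypothetical names, and the usage shown is Linux-specific because of epoll and eventfd.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <utility>

// Hypothetical helper (not FlashMQ's actual class): owns one file
// descriptor and closes it when the owner is destroyed.
class FdGuard
{
    int fd = -1;

public:
    explicit FdGuard(int fd) : fd(fd) {}

    // Copying would lead to a double close, so only moving is allowed.
    FdGuard(const FdGuard &) = delete;
    FdGuard &operator=(const FdGuard &) = delete;
    FdGuard(FdGuard &&other) noexcept : fd(std::exchange(other.fd, -1)) {}
    FdGuard &operator=(FdGuard &&other) noexcept
    {
        if (this != &other)
        {
            reset();
            fd = std::exchange(other.fd, -1);
        }
        return *this;
    }

    ~FdGuard() { reset(); }

    int get() const { return fd; }

    // Closes the descriptor if one is held and marks the guard as empty.
    void reset()
    {
        if (fd >= 0)
            ::close(fd);
        fd = -1;
    }
};

// Returns true if fd currently refers to an open descriptor.
inline bool fdIsOpen(int fd)
{
    return ::fcntl(fd, F_GETFD) != -1;
}
```

With something like `FdGuard ep(epoll_create1(0));` or `FdGuard ev(eventfd(0, 0));` as a member or local, the descriptor is released when the owning object is destroyed, so a test that builds and tears down many servers does not accumulate descriptors.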