Closed: quinox closed this 1 month ago
Weird, neither my system nor the Github builders have that issue. I can even reduce it to 512.
Can you give some more info about your system, branch, compiler, etc, etc?
Happy to help. I can provide shell access if that makes it easier for you (I don't mind doing the legwork though).
The details:
7e5d0d0eec1312789895ae532a7c4202c5cabe90 (tagged v1.11.0), clean state.

./run-make-from-ci.sh
says: The CXX compiler identification is GNU 12.3.1
./run-make-from-ci.sh --compiler $CXX
says: The CXX compiler identification is Clang 17.0.6
[2024-05-03 16:43:07.442] [ERROR] Setting ulimit nofile failed: 'Operation not permitted'. This means the default is used.
quinox@gofu ~/p/F/F/buildtests (master)> ulimit --all -S
Maximum size of core files created (kB, -c) 0
Maximum size of a process’s data segment (kB, -d) unlimited
Control of maximum nice priority (-e) 0
Maximum size of files created by the shell (kB, -f) unlimited
Maximum number of pending signals (-i) 128081
Maximum size that may be locked into memory (kB, -l) 8192
Maximum resident set size (kB, -m) unlimited
Maximum number of open file descriptors (-n) 1024
Maximum bytes in POSIX message queues (kB, -q) 800
Maximum realtime scheduling priority (-r) 0
Maximum stack size (kB, -s) 8192
Maximum amount of CPU time in seconds (seconds, -t) unlimited
Maximum number of processes available to current user (-u) 128081
Maximum amount of virtual memory available to each process (kB, -v) unlimited
Maximum contiguous realtime CPU time (-y) unlimited
quinox@gofu ~/p/F/F/buildtests (master)> ulimit --all -H
Maximum size of core files created (kB, -c) unlimited
Maximum size of a process’s data segment (kB, -d) unlimited
Control of maximum nice priority (-e) 0
Maximum size of files created by the shell (kB, -f) unlimited
Maximum number of pending signals (-i) 128081
Maximum size that may be locked into memory (kB, -l) 8192
Maximum resident set size (kB, -m) unlimited
Maximum number of open file descriptors (-n) 4096
Maximum bytes in POSIX message queues (kB, -q) 800
Maximum realtime scheduling priority (-r) 0
Maximum stack size (kB, -s) unlimited
Maximum amount of CPU time in seconds (seconds, -t) unlimited
Maximum number of processes available to current user (-u) 128081
Maximum amount of virtual memory available to each process (kB, -v) unlimited
Maximum contiguous realtime CPU time (-y) unlimited
---
Does it not leak files for you, or does it not crash for you?
There's no particular reason for grepping for epoll, except that it shows the leakage nicely (note my limit is 1024):
$ strace -fF ./flashmq-tests 2>&1 | grep 'epoll_create.= [1-9][0-9]$'
[pid 6338] epoll_create(999) = 4
[pid 6338] epoll_create(999) = 5
[pid 6340] epoll_create(999) = 9
[pid 6340] epoll_create(999) = 11
[pid 6340] epoll_create(999) = 13
[pid 6340] epoll_create(999) = 15
[pid 6340] epoll_create(999) = 17
[pid 6340] epoll_create(999) = 19
...
[pid 6338] epoll_create(999) = 1000
[pid 6338] epoll_create(999) = 1002
[pid 6338] epoll_create(999) = 1004
[pid 6338] <... epoll_create resumed>) = 1006
[pid 6338] <... epoll_create resumed>) = 1009
[pid 6338] <... epoll_create resumed>) = 1007
[pid 6338] <... epoll_create resumed>) = 1012
[pid 6338] <... epoll_create resumed>) = 1014
[pid 6338] <... epoll_create resumed>) = 1017
[pid 6338] <... epoll_create resumed>) = 1019
[pid 6338] <... epoll_create resumed>) = 1020
[pid 6338] epoll_create(999) = 39
[pid 6338] epoll_create(999) = 40
[pid 6884[2024-05-03 16:35:05.910] [DEBUG] Adding event 'keep-alive check' to the timer with an interval of 5000 d>) = 1023
fish: Process 6334, 'strace' from job 1, 'strace -fF ./flashmq-tests 2>&1…' terminated by signal SIGABRT (Abort)
Capturing the state using lsof -nPX in a second window, this is the biggest capture I made. lsof takes time to run; the 172 handles over the 8192 limit are probably handles that had already disappeared before lsof was done (and that's why they are of type "unknown" below).

quinox@gofu ~> gawk '{ print $7 }' /tmp/lsof_1714751372.txt | sort | uniq -c | sort -h -r
6880 a_inode
881 0
190 unknown
168 FIFO
144 REG
...
The a_inode handles:
quinox@gofu ~> gawk '$7 == "a_inode" { print $11 }' /tmp/lsof_1714751372.txt | sed 's/:.*//' | sort | uniq -c | sort -hr
3542 eventfd:$num
3338 eventpoll:$num
Thanks, that error from setrlimit made it clear. It's interesting that it doesn't work for you.

Anyway, it was kind of an accident that I never ran into it. The setrlimit is just something FlashMQ does, so it also did so in the tests. Some epoll and eventfd file descriptors plainly lacked a close(), or even a destructor to call close() in... I fixed it.
Observation
Running all testcases in one go crashes on my system:
It always crashes on the same testcase.
The testcase itself runs fine:
If I raise my open file limit using ulimit -Sn 4096, it goes much further but still can't make it to the end.