unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi
Other
3.47k stars 692 forks source link

Some subprocesses in ruby applications hang #2521

Open vali-um opened 1 year ago

vali-um commented 1 year ago

Hi,

we're having some troubles with uwsgi (2.0.20/2.0.21), ruby (2.7/3.1) and redmine (5.0.4) regarding threads (master mode in uwsgi): calls to popen, backticks and shell with non-existing binaries fail and hang indefinitely (the shell stays in the process list as a zombie process). Redmine uses this to determine available tools.

Some example calls may be found here:

There are also some ruby bugs that seem to be relevant here:

We created a minimal rack app to reproduce this behaviour:

class App
    def shell()
        system("/bin/nonexistent")
        $?.to_s
    end

    def call(environ)
        [200, {'Content-Type' => 'text/html'}, [shell()]]
    end
end

run App.new

Steps to reproduce with uwsgi + rack plugin:

For us the issue arises in the minimal app if the request is made directly after uwsgi startup.

uwsgi --plugins rack --rack app.ru --daemonize app.log --master --processes 2 --http-socket 0.0.0.0:8080; curl localhost:8080

Result on our test machine: The request hangs, ps shows an zombie process (root 11109 0.0 0.0 0 0 ? Z 16:07 0:00 [uwsgi] <defunct>). A second request succeeds, immediately afterwards the first request also finishes.

Steps to reproduce with uwsgi + rack & rbthreads plugin:

Running uwsgi

uwsgi --plugins rack,rbthreads --rbthreads --threads 4 --rack app.ru --master --processes 2 --http-socket 0.0.0.0:8080

Result on our test machine: The first request works as expected. Subsequent requests will hang until the 5th request (= thread limit per worker +1). After the 5th request all still running requests finish.

We were not sure where to report this issue. Any suggestions or help would be highly appreciated.

vali-um commented 1 year ago

A short excerpt of one strace we managed to capture:

Request 1:

[pid  9419] read(16<TCP:[192.168.1.174:8080->192.168.2.171:57454]>, "GET / HTTP/1.1\r\nHost: test.srv"..., 4096) = 932
...
[pid  9419] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe4ff18ba50) = 9427
...
[pid  9427] execve("/bin/nonexistent", ["/bin/nonexistent"], 0x55aa3b067e90 /* 19 vars */) = -1 ENOENT (No such file or directory)
[pid  9427] write(18<pipe:[355911]>, "\2\0\0\0", 4) = 4
[pid  9419] <... read resumed>"\2\0\0\0", 4) = 4
[pid  9427] exit_group(127 <unfinished ...>
[pid  9419] close(17<pipe:[355911]> <unfinished ...>
[pid  9427] <... exit_group resumed>)   = ?
[pid  9419] <... close resumed>)        = 0
[pid  9427] +++ exited with 127 +++
[pid  9419] sched_yield()               = 0
[pid  9419] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9427, si_uid=0, si_status=127, si_utime=0, si_stime=0} ---
[pid  9418] <... epoll_wait resumed>[], 1, 1000) = 0
[pid  9419] getpid( <unfinished ...>
[pid  9418] getsockopt(3<TCP:[0.0.0.0:8080]>, SOL_TCP, TCP_INFO,  <unfinished ...>
[pid  9419] <... getpid resumed>)       = 9419
[pid  9418] <... getsockopt resumed>"\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0d\0\0\0"..., [104]) = 0
[pid  9419] rt_sigreturn({mask=[]} <unfinished ...>
[pid  9418] wait4(-1,  <unfinished ...>
[pid  9419] <... rt_sigreturn resumed>) = 0
....

Request 1 doesn't finish.

Request 2:

[pid  9423] read(17<TCP:[192.168.1.174:8080->192.168.2.171:57470]>, "GET / HTTP/1.1\r\nHost: test.srv"..., 4096) = 889
...
[pid  9423] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe4fb0e69d0) = 9429
....
[pid  9429] execve("/bin/nonexistent", ["/bin/nonexistent"], 0x55aa3b067e90 /* 19 vars */) = -1 ENOENT (No such file or directory)
[pid  9429] write(19<pipe:[355929]>, "\2\0\0\0", 4) = 4
[pid  9423] <... read resumed>"\2\0\0\0", 4) = 4
[pid  9429] exit_group(127 <unfinished ...>
[pid  9423] close(18<pipe:[355929]> <unfinished ...>
[pid  9429] <... exit_group resumed>)   = ?
[pid  9423] <... close resumed>)        = 0
[pid  9429] +++ exited with 127 +++
[pid  9423] futex(0x55aa3b0467c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  9419] <... ppoll resumed>)        = ? ERESTARTNOHAND (To be restarted if no handler)
[pid  9419] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9429, si_uid=0, si_status=127, si_utime=0, si_stime=0} ---
[pid  9419] getpid()                    = 9419
[pid  9419] rt_sigreturn({mask=[]})     = -1 EINTR (Interrupted system call)
[pid  9419] read(4<anon_inode:[eventfd]>, 0x7fff41be88d0, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid  9419] wait4(9429, [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], WNOHANG, NULL) = 9429
[pid  9419] futex(0x55aa3b0467c0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  9423] <... futex resumed>)        = 0
[pid  9419] <... futex resumed>)        = 1
[pid  9423] futex(0x55aa3b0467f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  9419] futex(0x55aa3b0467f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  9423] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid  9419] <... futex resumed>)        = 0
[pid  9423] futex(0x55aa3b0467f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  9419] wait4(9427,  <unfinished ...>
[pid  9423] <... futex resumed>)        = 0
[pid  9419] <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 127}], WNOHANG, NULL) = 9427
[pid  9423] futex(0x55aa3af5d1e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  9419] write(4<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  9419] futex(0x55aa3af5d1e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  9423] <... futex resumed>)        = 0
[pid  9419] <... futex resumed>)        = 1
[pid  9423] futex(0x55aa3af5d1e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  9419] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  9423] <... futex resumed>)        = 0
[pid  9419] <... clock_gettime resumed>{tv_sec=1144088, tv_nsec=344160939}) = 0
[pid  9423] writev(17<TCP:[192.168.1.174:8080->192.168.2.171:57470]>, [{iov_base="HTTP/1.1 200 OK\r\nContent-Type: t"..., iov_len=44}, {iov_base="error", iov_len=5}], 2 <unfinished ...>
[pid  9419] futex(0x55aa3af4dd10, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1144088, tv_nsec=444160939}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  9423] <... writev resumed>)       = 49
[pid  9423] gettimeofday({tv_sec=1677602195, tv_usec=936455}, NULL) = 0
[pid  9423] close(17<TCP:[192.168.1.174:8080->192.168.2.171:57470]>) = 0
[pid  9423] writev(2</root/uwsgi-rack-race/uwsgi_race_full.strace2>, [{iov_base="[pid: 9419|app: 0|req: 2/1] 10.4"..., iov_len=193}], 1[pid: 9419|app: 0|req: 2/1] 192.168.2.171 () {36 vars in 1090 bytes} [Tue Feb 28 16:36:35 2023] GET / => generated 5 bytes in 23 msecs (HTTP/1.1 200) 1 headers in 44 bytes (0 switches on core 4)
) = 193
[pid  9423] futex(0x55aa3af4dd10, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  9419] <... futex resumed>)        = 0
[pid  9423] futex(0x55aa3a24e978, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  9419] futex(0x55aa3af5d110, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  9419] writev(16<TCP:[192.168.1.174:8080->192.168.2.171:57454]>, [{iov_base="HTTP/1.1 200 OK\r\nContent-Type: t"..., iov_len=44}, {iov_base="error", iov_len=5}], 2) = 49
[pid  9419] gettimeofday({tv_sec=1677602195, tv_usec=958538}, NULL) = 0
[pid  9419] close(16<TCP:[192.168.1.174:8080->192.168.2.171:57454]>) = 0
[pid  9419] writev(2</root/uwsgi-rack-race/uwsgi_race_full.strace2>, [{iov_base="[pid: 9419|app: 0|req: 2/2] 10.4"..., iov_len=195}], 1[pid: 9419|app: 0|req: 2/2] 192.168.2.171 () {40 vars in 1143 bytes} [Tue Feb 28 16:36:29 2023] GET / => generated 5 bytes in 6431 msecs (HTTP/1.1 200) 1 headers in 44 bytes (0 switches on core 0)
) = 195

Request 2 completes, and immediately afterwards request 1

Attached is the full strace: uwsgi_race_full.strace.txt