shutdown() call in a6abd98d has broken emperor

unixwitch commented 7 years ago

a6abd98d introduced a call to shutdown(fd, SHUT_RDWR) in uwsgi_close_all_sockets(). This has caused our emperor to stop working when >= 2 vassals are configured; all workers get stuck in an infinite loop of epoll_wait/accept4:

    [pid 27011] accept4(3, 0x7f9ac142b0a2, [110], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 27011] epoll_wait(9, {{EPOLLIN|EPOLLHUP, {u32=3, u64=3}}}, 1, -1) = 1
    [pid 27011] accept4(3, 0x7f9ac142b0a2, [110], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 27011] epoll_wait(9, {{EPOLLIN|EPOLLHUP, {u32=3, u64=3}}}, 1, -1) = 1
    [pid 27011] accept4(3, 0x7f9ac142b0a2, [110], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 27011] epoll_wait(9, {{EPOLLIN|EPOLLHUP, {u32=3, u64=3}}}, 1, -1) = 1

strace suggests this is triggered by the shutdown() call:

    [pid  5073] epoll_wait(12,  <unfinished ...>
    [pid  5060] chdir("/etc/tbx/uwsgi/conf.d.test") = 0
    [pid  5060] openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
    [pid  5060] getdents(5, /* 6 entries */, 32768) = 192
    [pid  5060] lstat("tbxwagtail.ini", {st_mode=S_IFLNK|0777, st_size=10, ...}) = 0
    [pid  5060] lstat("ciffwagtail.ini", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
    [pid  5060] getdents(5, /* 0 entries */, 32768) = 0
    [pid  5060] close(5)                    = 0
    [pid  5060] lstat("tbxwagtail.ini", {st_mode=S_IFLNK|0777, st_size=10, ...}) = 0
    [pid  5060] lstat("ciffwagtail.ini", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
    [pid  5060] wait4(-1, 0x7ffff578dccc, WNOHANG, NULL) = 0
    [pid  5060] epoll_wait(3,  <unfinished ...>
    [pid  5062] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fed6eb3ca50) = 5077
    Process 5077 attached
    [pid  5077] set_robust_list(0x7fed6eb3ca60, 24 <unfinished ...>
    [pid  5062] open("/home/ciffwagtail/redis-ciffwagtail.pid", O_RDONLY <unfinished ...>
    [pid  5077] <... set_robust_list resumed> ) = 0
    [pid  5062] <... open resumed> )        = 37
    [pid  5062] fstat(37, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
    [pid  5062] read(37, "9605\n", 5)       = 5
    [pid  5062] kill(9605, SIG_0)           = 0
    [pid  5062] close(37)                   = 0
    [pid  5077] shutdown(3, SHUT_RDWR)      = 0
    [pid  5073] <... epoll_wait resumed> {{EPOLLIN|EPOLLHUP, {u32=3, u64=3}}}, 1, -1) = 1
    [pid  5077] close(3 <unfinished ...>
    [pid  5073] accept4(3,  <unfinished ...>
    [pid  5077] <... close resumed> )       = 0
    [pid  5062] write(2, "[ciffwagtail] - [uwsgi-daemons] "..., 162 <unfinished ...>
    [pid  5077] close(3 <unfinished ...>
    [pid  5073] <... accept4 resumed> 0x7fed6bbf90a2, [110], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)

Commenting out this call to shutdown has fixed the problem for us: https://github.com/torchbox/uwsgi/commit/85ff4d1adab344a77a3b2cf9efb002f530ab8ffb

I don't know what the root cause of this is or what the correct behaviour should be.

gdamjan commented 7 years ago

close_all_sockets() is also called in core/utils.c:uwsgi_run_command, core/mule.c:uwsgi_mule … core/daemons.c, core/spooler.c and some other places, so using attach-daemon, mules, the spooler, etc. will shut down all the sockets, most importantly the listening socket.
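
To illustrate the difference between close() and shutdown() on an inherited descriptor, here is a minimal Python sketch. It is not uwsgi code. It assumes Linux and a Unix-domain listening socket (the 110-byte address length in the strace above is consistent with that), and it uses dup() as a stand-in for the descriptor a forked child inherits. The names make_listener, poll_once and demo are made up for this example.

```python
import os
import select
import socket
import tempfile


def make_listener(path):
    """Create a non-blocking Unix-domain listening socket bound to `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(8)
    sock.setblocking(False)
    return sock


def poll_once(sock, timeout_ms=200):
    """Return the poll event mask reported for `sock` (0 if nothing is ready)."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(timeout_ms)
    return events[0][1] if events else 0


def demo(use_shutdown):
    """Dispose of a duplicate of a listening socket and report what the original sees."""
    with tempfile.TemporaryDirectory() as tmp:
        listener = make_listener(os.path.join(tmp, "demo.sock"))
        # Assumption: dup() stands in for the descriptor a forked child inherits.
        # Both descriptors refer to the same underlying socket.
        inherited = listener.dup()
        if use_shutdown:
            inherited.shutdown(socket.SHUT_RDWR)  # acts on the shared socket itself
        inherited.close()  # only releases this one descriptor

        mask = poll_once(listener)
        try:
            conn, _ = listener.accept()
            conn.close()
            accept_error = None
        except OSError as exc:
            accept_error = type(exc).__name__
        listener.close()
        return {
            "hup": bool(mask & select.POLLHUP),
            "mask": mask,
            "accept_error": accept_error,
        }


if __name__ == "__main__":
    print("close only:      ", demo(use_shutdown=False))
    print("shutdown + close:", demo(use_shutdown=True))
```

On a kernel that behaves like the one in the strace, the shutdown case should report a hangup on the original descriptor straight away even though no client is connecting, which is what sends an epoll_wait/accept4 loop spinning. The close-only case should leave the listener idle.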

gdamjan commented 7 years ago

https://github.com/unbit/uwsgi/pull/1371 should've had more testing :/

unbit commented 7 years ago

Can you check if the latest patch fixes the issue?

unixwitch commented 7 years ago

Yes, that seems to fix it (at least in quick testing).

xrmx commented 7 years ago

@unixwitch can we close this?

unixwitch commented 7 years ago

Yes, feel free to close this; we've had no problems since applying f262899 in our local build.

unbit/uwsgi, issue #1409