Periodic error when reloading uwsgi with Master FIFO

JonathanHuot commented 7 years ago

Hi, I have sometime errors when I'm sending a reload instruction to Master FIFO with uWSGI 1.9.18.2.

uWSGI configuration is :

uid = nginx
gid = nginx
master-fifo = /var/run/nginx/uwsgi.masterfifo

Permissions looks good :

ls -l /var/run/nginx/
total 4
drwxr-xr-x   2 nginx nginx 4096 Jan 19 04:31 .
drwxr-xr-x. 23 root  root  4096 Jan 19 15:21 ..
prw------- 1 nginx nginx 0 Jan 19 04:31 uwsgi.masterfifo

Command to send a reload (during logrotate) is : timeout 60 bash -c "echo 'r' > /var/run/nginx/uwsgi.masterfifo"

Example of correct reload:

1476156371 - ...gracefully killing workers...
1476156371 - Gracefully killing worker 2 (pid: 4059)...
1476156371 - Gracefully killing worker 3 (pid: 4060)...
1476156371 - Gracefully killing worker 1 (pid: 4058)...
1476156371 - worker 1 buried after 0 seconds
1476156371 - worker 2 buried after 0 seconds
1476156371 - worker 3 buried after 0 seconds
1476156371 - binary reloading uWSGI...
1476156371 - chdir() to /
1476156371 - closing all non-uwsgi socket fds > 2 (max_fd = 1024)...
(..)

Example of failed reload:

1476674915 - ...gracefully killing workers...
1476674915 - Gracefully killing worker 1 (pid: 32767)...
1476674915 - Gracefully killing worker 2 (pid: 32768)...
1476674915 - Gracefully killing worker 3 (pid: 32769)...
1476674915 - uwsgi_master_fifo()/mkfifo(): File exists [core/fifo.c line 86]
1476674915 - VACUUM: pidfile removed.
1476674915 - VACUUM WARNING: unix socket /var/run/nginx/uwsgi.socket changed inode. Skip removal

If we look at core/fifo.c:86, we don't know if the unlink failed or the mkfifo, because we have none of the return codes are written into the logs. I think, at minimum, uWSGI should write return codes of system calls and system libraries.

xrmx commented 7 years ago

Please retry with latest stable release.

dennylab commented 5 years ago

Hi, I have the same issue.

Version:

python3 -m pip list | grep uWSGI
uWSGI                    2.0.15

I'm using the following to enable the uwsgi master fifo file: master-fifo=/tmp/uwsgi__lx__hydra__amdde_dev__dwe__dwe.fifo

When sending the chain reload instruction: echo c > ${file}

I can see in the logfiles that sometimes I get the error uwsgi_master_fifo()/mkfifo(): File exists [core/fifo.c line 110]

More complete log:

Mon Jul 29 10:13:20 2019 - chain reload starting...
Mon Jul 29 10:13:20 2019 - chain next victim is worker 1
Gracefully killing worker 1 (pid: 24937)...
uwsgi_master_fifo()/mkfifo(): File exists [core/fifo.c line 110]
07-29 10:13:20 - INFO   > amdde_dev_dwe_dwe_worker_1: Stopping external process...
Mon Jul 29 10:13:20 2019 - uWSGI worker 2 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Mon Jul 29 10:13:20 2019 - uWSGI worker 3 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Mon Jul 29 10:13:20 2019 - uWSGI worker 4 screams: UAAAAAAH my master disconnected: i will kill myself !!!
07-29 10:13:20 - INFO   > amdde_dev_dwe_dwe_worker_2: Stopping external process...
07-29 10:13:20 - INFO   > amdde_dev_dwe_dwe_worker_4: Stopping external process...
07-29 10:13:20 - INFO   > amdde_dev_dwe_dwe_worker_3: Stopping external process...

The main issue about this is that the master crashes brutally in the case of mkfifo(): File exists, so the uwsgi workers are loosing its connection to the master and killing themselves. Then, the workers are going crazy and are running on 100% CPU usage. See #296 (and maybe others).

Can't uwsgi just use the existing fifo file, or can I inject a command to remove that file if needed? In my opinion uwsgi could just use the file, because it's up to the developers/users to provide a unique file to each uwsgi!

Please help me. @xrmx

dennylab commented 5 years ago

It still happens with v2.0.18. @xrmx

palmkevin commented 5 years ago

We have encountered the same issue as reported by @labsolution. @xrmx is there are chance this issue will be handled in the next weeks/months. Can I somehow help to solve that issue?

awelzel commented 5 years ago

I've opened #2091 to log an unlink() error if it happens - maybe this can help to debug.

Scenarios that may cause uwsgi to fail removing/creating the fifo I can come up with, but it's widely guessing...

The stated echo c > ${file} will create a normal when the fifo doesn't exist. If this is done as root or privileged user, there may be permission issues. Maybe script should guard the invocation by adding a check for the named pipe: test -p ${file} && echo c > ${file}
echo c > ${file} isn't executed just once, but in a tight-ish loop or simultaneously from two independent scripts (two cron jobs running at the same time?).
Some interaction with systemd (or other supervisor/scripts) that deletes/creates or changes permissions/owner of the directory/master-fifo. At least the initial report mentioned /var/run/nginx/uwsgi.masterfifo which could be a systemd managed RuntimeDirectory of nginx.

@labsolution @palmkevin - is there any chance you could you check your environment? E.g, are you using systemd, other supervisor, cleanup scripts running, etc? Could you post the state of the filesystem (permissions, owners, types) when/after the issue occurred?

dennylab commented 5 years ago

Hello @awelzel ,

I installed the master version and did some tests.

I added the test command and it doesn't seem to help. However, it takes me 3-4 tries before I get the error. The first 1 and 2 times it works correctly. W or w/ the test command. Interestingly it's reproducable every 3-4 tries.

Log:

Mon Oct 28 08:45:07 2019 - chain reload starting... Mon Oct 28 08:45:07 2019 - chain next victim is worker 1 Gracefully killing worker 1 (pid: 1240)... 08:45:07 - INFO > amdde_dev_dwe_dwe_worker_1: Stopping external process... uwsgi_master_fifo()/mkfifo(): File exists [core/fifo.c line 112] Mon Oct 28 08:45:07 2019 - uWSGI worker 4 error: the master disconnected from this worker. Shutting down the worker. Mon Oct 28 08:45:07 2019 - uWSGI worker 2 error: the master disconnected from this worker. Shutting down the worker.

Note that I don't see your new unlink warning.

Version log:

Starting uWSGI 2.1-dev (64bit) on [Mon Oct 28 08:45:11 2019] compiled with version: 4.8.5 20150623 (Red Hat 4.8.5-36) on 28 October 2019 07:20:36 os: Linux-3.10.0-957.21.2.el7.x86_64 #1 SMP Tue May 28 09:26:43 UTC 2019 nodename: lsdev01 machine: x86_64 clock source: unix detected number of CPU cores: 8 current working directory: /home/lx

Fifo files after the error happened:

lsdev01(dwe/amdde_dev):/opt/ls/lx/release/dwe> ls -latr /tmp/uwsgilx*amdde_devdwedwe.fifo prw------- 1 lx lx 0 Oct 25 05:20 /tmp/uwsgilxsocket__amdde_devdwedwe.fifo prw------- 1 lx lx 0 Oct 28 08:51 /tmp/uwsgilxrestuiamdde_devdwedwe.fifo prw------- 1 lx lx 0 Oct 28 08:51 /tmp/uwsgilxhydraamdde_devdwedwe.fifo prw------- 1 lx lx 0 Oct 28 08:51 /tmp/uwsgilxrest__amdde_devdwe__dwe.fifo

I'm 99% sure that the echo is done only once for a file by me. Yes, we are in a loop but I execute the echo only once.

Script:

files=`ls /tmp/uwsgi__lx__*__${CUR_RELEASE}__*.fifo | grep "$1" | grep -v grep 2> /dev/null`
if [ "$?" -eq 0 ]; then
    # Advantage: Workers are reloaded one after the other, so no downtime and issue LX-9607 shoud be fixed
    echo "Using 'Chain reloading (lazy apps)' option to reload the processes..."

    is_named_pipe_being_read(){ /bin/echo unlocked 1<>"$1" >"$1"; }

    for file in ${files}; do
        echo "Using master fifo file ${file}..."

        if is_named_pipe_being_read ${file}; then
            echo c > ${file} # c = chain-reload, w = worker-reload
            if [ "$?" -eq 0 ]; then
                echo " (done)"
            else
                echo " (error)"
                exit 1;
            fi
        else
            echo " (no readers)"
        fi
    done
fi

Execution log:

Using 'Chain reloading (lazy apps)' option to reload the processes... Using master fifo file /tmp/uwsgilxhydraamdde_devdwedwe.fifo... (done) Using master fifo file /tmp/uwsgilxrest__amdde_devdwedwe.fifo... (done) Using master fifo file /tmp/uwsgilxrestuiamdde_devdwedwe.fifo... (done) Using master fifo file /tmp/uwsgilxsocket__amdde_devdwedwe.fifo... (no readers)

Thank you so much for your support!

awelzel commented 5 years ago

Hey @labsolution , thanks for the script, this helps to understand what's happening:

is_named_pipe_being_read(){ /bin/echo unlocked 1<>"$1" >"$1"; }

This writes "unlocked" into the master fifo and then right afterwards there is echo c > ${file} . If there's a uwsgi master listening, then the c in unlocked triggers the first chain reload. Then, echo c attempts to trigger the second one right afterwards - see point 2. in my previous reply.

I was able to reproduce it locally as follows (However: I did observe the master.fifo as a regular file - I'm assuming you restart uwsgi quickly when you observe it dead and that cleaned up and re-created the master fifo).

# The body of the while loop is what your script basically does.
$ while true ; do (/bin/echo unlocked 1<>/tmp/master.fifo > /tmp/master.fifo) && echo "c" > /tmp/master.fifo ; sleep 0.3  ; done 

$ ./uwsgi --master-fifo /tmp/master.fifo  --http-socket :9090 --workers 1  --plugin ./python37_plugin.so --wsgi-file tests/hello.py
...
Mon Oct 28 10:06:11 2019 - chain reloading complete           
Mon Oct 28 10:06:11 2019 - chain reload starting...              
Mon Oct 28 10:06:11 2019 - chain next victim is worker 1  
Gracefully killing worker 1 (pid: 9800)...                         
uwsgi_master_fifo()/mkfifo(): File exists [core/fifo.c line 112]

$ ls -lha /tmp/master.fifo
-rw-r--r-- 1 awelzel awelzel 2 Oct 28 10:07 /tmp/master.fifo

And that at least explains things for your.

Do you actually need the is_named_pipe_being_read()? Maybe the following would suffice to avoid blocking fifo that isn't read for too long?

if [ -p ${file}] ; then
    timeout 1 sh -c "echo c > ${file}"
fi

Probably protect the whole script with cronutils runlock or so, too https://github.com/google/cronutils

awelzel commented 5 years ago

PS:

I'm 99% sure that the echo is done only once for a file by me.

There was a 1% chance it's done more than once after all, right? :-)

dennylab commented 5 years ago

You are the man! Thank you so much!

New line: test -p ${file} && timeout 3 bash -c "echo c > ${file}"

You found the 1% :) That's why we should never say 100% :)

Feel free to add things to the doc, so others don't run into the same issues.

@palmkevin

unbit / uwsgi

Periodic error when reloading uwsgi with Master FIFO #1453