Open mark0n opened 4 years ago
I am able to reproduce this on Buster without systemd on upstream 2.8.0. I built procServ in a directory adjacent to the IOC top, and ran this script in the procServ directory:
#!/usr/bin/env bash
echo "create a procServ instance, wait for it to spin up"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "kill (SIGTERM) and don't wait at all, should see error 98"
pkill procServ
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "start new process"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
pkill procServ
echo "kill, then wait 10s"
sleep 10
echo "start new process"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "kill the new process that started correctly"
pkill procServ
@daykin after heading in the wrong direction for quite a while myself, I realized that this is not the right way to reproduce the problem, since there's an important difference: pkill sends a SIGTERM to procServ and returns immediately (without waiting for procServ to shut down). Looking at the procServ code, there are a few things that need to be done during shutdown, so this can take a short time. In the meantime the script starts another procServ instance, which, for obvious reasons, complains about the port being in use.
The following version of the script should behave more like systemd:
#!/usr/bin/bash
./procServ --foreground --restrict --logfile=./procServ.log --port 4051 sleep 100000
./procServ --foreground --restrict --logfile=./procServ.log --port 4051 sleep 100000
but it does not show the problem if I'm sending a SIGTERM (by running pkill -x procServ in a second terminal). I do see the problem when I run pkill -9 -x procServ, though.
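For the first script, one way to rule out the race between pkill returning and procServ actually exiting is to block until the process is gone before starting the next instance. A minimal sketch, assuming the process belongs to the current user; terminate_and_wait is a hypothetical helper name, not something procServ ships:

```shell
#!/usr/bin/env bash
# Hypothetical helper: send SIGTERM to a PID and block until it has exited.
# Returns 0 once the process is gone, 1 if it is still alive after the timeout.
terminate_and_wait() {
    local pid=$1
    local timeout_s=${2:-10}
    local tries=$(( timeout_s * 10 ))   # one poll every 0.1 s

    # If the signal cannot be delivered, treat the process as already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # kill -0 only checks whether the PID still exists; it sends no signal.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$tries" -gt 0 ] || return 1
        tries=$(( tries - 1 ))
        sleep 0.1
    done
    return 0
}
```

With the daemonized instance from the first script, and assuming only one procServ is running, this could replace the bare pkill as terminate_and_wait "$(pgrep -x procServ)" before the next ./procServ line. If error 98 still shows up after that, the shutdown delay alone does not explain it.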
Other findings:
watch -n .1 'netstat -an | grep 4051' only shows TIME_WAIT when a connection is established before terminating procServ; it stays in this lingering state for 120 s, and thanks to SO_REUSEADDR I can restart procServ without any problems within this time. If no connection is established, the port immediately disappears from the output of netstat when procServ is terminated.
The problem can only be observed if procServ is restarted within a split second after terminating it. I was not able to observe any change in state in the relevant time interval, but considering how briefly the problem shows, that's certainly not conclusive.
I only see this issue with a large IOC which is consuming multiple GB of RAM, multiple CPU cores and spawns hundreds of threads. The problem does not show with a small IOC on a single-core VM.
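Since a 0.1 s watch interval is probably too coarse to catch a state that only exists for a split second, a tighter sampling loop might help. The sketch below reads /proc/net/tcp directly instead of calling netstat; it assumes Linux and an IPv4 socket, and port_states and sample_port are hypothetical names chosen for illustration:

```shell
#!/usr/bin/env bash
# Print the kernel TCP state code of every IPv4 socket whose local port
# matches $1. Codes are hex: 01 = ESTABLISHED, 06 = TIME_WAIT, 0A = LISTEN.
port_states() {
    local hexport
    hexport=$(printf '%04X' "$1")
    [ -r /proc/net/tcp ] || { echo "/proc/net/tcp not readable" >&2; return 1; }
    # Column 2 is local_address as HEXIP:HEXPORT, column 4 is the state.
    awk -v p="$hexport" 'FNR > 1 { split($2, a, ":"); if (a[2] == p) print $4 }' /proc/net/tcp
}

# Sample the states for port $1 roughly every 10 ms, $2 times (default 500),
# prefixing each sample with a timestamp (date +%N assumes GNU date).
sample_port() {
    local port=$1
    local n=${2:-500}
    local i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s %s\n' "$(date +%s.%N)" "$(port_states "$port" | tr '\n' ' ')"
        i=$(( i + 1 ))
        sleep 0.01
    done
}
```

Running sample_port 4051 > states.log in a second terminal while the service restarts would show whether a listening socket (0A) is still present at the moment the new instance tries to bind. Even at this rate a very short window can be missed, so an empty result is still not conclusive.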
After a failure systemd automatically restarts the service and it succeeds on the second try. Here's my service file:
I'm using procServ 2.7.0-1 on Debian 10 "buster".