procServ fails to start immediately after shutting it down

mark0n commented 4 years ago

Jun 29 15:38:38 ls2-rf systemd[1]: Stopping EPICS Soft IOC ls2-rf-llrf...
Jun 29 15:38:38 ls2-rf systemd[1]: softioc-ls2-rf-llrf.service: Succeeded.
Jun 29 15:38:38 ls2-rf systemd[1]: Stopped EPICS Soft IOC ls2-rf-llrf.
Jun 29 15:38:38 ls2-rf systemd[1]: Starting EPICS Soft IOC ls2-rf-llrf...
Jun 29 15:38:38 ls2-rf procServ[1984]: Caught an exception creating the initial control telnet port: Bad file descriptor
Jun 29 15:38:38 ls2-rf procServ[1984]: /usr/bin/procServ: Exiting with error code: 98
Jun 29 15:38:38 ls2-rf systemd[1]: softioc-ls2-rf-llrf.service: Main process exited, code=exited, status=98/n/a
Jun 29 15:38:38 ls2-rf systemd[1]: softioc-ls2-rf-llrf.service: Failed with result 'exit-code'.
Jun 29 15:38:38 ls2-rf systemd[1]: Failed to start EPICS Soft IOC ls2-rf-llrf.

After a failure systemd automatically restarts the service and it succeeds on the second try. Here's my service file:

[Unit]
Description=EPICS Soft IOC ls2-rf-llrf
Requires=network.target
Wants=caRepeater.service
After=network.target caRepeater.service
RequiresMountsFor=

[Service]
Environment="AUTOSAVE_DIR=/var/lib/softioc-ls2-rf-llrf"
Environment="EPICS_CA_MAX_ARRAY_BYTES=8519680"
Environment="EPICS_CA_SEC_FILE=/usr/local/lib/iocapps/fe_ca_sec/rf.acf"
Environment="EPICS_IOC_LOG_INET=sys-logstash.cts"
Environment="EPICS_IOC_LOG_PORT=7005"
Environment="EPICS_PUT_LOG_INET=sys-logstash.cts"
Environment="EPICS_PUT_LOG_PORT=7004"
Environment="ETCD_SERVER=etcd.cts:2379"
EnvironmentFile=-/etc/iocs/ls2-rf-llrf/config
ExecStart=/usr/bin/procServ --foreground --quiet --chdir=/usr/local/lib/iocapps/ls2-rf-llrf/iocBoot/iocls2-rf-llrf --ignore=^C^D^] --coresize=10000000 --restrict --logfile=/var/log/softioc-ls2-rf-llrf/procServ.log --name ls2-rf-llrf --port 4051 --port unix:/run/softioc-ls2-rf-llrf/procServ.sock /usr/local/lib/iocapps/ls2-rf-llrf/iocBoot/iocls2-rf-llrf/st.cmd
Type=notify
NotifyAccess=all
Restart=always
User=softioc
RuntimeDirectory=softioc-ls2-rf-llrf

[Install]
WantedBy=multi-user.target

I'm using procServ 2.7.0-1 on Debian 10 "buster".

daykin commented 4 years ago

I am able to reproduce this on Buster without systemd on upstream 2.8.0. I built procServ in a directory adjacent to the IOC top, and ran this script in the procServ directory:

#!/usr/bin/env bash
echo "create a procServ instance, wait for it to spin up"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "kill (SIGTERM) and don't wait at all, should see error 98"
pkill procServ
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "start new process"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
pkill procServ
echo "kill, then wait 10s"
sleep 10
echo "start new process"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "kill the new process that started correctly"
pkill  procServ

mark0n commented 3 years ago

@daykin after heading into the wrong direction for quite a while myself I realized that this is not be the right way to reproduce the problem since there's an important difference:

Your script is sending procServ to the background (it forks), then pkill is sending a SIGTERM to procServ. pkill returns immediately (without waiting for procServ to shut down). Looking at the procServ code there are a few things that need to be done during shut down so this can take a short time. In the mean time the script is starting another procServ instance which for obvious reasons complains about the port being in use.
systemd starts procServ in the foreground. When we ask it to restart procServ it should send a SIGTERM and wait for procServ to shut down. If procServ doesn't shut down within 90 s systemd will send a SIGKILL. systemd should not start a new instance before the old one has shut down completely. I'm wondering how we can verify that systemd is actually in line with these assumptions.

The following version of the script should behave more like systemd:

#!/usr/bin/bash
./procServ --foreground --restrict --logfile=./procServ.log --port 4051 sleep 100000
./procServ --foreground --restrict --logfile=./procServ.log --port 4051 sleep 100000

but it does not show the problem if I'm sending a SIGTERM (by running pkill -x procServ in a second terminal). I do see the problem when I run pkill -9 -x procServ, though.

Other findings:

The issue seems to be unrelated to the socket being left in TIME_WAIT state. watch -n .1 'netstat -an | grep 4051' only shows TIME_WAIT when a connection is established before terminating procServ; it stays in this lingering state for 120 s and thanks to SO_REUSEADDR I can restart procServ without any problems within this time. If no connection is established the port immediately disappears from the output of netstat when procServ is terminated. The problem can only be observed if procServ is restarted within a split second after terminating it. I was not able to observe any change in state in the relevant time interval - but considering the very short time the problem shows that's certainly not conclusive.
It always succeeds when using a Unix domain socket only so the issue seems to be specific to network sockets.
It succeeds when using different port numbers for the two consecutive procServ calls.
After reading How do SO_REUSEADDR and SO_REUSEPORT differ? I'm convinced that the current implementation which uses SO_REUSEADDR but not SO_REUSEPORT is what we want. The fix for an old bug reported by @MarkRivers looks correct to me.

mark0n commented 3 years ago

I only see this issue with a large IOC which is consuming multiple GB of RAM, multiple CPU cores and spawns hundreds of threads. The problem is not showing with small IOC on a single-core VM.

ralphlange / procServ

procServ fails to start immediately after shutting it down #35