ralphlange / procServ

Wrapper to start arbitrary interactive commands in the background, with telnet or Unix domain socket access to stdin/stdout
GNU General Public License v3.0

procServ fails to start immediately after shutting it down #35

Open mark0n opened 4 years ago

mark0n commented 4 years ago
Jun 29 15:38:38 ls2-rf systemd[1]: Stopping EPICS Soft IOC ls2-rf-llrf...
Jun 29 15:38:38 ls2-rf systemd[1]: softioc-ls2-rf-llrf.service: Succeeded.
Jun 29 15:38:38 ls2-rf systemd[1]: Stopped EPICS Soft IOC ls2-rf-llrf.
Jun 29 15:38:38 ls2-rf systemd[1]: Starting EPICS Soft IOC ls2-rf-llrf...
Jun 29 15:38:38 ls2-rf procServ[1984]: Caught an exception creating the initial control telnet port: Bad file descriptor
Jun 29 15:38:38 ls2-rf procServ[1984]: /usr/bin/procServ: Exiting with error code: 98
Jun 29 15:38:38 ls2-rf systemd[1]: softioc-ls2-rf-llrf.service: Main process exited, code=exited, status=98/n/a
Jun 29 15:38:38 ls2-rf systemd[1]: softioc-ls2-rf-llrf.service: Failed with result 'exit-code'.
Jun 29 15:38:38 ls2-rf systemd[1]: Failed to start EPICS Soft IOC ls2-rf-llrf.

After this failure, systemd automatically restarts the service, and it succeeds on the second try. Here's my service file:

[Unit]
Description=EPICS Soft IOC ls2-rf-llrf
Requires=network.target
Wants=caRepeater.service
After=network.target caRepeater.service
RequiresMountsFor=

[Service]
Environment="AUTOSAVE_DIR=/var/lib/softioc-ls2-rf-llrf"
Environment="EPICS_CA_MAX_ARRAY_BYTES=8519680"
Environment="EPICS_CA_SEC_FILE=/usr/local/lib/iocapps/fe_ca_sec/rf.acf"
Environment="EPICS_IOC_LOG_INET=sys-logstash.cts"
Environment="EPICS_IOC_LOG_PORT=7005"
Environment="EPICS_PUT_LOG_INET=sys-logstash.cts"
Environment="EPICS_PUT_LOG_PORT=7004"
Environment="ETCD_SERVER=etcd.cts:2379"
EnvironmentFile=-/etc/iocs/ls2-rf-llrf/config
ExecStart=/usr/bin/procServ --foreground --quiet --chdir=/usr/local/lib/iocapps/ls2-rf-llrf/iocBoot/iocls2-rf-llrf --ignore=^C^D^] --coresize=10000000 --restrict --logfile=/var/log/softioc-ls2-rf-llrf/procServ.log --name ls2-rf-llrf --port 4051 --port unix:/run/softioc-ls2-rf-llrf/procServ.sock /usr/local/lib/iocapps/ls2-rf-llrf/iocBoot/iocls2-rf-llrf/st.cmd
Type=notify
NotifyAccess=all
Restart=always
User=softioc
RuntimeDirectory=softioc-ls2-rf-llrf

[Install]
WantedBy=multi-user.target

I'm using procServ 2.7.0-1 on Debian 10 "buster".

daykin commented 4 years ago

I am able to reproduce this on Buster without systemd on upstream 2.8.0. I built procServ in a directory adjacent to the IOC top, and ran this script in the procServ directory:

#!/usr/bin/env bash
echo "create a procServ instance, wait for it to spin up"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "kill (SIGTERM) and don't wait at all, should see error 98"
pkill procServ
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "start new process"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
pkill procServ
echo "kill, then wait 10s"
sleep 10
echo "start new process"
./procServ --quiet --chdir=../llrfioc/iocBoot/iocllrf  --coresize=10000000 --restrict --logfile=./procServ.log --name llrf --port 4051 ./st.cmd
sleep 5
echo "kill the new process that started correctly"
pkill  procServ
mark0n commented 3 years ago

@daykin after heading in the wrong direction for quite a while myself, I realized that this is not the right way to reproduce the problem, since there's an important difference:

The following version of the script should behave more like systemd:

#!/usr/bin/bash
./procServ --foreground --restrict --logfile=./procServ.log --port 4051 sleep 100000
./procServ --foreground --restrict --logfile=./procServ.log --port 4051 sleep 100000

However, it does not show the problem when I send a SIGTERM (by running pkill -x procServ in a second terminal). I do see the problem when I run pkill -9 -x procServ, though.
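
To see whether something is still listening on the control port between the kill and the restart, a small probe like the following could be run from the second terminal. This is only a sketch: the helper names port_in_use and wait_port_free, the port number and the timeout are illustrative and not part of procServ, and the probe relies on bash's /dev/tcp redirection.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, port and timeout are assumptions for illustration.

# port_in_use PORT: succeeds if something accepts TCP connections on
# 127.0.0.1:PORT. The subshell makes sure fd 3 is closed again right away.
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# wait_port_free PORT [TIMEOUT]: poll once per second until nothing is
# listening on PORT. Returns 1 if a listener is still there after TIMEOUT
# seconds (default 10), 0 otherwise.
wait_port_free() {
    local port=$1 timeout=${2:-10} waited=0
    while port_in_use "$port"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, `pkill -9 -x procServ; wait_port_free 4051 10 || echo "port 4051 still has a listener"`. Each probe opens a short-lived connection to the port, so it may show up as a client in procServ's log. The probe only detects an active listener; a bind can fail for other reasons, so a "free" result does not rule out the error.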

Other findings:

mark0n commented 3 years ago

I only see this issue with a large IOC that consumes multiple GB of RAM, uses multiple CPU cores and spawns hundreds of threads. The problem does not show up with a small IOC on a single-core VM.
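
For anyone comparing setups, the size of the IOC process can be read from /proc on Linux. The helper below is a minimal sketch; the name proc_size is made up for illustration, and it simply reports the Threads and VmRSS fields of /proc/PID/status.

```shell
#!/usr/bin/env bash
# Sketch only: proc_size is a hypothetical helper, not part of procServ or EPICS.

# proc_size PID: print the thread count and resident memory (in kB) of a
# Linux process, taken from the Threads: and VmRSS: lines of /proc/PID/status.
proc_size() {
    awk '/^Threads:/ {t = $2}
         /^VmRSS:/   {r = $2}
         END         {printf "threads=%s rss_kB=%s\n", t, r}' "/proc/$1/status"
}
```

Calling `proc_size <pid of the IOC>` prints one line that can be pasted next to a report of whether the restart failure occurred.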