Open bpichot opened 8 years ago
@lnussbaum please.
@npf what are you expecting me to do here? This problem seems to be related to OAR's init script. In the mid to long term, that script could be replaced by a systemd unit file, but this won't solve G5K's problem, as G5K is still running on wheezy.
Also, what would be the best way to work on this? Do you have a Vagrant environment?
Could you help us with the systemd unit file? The first targets are Debian Stretch and Jessie-backports.
We have Vagrant test boxes in the oar-vagrant project, which should be fairly easy to work with: https://oar.imag.fr/wiki:oar-vagrant
Here is a very basic service file that should work:
[Unit]
Description=OAR server
Documentation=man:oar-server(1)
After=network-online.target
After=remote-fs.target
After=postgresql.service
After=mysql-server.service
Wants=network-online.target
[Service]
ExecStart=/usr/sbin/oar-server
[Install]
WantedBy=multi-user.target
(not tested)
to test:
Sometimes on Grid'5000, after a restart, OAR gets blocked and no longer schedules jobs.
It seems that some processes are not killed. After a "stop" we can get this message:
Stopping OAR server:/sbin/start-stop-daemon: warning: failed to kill 31759: No such process
but 'ps' still returns:
oar 28822 0.0 0.1 60460 10912 ? S Apr14 0:01 /usr/bin/perl /usr/lib/oar/Almighty
oar 28823 0.0 0.1 60460 11640 ? S Apr14 0:32 Almighty: appendice
oar 28825 0.0 0.1 60460 11652 ? S Apr14 1:31 Almighty: bipbip
oar 28207 0.0 0.2 116756 22452 ? S 15:23 0:02 Almighty: hulot
oar 31918 0.0 0.0 0 0 ? Z 15:44 0:00 [Almighty: hulot]
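To make the "stop leaves processes behind" symptom easier to check when testing a new init script or unit file, here is a minimal Python sketch that scans `ps aux`-style output for Almighty processes still owned by the `oar` user. The function names (`find_leftover_almighty`, `leftovers_on_this_host`) and the `Leftover` type are hypothetical helpers written for this thread, not part of OAR, and the parsing assumes the standard 11-column `ps aux` layout shown in the report.

```python
import subprocess
from typing import List, NamedTuple


class Leftover(NamedTuple):
    """One Almighty-related process found in ps output (hypothetical helper type)."""
    pid: int
    stat: str
    command: str


def find_leftover_almighty(ps_output: str, user: str = "oar") -> List[Leftover]:
    """Return the processes owned by `user` whose command mentions Almighty.

    Assumes `ps aux` columns:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    leftovers = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue
        if "Almighty" not in fields[10]:
            continue
        leftovers.append(Leftover(int(fields[1]), fields[7], fields[10]))
    return leftovers


def leftovers_on_this_host(user: str = "oar") -> List[Leftover]:
    """Run `ps aux` locally and report leftover Almighty processes."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return find_leftover_almighty(result.stdout, user)
```

Calling `leftovers_on_this_host()` right after stopping the service should return an empty list if everything was killed. Entries whose `stat` starts with `Z` are zombies, like the last `hulot` line in the report.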