oar-team / oar

OAR is a versatile resource and task manager (also called a batch scheduler) for clusters and other computing infrastructures.
http://oar.imag.fr/
GNU General Public License v2.0

oar server service restart doesn't work properly #112

Open bpichot opened 8 years ago

bpichot commented 8 years ago

Sometimes on Grid'5000, after a restart, OAR gets blocked and no longer schedules jobs.

It seems that some processes are not killed. After a "stop" we can get this message:

Stopping OAR server:/sbin/start-stop-daemon: warning: failed to kill 31759: No such process

but 'ps' still returns:

oar 28822 0.0 0.1 60460 10912 ? S Apr14 0:01 /usr/bin/perl /usr/lib/oar/Almighty
oar 28823 0.0 0.1 60460 11640 ? S Apr14 0:32 Almighty: appendice
oar 28825 0.0 0.1 60460 11652 ? S Apr14 1:31 Almighty: bipbip
oar 28207 0.0 0.2 116756 22452 ? S 15:23 0:02 Almighty: hulot
oar 31918 0.0 0.0 0 0 ? Z 15:44 0:00 [Almighty: hulot]
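
For anyone trying to reproduce this, here is a minimal sketch of a check to run right after a "stop". It assumes the `ps aux` column layout seen in the output above (user in column 1, PID in column 2, state in column 8, command from column 11 on). `leftover_almighty` is a hypothetical helper name, not something shipped with OAR.

```shell
#!/bin/sh
# Sketch only: leftover_almighty is a hypothetical helper, not part of OAR.
# It reads "ps aux"-style lines on stdin and prints "PID STATE COMMAND"
# for every process owned by user "oar" whose line mentions Almighty.
# Assumption: the command starts at field 11, as in standard "ps aux" output.
leftover_almighty() {
    awk '$1 == "oar" && /Almighty/ {
        cmd = $11
        for (i = 12; i <= NF; i++) cmd = cmd " " $i
        print $2, $8, cmd
    }'
}
```

Possible usage after stopping the server: `ps aux | leftover_almighty`. Any line it prints is a process that survived the stop; a state of `Z` marks a zombie such as the `[Almighty: hulot]` entry above.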

npf commented 7 years ago

@lnussbaum please.

lnussbaum commented 7 years ago

@npf what are you expecting me to do here? This problem seems to be related to OAR's init script. In the mid/long term, it could be replaced by a systemd unit file. But this won't solve G5K's problem, as it's still running on Wheezy.

Also, what would be the best way to work on this? Do you have a Vagrant env?

npf commented 7 years ago

Could you help us with the systemd unit file? The targets are Debian Stretch and Jessie-backports, first of all.

We have vagrant testboxes with the oar-vagrant project. It should be fairly easy to work with it. https://oar.imag.fr/wiki:oar-vagrant

lnussbaum commented 6 years ago

A very basic service file that should work:

[Unit]
Description=OAR server
Documentation=man:oar-server(1)
After=network-online.target
After=remote-fs.target
After=postgresql.service
After=mysql-server.service
Wants=network-online.target

[Service]
ExecStart=/usr/sbin/oar-server

[Install]
WantedBy=multi-user.target

(not tested)

to test: