son-of-gridengine / sge


Deleting exec-host with jobs in 'dr' state is allowed #9

Open danpovey opened 5 years ago

danpovey commented 5 years ago

See the email chain pasted below. The basic issue, I believe, is that you can run qconf -de some_host while there are jobs in state 'dr' on that host. That crashes the gridengine master, and restarting it fails; the message in /var/spool/gridengine/qmaster/messages is:
11/10/2018 16:23:27| main|deb8qmaster|C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

I'm not sure which part of the code deals with this; it should probably be fixed.

I was able to fix it, although I suspect my fix may have been disruptive to running jobs.

First, I believe the problem is that gridengine does not handle a deleted job sitting on a host that has itself been deleted, and it dies when it encounters one. Presumably the real bug is allowing the host to be deleted in the first place.
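The missing check might look something like the sketch below. This is purely illustrative Python (the real qmaster is C, and the data structures here are hypothetical stand-ins for its job list): the idea is simply that host deletion should be refused while any job, including one in the 'dr' state, still references the host.

```python
# Hypothetical sketch (NOT the actual qmaster code) of the missing
# validation: refuse to delete an exec host while any job still
# references it, including jobs in the 'dr' (deleting, running) state.

def can_delete_exec_host(hostname, jobs):
    """Return (ok, reason). `jobs` is an iterable of dicts with
    'host' and 'state' keys, standing in for the qmaster's job list."""
    blockers = [j for j in jobs if j["host"] == hostname]
    if blockers:
        states = sorted({j["state"] for j in blockers})
        return False, (f"{len(blockers)} job(s) in state(s) {states} "
                       f"still reference {hostname}")
    return True, ""
```

With a guard like this, the qconf -de in the report would have been rejected with an error instead of leaving a dangling EH_name reference that kills the master on restart.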

Anyway, my fix (after backing up the directory /var/spool/gridengine) was to move the file /var/spool/gridengine/spooldb/sge_job to a temporary location, restart the qmaster, add the host back with qconf -ah, stop the qmaster, restore the old database  /var/spool/gridengine/spooldb/sge_job, and restart the qmaster.
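The file shuffle in that fix can be sketched as follows. The paths and contents are illustrative (a throwaway directory stands in for /var/spool/gridengine), and the qmaster restarts and qconf -ah step are shown only as comments, since just the file moves can be demonstrated safely here.

```python
import shutil
import tempfile
from pathlib import Path

# Illustrative sketch of the recovery sequence. A temp directory
# stands in for /var/spool/gridengine; only the file moves run.

spool = Path(tempfile.mkdtemp())
sge_job = spool / "spooldb" / "sge_job"
sge_job.parent.mkdir(parents=True)
sge_job.write_bytes(b"job database contents")

backup = spool.with_suffix(".bak")
shutil.copytree(spool, backup)       # 0. back up the whole spool first

aside = spool / "sge_job.aside"
shutil.move(sge_job, aside)          # 1. move the job DB aside
# 2. start qmaster (it comes up with no jobs), qconf -ah <host> to re-add
# 3. stop qmaster again
shutil.move(aside, sge_job)          # 4. restore the original job DB
# 5. start qmaster; it now knows both the host and the old jobs
```

The ordering matters: the host must be re-added while the job database is out of the way, so that when the old jobs come back the host they reference exists again.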

Before starting that whole procedure, I stopped all the gridengine-exec services so the hosts wouldn't get confused. That probably wasn't optimal, because clients like qsub and qstat could still reach the queue in the interim, and it definitely would have confused them and killed some processes. Unfortunately I had to do this on short notice and wasn't sure how to use iptables to close off those ports from outside the qmaster while I did the maintenance; that would have been a better solution.

Also I encountered a hiccup: `systemctl stop gridengine-qmaster` didn't actually work the second time; the process was still running with the old database, so I had to kill it manually and retry.

Anyway this whole episode is making me think more seriously about moving to Univa GridEngine.  I've known for a long time that the free version has a lot of bugs, and I just don't have time to deal with this type of thing.

On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <john.marshall2@canada.ca> wrote:
Hi,

I've never seen this but I would start with:
1) strace qmaster during restart to try to see at which point it is dying (e.g.,
loading a config file)
2) look for any reference to the name of the host you deleted in the spool
area and do some cleanup
3) clean out the jobs spool area
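Step 2 above amounts to searching the spool area for any file that still mentions the deleted hostname. A small helper along these lines (illustrative only, not part of SGE; it assumes a classic file-based spool) could do the audit:

```python
from pathlib import Path

def find_host_references(spool_dir, hostname):
    """Return the paths under spool_dir whose contents mention hostname.
    Illustrative helper for auditing a file-based (classic) spool area."""
    hits = []
    for path in Path(spool_dir).rglob("*"):
        if path.is_file():
            try:
                if hostname in path.read_text(errors="ignore"):
                    hits.append(path)
            except OSError:
                pass  # skip unreadable files (sockets, permission issues)
    return sorted(hits)
```

Running it against a backup copy of the spool directory shows exactly which files would need cleanup before attempting a restart.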

HTH,
John

On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
Has anyone found this error, and managed to fix it?
I am in a very difficult situation.
I deleted a host (qconf -de hostname) thinking that the machine no longer existed, but it did exist, and there was a job in 'dr' state there.
After I attempted to force-delete that job (qdel -f job-id), the queue master died with out-of-memory, and now I can't restart qmaster.

So now I don't know how to fix it.  Am I just completely lost now?

Dan
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Kunzol commented 5 years ago

I have no bugfix, but maybe a hint for getting the system up and running again.

SGE uses a database to store its runtime information. There are two backends: flat files and BerkeleyDB.

Make a backup before you poke around in the config!!!

The file-based database can easily be edited, and the nodes can be deleted manually with a text editor.

With BerkeleyDB you can use some of the tools installed together with SGE in "utilbin", or the generic BerkeleyDB utilities (though it is usually hard to find the matching version).

Hope this helps to bring SGE back into a running state.

danpovey commented 5 years ago

Thanks... I managed to fix my problem by deleting the jobs db after backing it up, restarting the master, re-adding the node whose deletion caused the problem, then stopping the master, copying back the old jobs db, and restarting the master. This would have confused clients, except I stopped their daemons first.

Not ideal, of course.

I believe the bug is in allowing a node with 'dr' jobs to be deleted. Shouldn't be hard to fix if someone knows the code.
