On a Nimbus 2.8 installation, the services.log file contains a few instances of the following error:
/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,092 INFO workspace.WorkspaceUtil [ServiceThread-164,runCommand:154] [NIMBUS-EVENT][id-25]: /opt/bin/qdel 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:qdel: Server could not connect to MOM 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,107 ERROR pilot.PilotSlotManagement [ServiceThread-164,releaseSpaceImpl:1077] Problem calling Torque qdel: return code = 222, stderr = 'qdel: Server could not connect to MOM 4562168.moab01.**.**.**', no stdout
The workspace service removes its record of the VM, although the pilot job has not been successfully terminated.
I think Nimbus should retry the qdel a number of times instead of simply logging an error. The current behaviour can leave zombie jobs in the PBS queue.
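To illustrate the kind of retry I have in mind, here is a minimal sketch in Python. It is not Nimbus code (the service itself is Java), and the function name `qdel_with_retry`, the attempt count, and the linear backoff are all assumptions made for illustration. Only the `/opt/bin/qdel` path and the non-zero return code come from the log above. Retrying on every non-zero exit is a simplification; a real fix would probably want to distinguish "could not connect to MOM" from errors that will never succeed.

```python
import subprocess
import time


def qdel_with_retry(job_id, qdel_path="/opt/bin/qdel", attempts=5,
                    delay_seconds=2.0, run=subprocess.run, sleep=time.sleep):
    """Try to delete a Torque job, retrying while qdel exits non-zero.

    Hypothetical helper: returns True as soon as qdel exits with 0,
    and False if every attempt fails. `run` and `sleep` are injectable
    so the retry logic can be exercised without a real Torque server.
    """
    for attempt in range(1, attempts + 1):
        result = run([qdel_path, job_id], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # For example return code 222, "Server could not connect to MOM".
        print(f"qdel attempt {attempt}/{attempts} failed: "
              f"rc={result.returncode}, stderr={result.stderr.strip()!r}")
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(delay_seconds * attempt)
    return False
```

With something like this, the caller would drop its record of the VM only when the function returns True, and otherwise keep the record (or flag the job for later cleanup) so that the pilot job is not forgotten while it is still in the queue.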
From a bug report by Sharon Goliath.