nimbusproject / nimbus

Nimbus - Open Source Cloud Computing Software - 100% Apache2 licensed
http://www.nimbusproject.org/

Nimbus allows qdels to fail in Pilot #87

Open oldpatricka opened 12 years ago

oldpatricka commented 12 years ago

From a bug report by Sharon Goliath:

On a Nimbus installation running version 2.8, the services.log file contains a few instances of the following error:

/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,092 INFO  workspace.WorkspaceUtil     [ServiceThread-164,runCommand:154] [NIMBUS-EVENT][id-25]: /opt/bin/qdel 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:qdel: Server could not connect to MOM 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,107 ERROR pilot.PilotSlotManagement     [ServiceThread-164,releaseSpaceImpl:1077] Problem calling Torque qdel: return code = 222, stderr = 'qdel: Server could not connect to MOM 4562168.moab01.**.**.**', no stdout

The workspace service removes its record of the VM, although the pilot job has not been successfully terminated.

I think Nimbus should probably retry the qdel a number of times, rather than simply logging an error. The current behaviour can leave zombie jobs in the PBS queue.
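
To make the suggestion concrete, here is a minimal sketch of the retry idea. It is written in Python for illustration only and is not the Nimbus pilot code. The function name `qdel_with_retry`, the attempt count, and the delay are all hypothetical; the qdel path is the one shown in the log above. The `run` and `sleep` parameters exist only so the logic can be exercised without a real Torque server.

```python
import subprocess
import time


def qdel_with_retry(job_id, qdel_path="/opt/bin/qdel", attempts=3,
                    delay_seconds=5.0, run=subprocess.run, sleep=time.sleep):
    """Call qdel up to `attempts` times; return True once it exits with code 0.

    Hypothetical helper: the name, defaults, and retry policy are assumptions.
    """
    for attempt in range(1, attempts + 1):
        result = run([qdel_path, job_id], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Log each failure in roughly the same shape as the existing error line.
        print(f"qdel attempt {attempt}/{attempts} failed: "
              f"return code = {result.returncode}, "
              f"stderr = {result.stderr.strip()!r}")
        if attempt < attempts:
            # Wait before retrying, in case the MOM is only briefly unreachable.
            sleep(delay_seconds)
    return False
```

With something like this, the caller could drop its record of the VM only when the function returns `True`, and otherwise keep the record (or flag it for cleanup) so the pilot job is not silently orphaned in the queue.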