Closed oldpatricka closed 13 years ago
I'm fairly confident this bug has been fixed. The problem was that in certain error scenarios, the scheduler was not returning VMM memory allocations back to the pool. Specifically, this happened when a request failed because of a network binding error (perhaps not enough available IP addresses).
The fix ended up being somewhat hairy and I'd like @timf to review if possible. It is also important that the pilot is tested before a release.
This seemed to work fine with pilot. qdel got called, which maybe didn't happen before?
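The shape of the leak and the fix can be sketched roughly like this. All names here are hypothetical stand-ins, not the actual Nimbus classes: the point is only that once memory is reserved from the VMM pool, every error path after the reservation (such as the network binding failure below) has to return the allocation and clean up the already-submitted pilot job.

```java
// Minimal sketch of the fix pattern (hypothetical names, NOT the real
// Nimbus code): memory reserved from the VMM pool must be returned on
// every error path after the reservation succeeds.
public class SlotReservationSketch {
    static int availableMemoryMB = 1024;

    // Hypothetical reservation: subtract from the pool.
    static void reserveMemory(int mb) {
        availableMemoryMB -= mb;
    }

    // Hypothetical release: return memory to the pool.
    static void releaseMemory(int mb) {
        availableMemoryMB += mb;
    }

    // Before the fix, a network-binding failure skipped the release,
    // permanently leaking the allocation from the pool.
    static void createInstance(int mb, boolean networkAvailable) {
        reserveMemory(mb);
        try {
            if (!networkAvailable) {
                throw new IllegalStateException(
                    "network 'public' is not currently available");
            }
            // ... proceed with VM creation ...
        } catch (IllegalStateException e) {
            releaseMemory(mb); // the fix: return the allocation to the pool
            // (and qdel the already-submitted pilot job, as in the log below)
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            createInstance(512, false);
        } catch (IllegalStateException expected) {
            // creation failed, but the pool must be whole again
        }
        System.out.println(availableMemoryMB); // prints 1024
    }
}
```

In the log below you can see this sequence end to end: the pilot slot is reserved via qsub, the network binding then fails, and qdel is issued to tear down the pilot job.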
2011-05-27 15:28:50,035 INFO defaults.CreationManagerImpl [ServiceThread-26,create:362] [NIMBUS-EVENT]: Create request for instance from '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'
2011-05-27 15:28:50,040 INFO groupauthz.Group [ServiceThread-26,decide:290]
Considering caller: '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'.
Current elapsed minutes: 54.
Current reserved minutes: 500.
Number of VMs in request: 1.
Charge ratio for request: 1.0.
Number of VMs caller is already currently running: 1.
Rights:
GroupRights for group 'TESTING': {maxReservedMinutes=0, maxElapsedReservedMinutes=0, maxWorkspaceNumber=5, maxWorkspacesInGroup=1, imageNodeHostname='example.com', imageBaseDirectory='/cloud', dirHashMode=true, maxCPUs=2}
Duration request: 500
2011-05-27 15:28:50,060 INFO pilot.PilotSlotManagement [ServiceThread-26,reserveSpaceImpl:804] pilot command = /opt/nimbus/bin/workspacepilot.py -t --reserveslot -m 512 -d 30002 -g 8 -i ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6 -c http://calliopex.phys.uvic.ca:41999/pilot_notification/v01/
2011-05-27 15:28:50,060 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT]: qsub -j oe -r n -m n -l nodes=1:ppn=1 -l walltime=08:20:02 -l mem=512mb -o /usr/local/nimbus/services/var/nimbus/pilot-logs/ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6
2011-05-27 15:28:50,082 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT]: Return code is 0
2011-05-27 15:28:50,083 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:270] [NIMBUS-EVENT]:
STDOUT:
6831.calliopex.phys.uvic.ca
2011-05-27 15:28:50,091 ERROR defaults.Util [ServiceThread-26,getNextEntry:88] network 'public' is not currently available
2011-05-27 15:28:50,097 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT][id-25]: qdel 6831.calliopex.phys.uvic.ca
2011-05-27 15:28:50,118 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT][id-25]: Return code is 0
2011-05-27 15:28:50,126 ERROR factory.FactoryService [ServiceThread-26,create:109] Error creating workspace(s): network 'public' is not currently available
Occasionally, Nimbus will report nodes as in_use: true when it is impossible for them to be in use. For example, no VMs are booted, yet five nodes are marked as in use.
This looks like:
Here is a gist of the last few days of my nimbus log:
https://gist.github.com/954464