Closed oldpatricka closed 13 years ago
I'm fairly confident this bug has been fixed. The problem was that in certain error scenarios, the scheduler was not returning VMM memory allocations back to the pool. Specifically, this happened when a request failed because of a network binding error (perhaps not enough available IP addresses).
The fix ended up being somewhat hairy and I'd like @timf to review if possible. It is also important that the pilot is tested before a release.
This seemed to work fine with pilot. qdel got called, which maybe didn't happen before?
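The shape of the leak and the fix can be sketched roughly like this. All names here are hypothetical stand-ins, not the actual Nimbus classes: the point is only that once memory is reserved from the VMM pool, every error path after the reservation (such as the network binding failure below) has to return the allocation and clean up the already-submitted pilot job.

```java
// Minimal sketch of the fix pattern (hypothetical names, NOT the real
// Nimbus code): memory reserved from the VMM pool must be returned on
// every error path after the reservation succeeds.
public class SlotReservationSketch {
    static int availableMemoryMB = 1024;

    // Hypothetical reservation: subtract from the pool.
    static void reserveMemory(int mb) {
        availableMemoryMB -= mb;
    }

    // Hypothetical release: return memory to the pool.
    static void releaseMemory(int mb) {
        availableMemoryMB += mb;
    }

    // Before the fix, a network-binding failure skipped the release,
    // permanently leaking the allocation from the pool.
    static void createInstance(int mb, boolean networkAvailable) {
        reserveMemory(mb);
        try {
            if (!networkAvailable) {
                throw new IllegalStateException(
                    "network 'public' is not currently available");
            }
            // ... proceed with VM creation ...
        } catch (IllegalStateException e) {
            releaseMemory(mb); // the fix: return the allocation to the pool
            // (and qdel the already-submitted pilot job, as in the log below)
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            createInstance(512, false);
        } catch (IllegalStateException expected) {
            // creation failed, but the pool must be whole again
        }
        System.out.println(availableMemoryMB); // prints 1024
    }
}
```

In the log below you can see this sequence end to end: the pilot slot is reserved via qsub, the network binding then fails, and qdel is issued to tear down the pilot job.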
2011-05-27 15:28:50,035 INFO defaults.CreationManagerImpl [ServiceThread-26,create:362] [NIMBUS-EVENT]: Create request for instance from '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'
2011-05-27 15:28:50,040 INFO groupauthz.Group [ServiceThread-26,decide:290]
Considering caller: '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'.
Current elapsed minutes: 54.
Current reserved minutes: 500.
Number of VMs in request: 1.
Charge ratio for request: 1.0.
Number of VMs caller is already currently running: 1.
Rights:
GroupRights for group 'TESTING': {maxReservedMinutes=0, maxElapsedReservedMinutes=0, maxWorkspaceNumber=5, maxWorkspacesInGroup=1, imageNodeHostname='example.com', imageBaseDirectory='/cloud', dirHashMode=true, maxCPUs=2}
Duration request: 500
2011-05-27 15:28:50,060 INFO pilot.PilotSlotManagement [ServiceThread-26,reserveSpaceImpl:804] pilot command = /opt/nimbus/bin/workspacepilot.py -t --reserveslot -m 512 -d 30002 -g 8 -i ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6 -c http://calliopex.phys.uvic.ca:41999/pilot_notification/v01/
2011-05-27 15:28:50,060 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT]: qsub -j oe -r n -m n -l nodes=1:ppn=1 -l walltime=08:20:02 -l mem=512mb -o /usr/local/nimbus/services/var/nimbus/pilot-logs/ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6
2011-05-27 15:28:50,082 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT]: Return code is 0
2011-05-27 15:28:50,083 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:270] [NIMBUS-EVENT]:
STDOUT:
6831.calliopex.phys.uvic.ca
2011-05-27 15:28:50,091 ERROR defaults.Util [ServiceThread-26,getNextEntry:88] network 'public' is not currently available
2011-05-27 15:28:50,097 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT][id-25]: qdel 6831.calliopex.phys.uvic.ca
2011-05-27 15:28:50,118 INFO workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT][id-25]: Return code is 0
2011-05-27 15:28:50,126 ERROR factory.FactoryService [ServiceThread-26,create:109] Error creating workspace(s): network 'public' is not currently available
Occasionally, Nimbus will report nodes as in_use: true when it is impossible for them to be in use. For example, no VMs are booted, yet five nodes are marked as in use.
This looks like:
Here is a gist of the last few days of my nimbus log:
https://gist.github.com/954464