grid proxy valid but associated VM in "NoProxy" state.

A VM is started for a spooled job specfying valid x509userproxy, +CSMyProxyServer, and +CSMyProxyCredsName attributes. Job completes and output is condor_transfer_data to the user workstation. User then issue condor_rm for the completed, output retrieved job that started the VM. Between the first job and the condor_rm, other jobs may run on the VM. However, once the VM becomes idle, any attempt by CS to shutdown the VM will result in the following CS log messages:

2011-06-16 14:41:17,907 - INFO - Cleanup - VM of type canfarbase id 285 no longe2011-06-16 14:41:17,907 - INFO - Cleanup - VM of type canfarbase id 285 no longer required - shutting down 2011-06-16 14:41:19,401 - DEBUG - Scheduler - ### Scheduler Cycle: 2011-06-16 14:41:19,408 - DEBUG - Scheduler - Schedule any high priority jobs 2011-06-16 14:41:19,415 - DEBUG - Scheduler - Scheduler - Waiting 10s 2011-06-16 14:41:19,921 - DEBUG - Cleanup - (vm_destroy) - VM shutdown request failed, moving directly to destroy. 2011-06-16 14:41:21,941 - WARNING - Cleanup - VM 285 was not correctly destroyed: Destroying workspace 285 @ "https://elephant.heprc.uvic.ca:8443/wsrf/services/WorkspaceService"... Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

2011-06-16 14:41:21,948 - ERROR - Cleanup - (cleanup) - Destroying VM failed in attempt to clear uneeded VM. Leaving VM. r required - shutting down 2011-06-16 14:41:19,401 - DEBUG - Scheduler - ### Scheduler Cycle: 2011-06-16 14:41:19,408 - DEBUG - Scheduler - Schedule any high priority jobs 2011-06-16 14:41:19,415 - DEBUG - Scheduler - Scheduler - Waiting 10s 2011-06-16 14:41:19,921 - DEBUG - Cleanup - (vm_destroy) - VM shutdown request failed, moving directly to destroy. 2011-06-16 14:41:21,941 - WARNING - Cleanup - VM 285 was not correctly destroyed: Destroying workspace 285 @ "https://elephant.heprc.uvic.ca:8443/wsrf/services/WorkspaceService"... Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

2011-06-16 14:41:21,948 - ERROR - Cleanup - (cleanup) - Destroying VM failed in attempt to clear uneeded VM. Leaving VM.

Hmmm, that's weird. According to the error message below, it looks like the VM cached proxy is not used for the shutdown (it seems to be using the job's proxy). After looking at the CS config file, it looks like the proxy_cache_dir is not set, and the code does not have any default value for that variable, which is probably causing this problem. I'll fix this...

Andre

On Thu, Jun 16, 2011 at 6:11 PM, crlb < reply@reply.github.com>wrote:

A VM is started for a spooled job specfying valid x509userproxy, +CSMyProxyServer, and +CSMyProxyCredsName attributes. Job completes and output is condor_transfer_data to the user workstation. User then issue condor_rm for the completed, output retrieved job that started the VM. Between the first job and the condor_rm, other jobs may run on the VM. However, once the VM becomes idle, any attempt by CS to shutdown the VM will result in the following CS log messages:

2011-06-16 14:41:17,907 - INFO - Cleanup - VM of type canfarbase id 285 no longe2011-06-16 14:41:17,907 - INFO - Cleanup - VM of type canfarbase id 285 no longer required - shutting down 2011-06-16 14:41:19,401 - DEBUG - Scheduler - ### Scheduler Cycle: 2011-06-16 14:41:19,408 - DEBUG - Scheduler - Schedule any high priority jobs 2011-06-16 14:41:19,415 - DEBUG - Scheduler - Scheduler - Waiting 10s 2011-06-16 14:41:19,921 - DEBUG - Cleanup - (vm_destroy) - VM shutdown request failed, moving directly to destroy. 2011-06-16 14:41:21,941 - WARNING - Cleanup - VM 285 was not correctly destroyed: Destroying workspace 285 @ " https://elephant.heprc.uvic.ca:8443/wsrf/services/WorkspaceService"... Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

2011-06-16 14:41:21,948 - ERROR - Cleanup - (cleanup) - Destroying VM failed in attempt to clear uneeded VM. Leaving VM. r required - shutting down 2011-06-16 14:41:19,401 - DEBUG - Scheduler - ### Scheduler Cycle: 2011-06-16 14:41:19,408 - DEBUG - Scheduler - Schedule any high priority jobs 2011-06-16 14:41:19,415 - DEBUG - Scheduler - Scheduler - Waiting 10s 2011-06-16 14:41:19,921 - DEBUG - Cleanup - (vm_destroy) - VM shutdown request failed, moving directly to destroy. 2011-06-16 14:41:21,941 - WARNING - Cleanup - VM 285 was not correctly destroyed: Destroying workspace 285 @ " https://elephant.heprc.uvic.ca:8443/wsrf/services/WorkspaceService"... Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

2011-06-16 14:41:21,948 - ERROR - Cleanup - (cleanup) - Destroying VM failed in attempt to clear uneeded VM. Leaving VM.

Reply to this email directly or view it on GitHub: https://github.com/hep-gc/cloud-scheduler/issues/194

OK, looks like I found the source of the problem. It seems that the proxy caching call for that VM failed cause it could not find the original proxy used to submit the job:

2011-06-16 14:15:47,298 - ERROR - Scheduler - Problem caching proxy. Continuing without Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/cloud_scheduler-0.13-py2.7.egg/cloudscheduler/cluster_tools.py", line 599, in vm_create job_proxy_file_path = self._cache_proxy(job_proxy_file_path) File "/usr/local/lib/python2.7/site-packages/cloud_scheduler-0.13-py2.7.egg/cloudscheduler/cluster_tools.py", line 979, in _cache_proxy shutil.copy2(proxy_file_path, tmp_proxy_file_path) File "/usr/local/lib/python2.7/shutil.py", line 127, in copy2 copyfile(src, dst) File "/usr/local/lib/python2.7/shutil.py", line 81, in copyfile with open(src, 'rb') as fsrc: IOError: [Errno 2] No such file or directory: '/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795'

Investigating...

On Mon, Jun 20, 2011 at 10:46 AM, Andre Charbonneau 4433258@gmail.comwrote:

Hmmm, that's weird. According to the error message below, it looks like the VM cached proxy is not used for the shutdown (it seems to be using the job's proxy). After looking at the CS config file, it looks like the proxy_cache_dir is not set, and the code does not have any default value for that variable, which is probably causing this problem. I'll fix this...

Andre

On Thu, Jun 16, 2011 at 6:11 PM, crlb < reply@reply.github.com>wrote:

A VM is started for a spooled job specfying valid x509userproxy, +CSMyProxyServer, and +CSMyProxyCredsName attributes. Job completes and output is condor_transfer_data to the user workstation. User then issue condor_rm for the completed, output retrieved job that started the VM. Between the first job and the condor_rm, other jobs may run on the VM. However, once the VM becomes idle, any attempt by CS to shutdown the VM will result in the following CS log messages:

2011-06-16 14:41:17,907 - INFO - Cleanup - VM of type canfarbase id 285 no longe2011-06-16 14:41:17,907 - INFO - Cleanup - VM of type canfarbase id 285 no longer required - shutting down 2011-06-16 14:41:19,401 - DEBUG - Scheduler - ### Scheduler Cycle: 2011-06-16 14:41:19,408 - DEBUG - Scheduler - Schedule any high priority jobs 2011-06-16 14:41:19,415 - DEBUG - Scheduler - Scheduler - Waiting 10s 2011-06-16 14:41:19,921 - DEBUG - Cleanup - (vm_destroy) - VM shutdown request failed, moving directly to destroy. 2011-06-16 14:41:21,941 - WARNING - Cleanup - VM 285 was not correctly destroyed: Destroying workspace 285 @ " https://elephant.heprc.uvic.ca:8443/wsrf/services/WorkspaceService"... Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

2011-06-16 14:41:21,948 - ERROR - Cleanup - (cleanup) - Destroying VM failed in attempt to clear uneeded VM. Leaving VM. r required - shutting down 2011-06-16 14:41:19,401 - DEBUG - Scheduler - ### Scheduler Cycle: 2011-06-16 14:41:19,408 - DEBUG - Scheduler - Schedule any high priority jobs 2011-06-16 14:41:19,415 - DEBUG - Scheduler - Scheduler - Waiting 10s 2011-06-16 14:41:19,921 - DEBUG - Cleanup - (vm_destroy) - VM shutdown request failed, moving directly to destroy. 2011-06-16 14:41:21,941 - WARNING - Cleanup - VM 285 was not correctly destroyed: Destroying workspace 285 @ " https://elephant.heprc.uvic.ca:8443/wsrf/services/WorkspaceService"... Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

Problem: General error: ; nested exception is: GSSException: Defective credential detected [Caused by: Proxy file (/var/lib/condor/spool/3854/0/cluster3854.proc0.subproc0/x509up_u10795) not found.]

2011-06-16 14:41:21,948 - ERROR - Cleanup - (cleanup) - Destroying VM failed in attempt to clear uneeded VM. Leaving VM.

Reply to this email directly or view it on GitHub: https://github.com/hep-gc/cloud-scheduler/issues/194

After further investigation into this issue, this is what I think is triggering this problem:

user submits job
CS issues nimbus workspace create command
before nimbus workspace create command returns, user deletes job from queue; job proxy is destroyed
nimbus workspace create call returns
CS tries to make a cached copy of the job proxy; call fails cause the job has been removed from the queue by the user
VM runs with no cached proxy
VM shutdown fails cause it cannot reference a valid proxy

To fix this, I need to move the proxy caching to happen right before the nimbus workspace create call, and make sure I clean up if the nimbus workspace create call fails.

Pushed a fix for this to the dev branch. Now create the cached proxy just before the VM is created, as opposed to right after.

https://github.com/hep-gc/cloud-scheduler/commit/140c2810de16d48f1a353ce2d99033adf67eccd0

Thanks Andre. CS is in use right now, so we'll schedule an upgrade at tomorrow's meeting. Mike also has some changes re tracking owner/vmtype and we need both of those on. Colin.

AndreCharbonneau wrote:

Pushed a fix for this to the dev branch. Now create the cached proxy just before the VM is created, as opposed to right after.

https://github.com/hep-gc/cloud-scheduler/commit/140c2810de16d48f1a353ce2d99033adf67eccd0

Colin Leavett-Brown Department of Physics & Astronomy University of Victoria 250-721-7728

hep-gc / cloud-scheduler

grid proxy valid but associated VM in "NoProxy" state. #194