Open mturilli opened 10 years ago
You will see this also in the log when the pilots time out before the cancel is called -- which happens sometimes. I'll check if this is the case.
I assume the ticket is about sagapilot, right?
Unfortunately, I don't know where the error is but I see it from TROY so I opened a ticket here. These are the last lines of the logs:
2014:02:17 00:01:31 MainThread troy.logger : [INFO ] workload_1 done
2014:02:17 00:01:31 MainThread troy.logger : [INFO ] workload_2 done
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed start : cancel [] : 2014-02-17 05:01:31.371043 (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed stop : cancel [] : 2014-02-17 05:01:31.371375 (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed duration : cancel [] : 0.000332 sec
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed start : cancel [] : 2014-02-17 05:01:31.371849 (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed stop : cancel [] : 2014-02-17 05:01:31.372102 (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed duration : cancel [] : 0.000253 sec
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed start : cancel [] : 2014-02-17 05:01:31.372594 (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed event [p.0001] : start : ['sinon'] (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed event [p.0001] : state_detail : ['sinon', u"Created agent directory 'sftp://india.futuregrid.org//N/u/mturilli/troy_agents/pilot-5301975e3cf749400de16537/'."] (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed event [p.0001] : state_detail : ['sinon', u"Copied 'file://localhost//Users/mturilli/Virtualenvs/TROY_master/bin/bootstrap-and-run-agent' script to agent directory."] (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed event [p.0001] : state_detail : ['sinon', u"Copied 'file://localhost//Users/mturilli/Virtualenvs/TROY_master/lib/python2.7/site-packages/sagapilot/agent/sagapilot-agent.py' script to agent directory."] (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed event [p.0001] : state_detail : ['sinon', u"ComputePilot agent successfully submitted with JobID '[pbs+ssh://india.futuregrid.org]-[1470595]'"] (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [INFO ] cancel pilot p.0001
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed stop : cancel [] : 2014-02-17 05:01:31.378082 (UTC)
2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed duration : cancel [] : 0.005488 sec
This line:
2014:02:17 00:01:31 MainThread troy.logger : [INFO ] cancel pilot p.0001
Seems to indicate that TROY thinks the pilot has been canceled, but there are no complaints or log entries from sagapilot showing that such a request was honored.
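To look for the same pattern in longer logs, here is a minimal sketch in Python. It assumes each log record sits on a single line (wrapped records re-joined) and follows the layout of the excerpt above. The names `RECORD`, `cancel_requests`, and `cancel_durations` are hypothetical helpers, not part of TROY or sagapilot. It only reports what TROY logged and cannot confirm that the job was actually killed on the resource.

```python
import re

# Layout assumed from the excerpt: date, time, thread, logger, [LEVEL ], message.
RECORD = re.compile(
    r"^(?P<date>\d{4}:\d{2}:\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<thread>\S+) (?P<logger>\S+)\s*:\s*\[(?P<level>\w+)\s*\]\s*(?P<msg>.*)$"
)

# Message shapes seen in the excerpt (hypothetical for any other TROY version).
CANCEL_MSG = re.compile(r"cancel\s+pilot\s+(\S+)$")
DURATION_MSG = re.compile(
    r"timed\s+duration\s*:\s*cancel\s*\[.*?\]\s*:\s*([\d.]+)\s*sec$"
)


def cancel_requests(lines):
    """Return the pilot ids for which an INFO 'cancel pilot <id>' was logged."""
    pilots = []
    for line in lines:
        record = RECORD.match(line.strip())
        if not record or record.group("level") != "INFO":
            continue
        found = CANCEL_MSG.match(record.group("msg"))
        if found:
            pilots.append(found.group(1))
    return pilots


def cancel_durations(lines):
    """Return the logged durations (in seconds) of the timed cancel calls."""
    durations = []
    for line in lines:
        record = RECORD.match(line.strip())
        if not record:
            continue
        found = DURATION_MSG.match(record.group("msg"))
        if found:
            durations.append(float(found.group(1)))
    return durations


# Three records copied from the log excerpt above.
SAMPLE = [
    "2014:02:17 00:01:31 MainThread troy.logger : [INFO ] cancel pilot p.0001",
    "2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed stop : cancel [] : 2014-02-17 05:01:31.378082 (UTC)",
    "2014:02:17 00:01:31 MainThread troy.logger : [DEBUG ] timed duration : cancel [] : 0.005488 sec",
]

if __name__ == "__main__":
    print(cancel_requests(SAMPLE))
    print(cancel_durations(SAMPLE))
```

A cancel that returns within a few milliseconds, as in the excerpt, may be worth comparing against the job state reported by the queue on the remote resource.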
On Mon, Feb 17, 2014 at 2:23 AM, Andre Merzky notifications@github.com wrote:
You will see this also in the log when the pilots time out before the cancel is called -- which happens sometimes. I'll check if this is the case.
I assume the ticket is about sagapilot, right?
Dr Matteo Turilli Department of Electrical and Computer Engineering Rutgers University
Just a note that I see mixed results: sometimes pilots are canceled, sometimes they aren't. Still not sure if the fault is at the troy or SP level...
On both FutureGrid and XSEDE, overlay_mgr.cancel_overlay(overlay.id) does not kill the pilot jobs of the overlay.