radical-cybertools / radical.owms

Tiered Resource OverlaY

overlay_mgr.cancel_overlay(overlay.id) does not work #50

Open mturilli opened 10 years ago

mturilli commented 10 years ago

On both FutureGrid and XSEDE, overlay_mgr.cancel_overlay(overlay.id) does not kill the pilot jobs of the overlay. The jobs keep running until PBS aborts them at the walltime limit:

PBS Job Id: 1926861.trestles-fe1.local
Job Name:   SAGA-Python-PBSJobScript.lP2b85
Exec host:  trestles-9-20/0+trestles-9-20/1+trestles-9-20/2+trestles-9-20/3+trestles-9-20/4+trestles-9-20/5+trestles-9-20/6+trestles-9-20/7+trestles-9-20/8+trestles-9-20/9+trestles-9-20/10+trestles-9-20/11+trestles-9-20/12+trestles-9-20/13+trestles-9-20/14+trestles-9-20/15+trestles-9-20/16+trestles-9-20/17+trestles-9-20/18+trestles-9-20/19+trestles-9-20/20+trestles-9-20/21+trestles-9-20/22+trestles-9-20/23+trestles-9-20/24+trestles-9-20/25+trestles-9-20/26+trestles-9-20/27+trestles-9-20/28+trestles-9-20/29+trestles-9-20/30+trestles-9-20/31
Aborted by PBS Server
Job exceeded its walltime limit. Job was aborted
See Administrator for help
Exit_status=-11
resources_used.cput=00:00:24
resources_used.mem=26036kb
resources_used.vmem=884476kb
resources_used.walltime=05:00:13

andre-merzky commented 10 years ago

You will see this also in the log when the pilots time out before the cancel is called -- which happens sometimes. I'll check if this is the case.

I assume the ticket is about sagapilot, right?
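A PBS notice like the one above can be checked mechanically for the walltime abort. The following is a minimal sketch using only the Python standard library; `parse_pbs_notice` is a hypothetical helper, not part of TROY, sagapilot, or SAGA-Python, and it assumes the notice text has the same fields as the one pasted in this ticket. It only tells whether the job ran into its walltime limit; it cannot tell whether a cancel request was ever sent.

```python
import re
from datetime import timedelta

def parse_pbs_notice(text):
    """Extract the walltime used and whether PBS aborted the job at its limit.

    Hypothetical helper: assumes the field names seen in the notice above.
    """
    match = re.search(r"resources_used\.walltime=(\d+):(\d{2}):(\d{2})", text)
    used = None
    if match:
        hours, minutes, seconds = (int(g) for g in match.groups())
        used = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return {
        "walltime_used": used,
        "hit_walltime_limit": "exceeded its walltime limit" in text,
    }

# Excerpt of the notice quoted in this ticket.
notice = """\
Aborted by PBS Server
Job exceeded its walltime limit. Job was aborted
Exit_status=-11
resources_used.walltime=05:00:13
"""

info = parse_pbs_notice(notice)
print(info["walltime_used"], info["hit_walltime_limit"])
```

A job that was cancelled in time would not be expected to carry the "exceeded its walltime limit" message, so a positive result here means the pilot was still alive when its allocation ran out.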

mturilli commented 10 years ago

Unfortunately, I don't know where the error is, but I see it from TROY, so I opened a ticket here. These are the last lines of the log:


2014:02:17 00:01:31 MainThread   troy.logger           : [INFO    ] workload_1 done
2014:02:17 00:01:31 MainThread   troy.logger           : [INFO    ] workload_2 done
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed start    : cancel [] : 2014-02-17 05:01:31.371043 (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed stop     : cancel [] : 2014-02-17 05:01:31.371375 (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed duration : cancel [] : 0.000332  sec
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed start    : cancel [] : 2014-02-17 05:01:31.371849 (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed stop     : cancel [] : 2014-02-17 05:01:31.372102 (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed duration : cancel [] : 0.000253  sec
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed start    : cancel [] : 2014-02-17 05:01:31.372594 (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed event [p.0001]   : start : ['sinon'] (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed event [p.0001]   : state_detail : ['sinon', u"Created agent directory 'sftp://india.futuregrid.org//N/u/mturilli/troy_agents/pilot-5301975e3cf749400de16537/'."] (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed event [p.0001]   : state_detail : ['sinon', u"Copied 'file://localhost//Users/mturilli/Virtualenvs/TROY_master/bin/bootstrap-and-run-agent' script to agent directory."] (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed event [p.0001]   : state_detail : ['sinon', u"Copied 'file://localhost//Users/mturilli/Virtualenvs/TROY_master/lib/python2.7/site-packages/sagapilot/agent/sagapilot-agent.py' script to agent directory."] (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed event [p.0001]   : state_detail : ['sinon', u"ComputePilot agent successfully submitted with JobID '[pbs+ssh://india.futuregrid.org]-[1470595]'"] (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [INFO    ] cancel pilot    p.0001
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed stop     : cancel [] : 2014-02-17 05:01:31.378082 (UTC)
2014:02:17 00:01:31 MainThread   troy.logger           : [DEBUG   ] timed duration : cancel [] : 0.005488  sec

This line:

2014:02:17 00:01:31 MainThread   troy.logger           : [INFO    ] cancel pilot    p.0001

Seems to indicate that TROY thinks the pilot has been canceled, but there are no complaints or log entries from sagapilot showing that such a request has been honored.
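To pull the cancel-related entries out of a longer TROY log, a small filter along these lines may help. It is only a sketch: `summarize_cancels` and both regular expressions are hypothetical, written against the log format quoted in this ticket rather than any documented TROY interface. Whitespace is matched loosely so that entries wrapped across lines by a mail client still match.

```python
import re

# "timed duration : cancel [] : <seconds>  sec" entries, one per timed cancel call.
DURATION_RE = re.compile(
    r"timed\s+duration\s*:\s*cancel\s*\[\]\s*:\s*([0-9.]+)\s+sec"
)
# INFO entries announcing a pilot cancel, e.g. "cancel pilot    p.0001".
CANCEL_RE = re.compile(r"\[INFO\s*\]\s*cancel\s+pilot\s+(\S+)")

def summarize_cancels(log_text):
    """Return (pilot ids that TROY logged a cancel for, cancel durations in seconds)."""
    pilots = CANCEL_RE.findall(log_text)
    durations = [float(d) for d in DURATION_RE.findall(log_text)]
    return pilots, durations

# Two entries copied from the log in this ticket.
sample = (
    "2014:02:17 00:01:31 MainThread   troy.logger           : "
    "[INFO    ] cancel pilot    p.0001\n"
    "2014:02:17 00:01:31 MainThread   troy.logger           : "
    "[DEBUG   ] timed duration : cancel [] : 0.005488  sec\n"
)

pilots, durations = summarize_cancels(sample)
print(pilots, durations)
```

On the log above this reports a cancel for p.0001 that returned in roughly 5 ms. That only shows TROY issued the call and got control back quickly; whether sagapilot acted on it has to be confirmed on the sagapilot side or in the batch queue.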

Dr Matteo Turilli Department of Electrical and Computer Engineering Rutgers University

andre-merzky commented 10 years ago

Just a note that I see mixed results: sometimes pilots are canceled, sometimes they aren't. Still not sure if the fault is at the troy or SP level...