saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

devel_prod Pilots Do Not Seem to Terminate #96

Closed melrom closed 11 years ago

melrom commented 11 years ago

This bug applies to the devel-prod branch. At present, it needs further investigation. I noticed it first, and Vishal then confirmed it.

The behavior is as follows: a pilot starts up and then executes CUs. At the end of the scripts, the following cancel commands are issued:

compute_data_service.cancel()
pilot_compute_service.cancel()

However, the pilot still appears to be running in the queue until it hits the maximum walltime. This was first noticed using pbs+gsissh to Kraken, and then again using pbs://localhost on India.

We need to make sure the mechanism for shutting down Pilot Jobs after CUs are completed is working properly on this branch.
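One way to check this from a script rather than by eye is to poll the queue after the cancel calls and see whether the pilot's batch job actually leaves it. The sketch below is only an illustration and is not part of BigJob or saga-python: the helper names (job_state, wait_until_gone, run_qstat) are hypothetical, it assumes a PBS/Torque-style qstat on the PATH, and it assumes the default qstat layout where the job ID is the first column and the state letter is the second-to-last column. It treats a job as terminated when it is no longer listed or is listed with state C.

```python
import subprocess
import time


def job_state(qstat_output, job_id):
    """Return the state letter for job_id, or None if the job is not listed.

    Assumption: default PBS/Torque `qstat` layout, i.e. the job ID is the
    first column (possibly followed by ".host") and the state letter is the
    second-to-last column.
    """
    for line in qstat_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == job_id or fields[0].startswith(job_id + "."):
            return fields[-2]
    return None


def run_qstat():
    """Run plain `qstat` and return its stdout (assumes qstat is on PATH)."""
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=False)
    return result.stdout


def wait_until_gone(job_id, list_jobs=run_qstat, timeout=300.0,
                    interval=10.0, sleep=time.sleep):
    """Poll until the job is unlisted or completed ("C"), or until timeout.

    Returns True if the job terminated within `timeout` seconds, else False.
    `list_jobs` and `sleep` are injectable so the logic can be tested
    without a real batch system.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = job_state(list_jobs(), job_id)
        if state is None or state == "C":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Called right after the two cancel() calls with the pilot's batch job ID, wait_until_gone(...) returning False would indicate the behavior described above: the pilot is still queued or running well after cancellation was requested.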

oleweidner commented 11 years ago

I will have a look at this today on the saga-python end to make sure that cancel() is implemented properly.

oleweidner commented 11 years ago

I have tested job.cancel() extensively with the PBS adaptor (on India) and it seems to work just fine. However, I'm not sure whether jobs get canceled through SAGA or whether a termination signal is sent to the agent via Redis. Andre, can you clarify?

drelu commented 11 years ago

Both: cancel() is called and a stop signal is sent via Redis.

oleweidner commented 11 years ago

Melissa, now that I have checked the saga-python side, can you please investigate on the BigJob side?

oleweidner commented 11 years ago

@oleweidner: to test with pbs://localhost explicitly

melrom commented 11 years ago

Investigated pbs://localhost with saga-python master and devel-prod master. Pilots seem to terminate properly, as verified via qstat. Closing this ticket; will reopen if the behavior shows up again.