saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

Rogue Agents / pilot_job.cancel() vs. pilot_job_service.cancel() #131

Closed oleweidner closed 11 years ago

oleweidner commented 11 years ago

This script, https://github.com/saga-project/BigJob/blob/develop-prod/tests/test_connection_pooling.py, leaves a lot of zombie processes behind on repex2, where it is integrated with Jenkins:

  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 29897 /var/lib/jenkins/.saga/adaptors/shell_job//29897
  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//29897/cmd
  |       `-python -c...
  |           `-6*[{python}]
  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 29994 /var/lib/jenkins/.saga/adaptors/shell_job//29994
  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//29994/cmd
  |       `-python -c...
  |           `-6*[{python}]
  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 30085 /var/lib/jenkins/.saga/adaptors/shell_job//30085
  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//30085/cmd
  |       `-python -c...
  |           `-6*[{python}]
  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 30177 /var/lib/jenkins/.saga/adaptors/shell_job//30177
  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//30177/cmd
  |       `-python -c...
  |           `-6*[{python}]
  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 30283 /var/lib/jenkins/.saga/adaptors/shell_job//30283
  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//30283/cmd
  |       `-python -c...
  |           `-6*[{python}]
  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 30383 /var/lib/jenkins/.saga/adaptors/shell_job//30383
  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//30383/cmd
  |       `-python -c...
  |           `-6*[{python}]
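
To count leftover monitors from a tree listing like the one above, a small parser can help. This is only a sketch: the regular expression is derived from the lines shown here, the function name `leftover_monitor_ids` is made up for illustration, and the thread does not say what the numeric argument to monitor.sh means.

```python
import re

# Matches the numeric argument passed to monitor.sh in a pstree-style listing.
# Assumption: the path layout is the one shown in this thread; other setups may differ.
MONITOR_RE = re.compile(r"shell_job/+monitor\.sh\s+(\d+)\s")


def leftover_monitor_ids(tree_text):
    """Return the numeric monitor.sh arguments found in tree_text, in order."""
    return [int(m.group(1)) for m in MONITOR_RE.finditer(tree_text)]


# Two monitor entries copied from the listing above.
SAMPLE = (
    "  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 29897 "
    "/var/lib/jenkins/.saga/adaptors/shell_job//29897\n"
    "  |   `-sh /var/lib/jenkins/.saga/adaptors/shell_job//29897/cmd\n"
    "  |-sh /var/lib/jenkins/.saga/adaptors/shell_job//monitor.sh 29994 "
    "/var/lib/jenkins/.saga/adaptors/shell_job//29994\n"
)

if __name__ == "__main__":
    print(leftover_monitor_ids(SAMPLE))
```

Running it against the listing before and after a test run would show whether new monitors were left behind.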

As per Melissa's suggestion, I use pilot_service.cancel() at the end of the script, but this doesn't seem to cancel the individual pilots/agents. Do I need to cancel them individually, e.g.:

for i, pj in enumerate(pjs):
    print("cancel %3d" % i)
    pj.cancel()

I could obviously put this back into the code; however, I think that PJs should get canceled implicitly when you cancel their 'parent' service?
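
Until the implicit cancel is confirmed to work, a test script could cancel defensively. The sketch below is not BigJob code: `FakePilot`, `FakeService`, and `run_with_cleanup` are hypothetical stand-ins, and only the `cancel()` method name is taken from this thread. It shows one way to make sure every pilot and the service get canceled even when the test body raises.

```python
class FakePilot(object):
    """Hypothetical stand-in for a pilot job; only cancel() mirrors the thread."""

    def __init__(self):
        self.cancelled = False

    def cancel(self):
        self.cancelled = True


class FakeService(object):
    """Hypothetical stand-in for the pilot service."""

    def __init__(self):
        self.cancelled = False

    def cancel(self):
        self.cancelled = True


def run_with_cleanup(service, pilots, body):
    """Run body(), then cancel every pilot and the service, even if body() raises."""
    try:
        return body()
    finally:
        for i, pj in enumerate(pilots):
            try:
                pj.cancel()
            except Exception as exc:
                # Keep going so one failed cancel does not strand the other pilots.
                print("cancel %3d failed: %s" % (i, exc))
        service.cancel()
```

Canceling the pilots explicitly before the service is redundant if the service already cascades, but it should be harmless in a test that must not leave processes behind.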

oleweidner commented 11 years ago

This is related I think: https://github.com/saga-project/BigJob/issues/121

melrom commented 11 years ago

Hi Ole-

I might need help with this ticket. I looked in the code, and the docstring makes me think the PJs should be canceled too:

def cancel(self):
    """ Cancel the PilotComputeService.

        This also cancels all the PilotJobs that were under control of this PJS.

        Keyword arguments:
        None

        Return value:
        Result of operation
    """
    for i in self.pilot_computes:
        i.cancel()

i.cancel() then calls this on each PilotJob:

def cancel(self):
    """ Terminates the pilot """
    self.__bigjob.cancel()

And __bigjob.cancel() looks like it should do the trick:

def cancel(self):
    """ duck typing for cancel of saga.cpr.job and saga.job.job  """
    logger.debug("Cancel Pilot Job")
    try:
        if self.url.scheme.startswith("condor")==False:
            self.job.cancel()
        else:
            pass
            #logger.debug("Output files are being transfered to file: outpt.tar.gz. Please wait until transfer is complete.")
    except:
        pass
        #traceback.print_stack()

    logger.debug("Cancel Job Service")
    try:
        if  not self._pool.del_value (self.js) :
            del (self.js)
        self.js = None
    except:
        pass
        #traceback.print_stack()

    try:
        self._stop_pilot_job()
        logger.debug("delete pilot job: " + str(self.pilot_url))
        if _CLEANUP:
            self.coordination.delete_pilot(self.pilot_url)
        #os.remove(os.path.join("/tmp", "bootstrap-"+str(self.uuid)))
    except:
        pass
        #traceback.print_stack()
    logger.debug("Cancel Pilot Job finished")
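
One detail in the code quoted above: all three try blocks end in a bare `except: pass`, so any exception raised while canceling the job, releasing the job service, or stopping the pilot is discarded silently. Whether that is what happened in this case is not established in the thread. As a sketch of an alternative, the helper below (the name `cancel_steps` and the step labels are hypothetical, not part of BigJob) runs each cleanup step, logs failures with a traceback, and reports which steps failed.

```python
import logging

logger = logging.getLogger("cancel_sketch")


def cancel_steps(steps):
    """Run each (name, callable) cleanup step; log failures instead of hiding them.

    Returns the names of the steps that raised an exception.
    """
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("cleanup step %r failed", name)
            failed.append(name)
    return failed


if __name__ == "__main__":
    def broken():
        raise RuntimeError("could not reach job")

    # The second step still runs even though the first one fails.
    print(cancel_steps([("cancel job", broken), ("stop pilot", lambda: None)]))
```

With this pattern, a failed cancel would show up in the Jenkins log rather than only as a leftover process.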

andre-merzky commented 11 years ago

Is this ticket still valid? I thought we fixed this in BigJob?

oleweidner commented 11 years ago

I think this has been resolved. Did it pop up again?

andre-merzky commented 11 years ago

No - Melissa just stumbled over the ticket when checking something for Matteo, and we wondered why it was still open...

oleweidner commented 11 years ago

Then close it… ;-)

On Aug 27, 2013, at 22:29 , Andre Merzky notifications@github.com wrote:

No - Melissa just stumbled over the ticket when checking something for Matteo, and we wondered why it was still open...

andre-merzky commented 11 years ago

Bang bang!