rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0

Hooking job exit/failure #204

Closed NeoLegends closed 2 months ago

NeoLegends commented 2 months ago

Sometimes a job fails because of a node-local condition, e.g. a local disk running full on one cluster node. In such cases I wish I could detect the condition in the manager and reschedule the job on a different node without manual intervention.

It would be nice if sisyphus allowed hooking job failure/success externally through gs, similar to how worker_wrapper works. I could imagine something like:

```python
import os
from typing import Any, List, Optional

from sisyphus import tk


def task_exit(
    *,
    job: tk.Job,
    task_name: str,
    success: bool,
    engine_info: Any,  # last engine_info from submit_log
    usage: Any,  # usage from the usage log
    call: List[str],  # actual call after worker_wrapper
    # log: str,  # could also feed in the log output here? depending on the job, very memory-intensive...
) -> Optional[bool]:
    if success:
        return None

    with open(os.path.join(job.work_path(), f"log.{task_name}.1")) as log_file:
        for line in log_file:
            if "We cannot free enough space on /ssd" in line:
                # exclude the node whose disk ran full, then retry
                job.update_rqmt(
                    {"sbatch_args": [*job.rqmt.sbatch_args, "-x", usage["host"]]}
                )
                return True  # reschedule
```

albertz commented 2 months ago

I think this can be closed as a duplicate of https://github.com/rwth-i6/sisyphus/issues/179.