Sometimes a job fails due to e.g. a local disk running full on one cluster node. In such cases, I wish I were able to detect this condition in the manager and reschedule the job on a different node w/o manual intervention.
It would be nice if sisyphus allowed hooking into job failure/success externally through `gs`, in a similar fashion to how `worker_wrapper` works. I could imagine something like:
```python
import os
from typing import Any, List, Optional

from sisyphus import tk


def task_exit(
    *,
    job: tk.Job,
    task_name: str,
    success: bool,
    engine_info: Any,  # last engine_info from submit_log
    usage: Any,  # usage from usage log
    call: List[str],  # actual call after worker_wrapper
    # log: str,  # could also feed in the log output here? depending on the job, very memory intensive...
) -> Optional[bool]:
    if success:
        return
    with open(os.path.join(job.work_path(), f"log.{task_name}.1")) as log_file:
        for line in log_file:
            if "We cannot free enough space on /ssd" in line:
                job.update_rqmt(
                    {"sbatch_args": [*job.rqmt.sbatch_args, "-x", usage["host"]]}
                )
                return True  # reschedule
```
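For illustration, here is a rough sketch of how the manager side might dispatch such a hook. This is purely hypothetical: the hook name `task_exit`, its lookup via the global settings, and the `handle_task_exit` helper are my assumptions, not existing sisyphus API.

```python
# Hypothetical manager-side dispatch (not existing sisyphus code):
# after a task finishes, look up an optional task_exit hook in the
# global settings and reschedule the task if the hook returns True.
import sisyphus.global_settings as gs


def handle_task_exit(job, task_name, success, engine_info, usage, call):
    hook = getattr(gs, "task_exit", None)
    if hook is None:
        return
    reschedule = hook(
        job=job,
        task_name=task_name,
        success=success,
        engine_info=engine_info,
        usage=usage,
        call=call,
    )
    if reschedule and not success:
        ...  # clear the error state and resubmit the task
```

The hook would then live in `settings.py`, just like `worker_wrapper` does today.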