snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Memorize job nodes and exclude broken nodes #15

Open cmeesters opened 6 months ago

cmeesters commented 6 months ago

Motivation:

  1. When jobs are submitted on a cluster with persistent node-local storage, it might be advantageous to re-submit jobs to the very same node(s) to avoid the overhead of stage-ins or downloads. Note that this might be of limited use, as persistent node-local disk space is rare and only useful on clusters without much competition (otherwise other users will likely use the node in between and all scratch data will be deleted anyway).
  2. When jobs fail due to broken cluster nodes (which may be detected automatically, in which case the nodes in question are closed for submission on most clusters), re-scheduled jobs are likely to end up on those same nodes, because they will always be empty and able to accept new jobs, thereby creating a "black hole". The new feature can memorize the nodes of failed jobs and attempt to exclude those nodes.

Implementation:

Keep a persistent list of preferred nodes. Detect possibly broken nodes and remove them from the preferred list.

Submit to preferred nodes (optionally, as this might lead to longer wait times!). Exclude possibly broken nodes from submission. Report possibly broken nodes.

This will only work within the context of ONE workflow.
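
A minimal sketch of the bookkeeping described above, not the plugin's actual implementation. The class `NodeMemory`, its methods, and the JSON file layout are all hypothetical names chosen for illustration. The only SLURM-specific parts are the `sbatch` options `--exclude` and `--nodelist`; how node names of finished jobs are obtained (for instance from accounting data) is left open here.

```python
import json
from pathlib import Path


class NodeMemory:
    """Hypothetical per-workflow record of preferred and possibly broken nodes."""

    def __init__(self, path):
        # One file per workflow, matching the ONE-workflow restriction.
        self.path = Path(path)
        self.preferred = set()
        self.broken = set()
        if self.path.exists():
            data = json.loads(self.path.read_text())
            self.preferred = set(data.get("preferred", []))
            self.broken = set(data.get("broken", []))

    def _save(self):
        # Persist both lists so they survive a restart of the workflow.
        payload = {
            "preferred": sorted(self.preferred),
            "broken": sorted(self.broken),
        }
        self.path.write_text(json.dumps(payload))

    def record_success(self, nodes):
        # Nodes of successful jobs become preferred, unless already suspect.
        self.preferred.update(n for n in nodes if n not in self.broken)
        self._save()

    def record_failure(self, nodes):
        # Nodes of failed jobs are marked possibly broken and
        # erased from the preferred list.
        self.broken.update(nodes)
        self.preferred.difference_update(nodes)
        self._save()

    def sbatch_args(self, prefer=False):
        # Build extra sbatch arguments for the next submission.
        args = []
        if self.broken:
            args.append("--exclude=" + ",".join(sorted(self.broken)))
        if prefer and self.preferred:
            # --nodelist asks for ALL listed hosts, so this sketch
            # requests a single preferred node only (assumption:
            # single-node jobs; the choice of node is arbitrary).
            args.append("--nodelist=" + sorted(self.preferred)[0])
        return args
```

With a memory that has seen a success on `node01` and a failure on `node02` (made-up node names), `sbatch_args(prefer=True)` would yield `--exclude=node02` and `--nodelist=node01`. Preferring nodes is kept opt-in because, as noted above, it might lead to longer wait times.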

Idea by @johanneskoester and @cmeesters