snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Why do all snakemake-jobs sent to the same group fail if one of them fails? #110

Closed: bdelepine closed this issue 4 months ago

bdelepine commented 4 months ago

Hi,

I encountered a behavior that was non-intuitive to me while using the group directive, and I would like to understand the reason for it. Long story short, I observed that ALL snakemake-jobs sent to the same group fail if one of them fails, with the consequence that all of their outputs are removed by Snakemake. I would have expected only the outputs of the failing snakemake-jobs to be removed, just as if I were not using the group directive. I am not sure whether this is plugin-related, but I suspect it has to do with how snakemake-job success/failure status is defined with respect to the SLURM job success/failure status.

Basically, I have something like this (not tested):

rule quick:
    input:
        "result_long.txt"
    output:
        "result_quick.txt"
    group: "grouped"
    shell:
        """
        sleep 10   # something quick
        touch result_quick.txt
        false      # something unexpected failed!
        """

rule long:
    output:
        "result_long.txt"
    group: "grouped"
    shell:
        """
        sleep 360   # something long
        touch result_long.txt
        """

What happens:

  1. long is executed with success
  2. quick fails
  3. result_quick.txt and result_long.txt are removed

I would expect result_long.txt not to be removed. Is this expected behavior? Why?

Thank you

cmeesters commented 4 months ago

Basically, because a group job is a single Snakemake/SLURM job. If that job is aborted, there is no way to determine whether its outputs are corrupted or not. Usually there is some kind of dependency between the individual rules, so there is no way to decide which outputs are safe to keep. Also, a (successful) retry might re-execute a feeding rule, which yields altered input for a consuming rule; in that case the old outputs should not be kept either.

I hope these scenarios are understandable. Besides, changing that behaviour would be pretty challenging.
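
To make the dependency scenario concrete, here is a minimal, hypothetical sketch (the rule and file names are invented for illustration) with two grouped rules where one feeds the other; both run inside a single group job, i.e. a single SLURM job:

rule feed:
    output:
        "intermediate.txt"
    group: "grouped"
    shell:
        "echo data > {output}"

rule consume:
    input:
        "intermediate.txt"
    output:
        "final.txt"
    group: "grouped"
    shell:
        "cat {input} > {output}"

If this group job is aborted and later retried, feed may run again and produce a different intermediate.txt. Keeping a final.txt from the failed attempt would then silently pair it with stale input, which is why Snakemake discards all outputs of the failed group job.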

bdelepine commented 4 months ago

Hi, thank you for your answer. Argh... I thought there might be some way to access the status of the individual snakemake-jobs/rules, so that their outputs could be rescued if they completed before their SLURM job failed (due to another snakemake-job/rule). I understand that this is not the case 😞 Thanks again!