snakemake / snakemake

This is the development home of the workflow management system Snakemake. For general information, see
https://snakemake.github.io

slurm executor not printing logs on fail #2911

Open pdimens opened 2 weeks ago

pdimens commented 2 weeks ago

Snakemake version 8.13

Describe the bug I'm using the slurm executor plugin to have Snakemake submit jobs to the cluster on my behalf. In addition, I'm using the fs storage plugin to handle copying files to and retrieving them from the high-speed scratch directory we are guided to use on our HPC. The profile looks like this:

__use_yte__: true
executor: slurm
default-resources:
  slurm_account: xxxxxx
  slurm_partition: regular
  mem_mb: attempt * 2000
  runtime: 10
jobs: 4
latency-wait: 60
retries: 1
default-storage-provider: fs
local-storage-prefix: /home2/xxxxxx
shared-fs-usage:
  - persistence
  - software-deployment
  - sources
  - source-cache

And submitting to the cluster on the login node:

snakemake --snakefile workflow.smk --cores 10 --workflow-profile profile/

The workflow creates the DAG and copies files as expected; however, the jobs tend to fail with STDERR logs like the following (account names anonymized):

Error in rule genome_faidx:
    message: SLURM-job '1607095' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 10
    input: Genome/genome.fasta.gz (retrieve from storage)
    output: Genome/genome.fasta.gz.fai (send to storage), Genome/genome.fasta.gz.gzi (send to storage)
    log: Genome/genome.fasta.gz.faidx.log (send to storage), /fs/XXXXXX/workdir/XXXXX/harpy/.snakemake/slurm_logs/rule_genome_faidx/1607095.log (check log file(s) for error details)
    shell:

        if [ "True" = "True" ]; then
            samtools faidx --gzi-idx /home2/XXXXX/fs/Genome/genome.fasta.gz.gzi --fai-idx /home2/XXXXXX/fs/Genome/genome.fasta.gz.fai /home2/XXXXX/fs/Genome/genome.fasta.gz 2> /home2/XXXXX/fs/Genome/genome.fasta.gz.faidx.log
        else
            samtools faidx --fai-idx /home2/XXXXX/fs/Genome/genome.fasta.gz.fai /home2/XXXXX/fs/Genome/genome.fasta.gz 2> /home2/XXXXX/fs/Genome/genome.fasta.gz.faidx.log
        fi

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 1607095

However, the SLURM log that would reveal troubleshooting information, .snakemake/slurm_logs/rule_genome_faidx/1607095.log, is not present; it only gets written after a job completes successfully. This has been consistent across several workflows and different rules.

cmeesters commented 1 week ago

I am unable to reproduce this error. Can you check the node status in case of failing jobs, please? sacct -o nodelist,state will indicate the node(s) of failed jobs, and scontrol show node <some node of a failed job> | grep State gives the node state.
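
For example, something like this (a sketch; substitute the id and node of one of your failed jobs):

sacct -j <failed jobid> -X -o jobid,nodelist,state
scontrol show node <node from the sacct output> | grep State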

pdimens commented 1 week ago

Sure, thanks for the helpful code to get some diagnostic information:

> scontrol show node cbsubscb02 | grep State
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
pdimens commented 1 week ago

Here's another such log:

Error in rule qc_fastp:
    message: SLURM-job '1618942' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 1
    input: /fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.F.fq.gz (retrieve from storage), /fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.R.fq.gz (retrieve from storage)
    output: QC/sample1.R1.fq.gz (send to storage), QC/sample1.R2.fq.gz (send to storage), QC/reports/data/fastp/sample1.fastp.json (send to storage)
    log: QC/reports/sample1.html (send to storage), QC/logs/fastp/sample1.log (send to storage), /fs/cbsunt246/workdir/pdimens/harpy/.snakemake/slurm_logs/rule_qc_fastp/sample1/1618942.log (check log file(s) for error details)
    conda-env: /fs/cbsunt246/workdir/pdimens/harpy/.snakemake/conda/3868729ed4cdf57e23c7b2c2632ebf5c_
    shell:

        fastp --trim_poly_g --cut_right --length_required 30 --max_len1 150 --detect_adapter_for_pe  --thread 4 -i /home2/pd348/fs/fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.F.fq.gz -I /home2/pd348/fs/fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.R.fq.gz -o /home2/pd348/fs/QC/sample1.R1.fq.gz -O /home2/pd348/fs/QC/sample1.R2.fq.gz -h /home2/pd348/fs/QC/reports/sample1.html -j /home2/pd348/fs/QC/reports/data/fastp/sample1.fastp.json -R "sample1 QC Report" 2> /home2/pd348/fs/QC/logs/fastp/sample1.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 1618942

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-06-13T152236.344429.snakemake.log
WorkflowError:
At least one job did not complete successfully.

> tree .snakemake/slurm_logs/
.snakemake/slurm_logs/
└── rule_qc_fastp
    ├── sample1
    └── sample2

Meanwhile, the NAS directory configured in the SLURM profile has this structure, and the files were never copied over:

> tree /home2/pd348
/home2/pd348
└── fs
cmeesters commented 1 week ago

My idea was that failing nodes might not be able to feed back log information. Now, by NAS do you mean NFS? Regardless, the log should eventually be written. What puzzles me is the distinction between failed and completed jobs: if this were a cluster error, it would affect every job, not just the failed ones.

Can you confirm, with a minimal example that always fails, that the job log is written without the fs plugin (you can just comment out the respective lines in your profile)?

pdimens commented 1 week ago

Sure, I can try that. Just to be clear, should the config look like this?

executor: slurm
default-resources:
  slurm_account: nt246_0001
  slurm_partition: short
  mem_mb: attempt * 2000
  runtime: 10
jobs: 4
latency-wait: 60
retries: 0
#default-storage-provider: fs
#local-storage-prefix: /home2/pd348
#shared-fs-usage:
#  - persistence
#  - software-deployment
#  - sources
#  - source-cache
pdimens commented 1 week ago

I can confirm no logs were written:

.snakemake/slurm_logs
└── rule_qc_fastp
    ├── sample1
    └── sample2
cmeesters commented 1 week ago

Ok. Now, in theory (notwithstanding any undiscovered side effect), Snakemake has no influence on the writing of SLURM logs. All it does is redirect them to a certain path, just as any user can do with #SBATCH -o/--output. When log files are not written out, that is usually indicative of some sort of cluster or file system failure.

When I asked you to check, I wanted to make sure that there is no side effect (although the SLURM logs should not be affected by the fs plugin). The SLURM plugin itself does not touch the SLURM log files beyond giving them a path and a name.
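
For comparison, redirecting a job's log by hand is roughly what the plugin sets up; a sketch with illustrative paths and placeholder account/partition, not the plugin's exact invocation:

# create the log directory, then let SLURM write the combined stdout/stderr there (%j expands to the job id)
mkdir -p .snakemake/slurm_logs/rule_genome_faidx
sbatch --account=<your account> --partition=regular \
       --output=.snakemake/slurm_logs/rule_genome_faidx/%j.log \
       --wrap 'echo "hello from $(hostname)"'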

Now, I am out of ideas. But it's late for me ... Tomorrow is another day.

BTW the Snakemake log is far more informative about Snakemake errors than the SLURM log.
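
For instance, to open the most recent Snakemake log (a sketch assuming the default .snakemake/log location shown above):

less "$(ls -t .snakemake/log/*.snakemake.log | head -n 1)"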

pdimens commented 1 week ago

Perhaps this question will be helpful: does the slurm executor plugin perform the file staging (copying to and from the storage prefix) on the node the scheduler assigns to a job, or on the login node before the job begins on the assigned node? If it's the latter, then it seems like everything should work, where the process would be (parentheses indicate which node performs the task):

snakemake(login) => copy to local-prefix (login) => run job (assigned node) => copy to original directory (login)

However, if, as in my case, your data is on a local storage disk that is not visible to all compute nodes, this process would likely fail:

snakemake(login) => copy to local-prefix (assigned node) => run job (assigned node) => copy to original dir (assigned node)

Given that this is my setup, I want to know whether that could be a factor in the behavior I'm observing.

cmeesters commented 1 week ago

The SLURM output never gets written locally. stdout and stderr are buffered and written out by SLURM itself, and this has no relation to the actual job input or output, which you can stage in or out. To check this statement, just run tail -f <logfile> on a job that produces constant output.
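
A sketch of such a check, with placeholder account/partition and a throwaway test job:

# submit a job that prints every second, then watch SLURM stream its output into the log
jobid=$(sbatch --parsable -A <your account> -p regular -o streaming_%j.log \
        --wrap 'for i in $(seq 60); do echo "tick $i"; sleep 1; done')
tail -F "streaming_${jobid}.log"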

Snakemake essentially submits itself. Hence, what you see in the SLURM log is the output of exactly one Snakemake job (except in the group-job case, where it is more). You should observe the same for completed jobs. Now, in your case, SLURM either stops writing upon failure, or it never collects the streams in the first place (by using just -o we let SLURM combine stdout and stderr, so you see the output along with all warnings and error messages).

What happens if you run sbatch error.sh with error.sh being

#!/bin/bash
#SBATCH -A <your account>
#SBATCH -p regular
#SBATCH -o .snakemake/slurm_logs/test/fail1/my_error.log

# note: the directory for #SBATCH -o must already exist when the job starts;
# creating it inside the script is too late for this job's own log
mkdir -p .snakemake/slurm_logs/test/fail1
srun cho "cho is no command, it should be 'echo'"

?
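
One way to run this test and check the outcome (a sketch; create the log directory before submitting, and substitute the job id that sbatch prints):

mkdir -p .snakemake/slurm_logs/test/fail1
sbatch error.sh
sacct -j <jobid> -o jobid,state,exitcode            # should show FAILED once the job has run
cat .snakemake/slurm_logs/test/fail1/my_error.log   # the error from the bogus 'cho' command should appear here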

Hocnonsense commented 1 day ago

Is it possible that the directory for the SLURM stdout/stderr file does not exist? On the HPC I'm using, if a SLURM job tries to write its log file to a directory that has not been created yet, the job fails with no output at all.
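
A sketch of how one might reproduce this behaviour, with placeholder account/partition and an obviously non-existent log directory:

# point -o at a directory that does not exist; on clusters behaving as described above,
# the job fails (or is lost) and the log file is never written
sbatch -A <your account> -p regular \
       -o /path/that/does/not/exist/test.log \
       --wrap 'echo "you will never see this line"'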

The slurm executor plugin should take this into account.

(This reply may be too late; did you find the real reason?)

pdimens commented 1 day ago

Thanks for continuing this thread. I've had to shift to other tasks for the time being but hope to return to this later this week or next week to continue figuring out what's going on.

The SLURM log directory should be .snakemake/slurm_logs, as I understand it, and it does exist.

Hocnonsense commented 1 day ago

You can use scontrol show job 1607095 to view the state of job 1607095. If the path to stdout and stderr exists, then the failure may be caused by an inconsistency in the SLURM parameters.
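
For example (a sketch; scontrol only knows about jobs that are still queued or finished very recently, so fall back to sacct for older jobs):

scontrol show job 1607095 | grep -E 'JobState|StdOut|StdErr'
sacct -j 1607095 -o jobid,state,exitcode,nodelist    # fallback for jobs scontrol no longer knows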

cmeesters commented 23 hours ago

On a side note, so that it is not forgotten: the plugin code shows in line 88 (as of version 0.6.1; a later version might have this in a different line) that the log directory is always created.

I would still like you to run this test job. Note that a very slow file system might lead to inconsistencies: if the login node creates that directory but this information has not yet reached a compute node, the job might fail silently. An alternative explanation might be a missing mount point.
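
A quick way to probe both explanations from the workflow's working directory (a sketch; placeholder account/partition):

# is the log directory visible and writable from a compute node, or is the mount missing there?
srun -A <your account> -p regular -N1 \
     bash -c 'hostname; ls -ld "$PWD/.snakemake/slurm_logs"; touch "$PWD/.snakemake/slurm_logs/visibility_test"'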