I am unable to reproduce this error. Can you check the node status in case of failing jobs, please? sacct -o nodelist,state will indicate the node(s) of failed jobs, and scontrol show node <some node of a failed job> | grep State gives the state of a node.
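For example, something along these lines should narrow it down (the time window and node name below are placeholders; adjust as needed):
# list today's jobs with their nodes and final states, keeping only the failed ones
sacct --starttime today -o jobid,jobname,nodelist,state | grep -i fail
# then check the state of one of the reported nodes
scontrol show node <node from the sacct output> | grep State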
Sure, thanks for the helpful code to get some diagnostic information:
> scontrol show node cbsubscb02 | grep State
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Here's another such log:
Error in rule qc_fastp:
message: SLURM-job '1618942' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 1
input: /fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.F.fq.gz (retrieve from storage), /fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.R.fq.gz (retrieve from storage)
output: QC/sample1.R1.fq.gz (send to storage), QC/sample1.R2.fq.gz (send to storage), QC/reports/data/fastp/sample1.fastp.json (send to storage)
log: QC/reports/sample1.html (send to storage), QC/logs/fastp/sample1.log (send to storage), /fs/cbsunt246/workdir/pdimens/harpy/.snakemake/slurm_logs/rule_qc_fastp/sample1/1618942.log (check log file(s) for error details)
conda-env: /fs/cbsunt246/workdir/pdimens/harpy/.snakemake/conda/3868729ed4cdf57e23c7b2c2632ebf5c_
shell:
fastp --trim_poly_g --cut_right --length_required 30 --max_len1 150 --detect_adapter_for_pe --thread 4 -i /home2/pd348/fs/fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.F.fq.gz -I /home2/pd348/fs/fs/cbsunt246/workdir/pdimens/harpy/test/fastq/sample1.R.fq.gz -o /home2/pd348/fs/QC/sample1.R1.fq.gz -O /home2/pd348/fs/QC/sample1.R2.fq.gz -h /home2/pd348/fs/QC/reports/sample1.html -j /home2/pd348/fs/QC/reports/data/fastp/sample1.fastp.json -R "sample1 QC Report" 2> /home2/pd348/fs/QC/logs/fastp/sample1.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: 1618942
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-06-13T152236.344429.snakemake.log
WorkflowError:
At least one job did not complete successfully.
> tree .snakemake/slurm_logs/
.snakemake/slurm_logs/
└── rule_qc_fastp
├── sample1
└── sample2
Meanwhile, the NAS directory configured in the SLURM profile has this structure, and the files never got copied over:
> tree /home2/pd348
/home2/pd348
└── fs
My idea was that failing nodes would not be able to feed back log information. Now, by NAS do you mean NFS? Regardless, the log should be written eventually. What puzzles me is the distinction between failed and completed jobs: if such an error occurs, it is most likely a cluster error, and that would occur consistently.
Can you confirm, with a minimal example that always fails, that the job log is written without the fs plugin (you can just comment out the respective lines in your profile)?
Sure, I can try that. Just to be clear, the config will look like this?
executor: slurm
default-resources:
  slurm_account: nt246_0001
  slurm_partition: short
  mem_mb: attempt * 2000
  runtime: 10
jobs: 4
latency-wait: 60
retries: 0
#default-storage-provider: fs
#local-storage-prefix: /home2/pd348
#shared-fs-usage:
#  - persistence
#  - software-deployment
#  - sources
#  - source-cache
I can confirm no logs were written:
.snakemake/slurm_logs
└── rule_qc_fastp
├── sample1
└── sample2
Ok. Now, in theory - notwithstanding any undiscovered side effect - Snakemake has no influence on the writing of SLURM logs. All it does is redirect them to a certain path, just as any user can do with #SBATCH -o/--output. When logfiles are not written out, it is usually indicative of some sort of cluster or file system failure.
When I asked you to check, I wanted to make sure that there is no side effect (although the SLURM logs are not affected by the fs plugin). The SLURM plugin itself does not tinker with SLURM logfiles beyond giving them a path and name.
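To illustrate, this is roughly what the redirection amounts to; it is not the plugin's literal command line, and the account, partition, and wrapped command below are placeholders:
sbatch \
  --account=<your account> \
  --partition=<your partition> \
  --output=.snakemake/slurm_logs/rule_qc_fastp/sample1/%j.log \
  --wrap="<the single-job snakemake invocation>"
SLURM itself, not Snakemake, then creates and fills that file, which is why a missing log usually points at the cluster or the filesystem.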
Now, I am out of ideas. But it's late for me ... Tomorrow is another day.
BTW the Snakemake log is far more informative about Snakemake errors than the SLURM log.
Perhaps this question will be helpful: does the slurm executor plugin perform the file copying on the node the scheduler designates for a job, or on the login node before the job begins on the assigned node? If it's the latter, then it seems like everything should work, and the process would be (parentheses indicate which node performs the task):
snakemake(login) => copy to local-prefix (login) => run job (assigned node) => copy to original directory (login)
However, if (like me) your data is on a local storage disk that is not visible to all compute nodes, this process would likely fail:
snakemake(login) => copy to local-prefix (assigned node) => run job (assigned node) => copy to original dir (assigned node)
Given that this is my setup, I want to make sure: is that at all a consideration for the behavior I'm observing?
The SLURM output never gets written locally. It is buffered for stdout and stderr and then written out. It has no relation to the actual job input or output, which you can stage in or out. Just run tail -f <logfile> on a job that produces constant output to check this statement.
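For instance, while one of your jobs is still running (the job ID is a placeholder):
tail -f .snakemake/slurm_logs/rule_qc_fastp/sample1/<jobid>.log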
Snakemake essentially submits itself. Hence, the SLURM batch part of your job is the output you will see for exactly one Snakemake job (except in the group job case, where it is more). This is what you should observe for completed jobs, too. Now, in your case, SLURM either stops writing upon failure, or it does not collect the streams in the first place (by just using -o we let SLURM combine stdout and stderr, so you see the output along with all warnings and error messages).
What happens if you run sbatch error.sh, with error.sh being the script below?
#!/bin/bash
#SBATCH -A <your account>
#SBATCH -p regular
#SBATCH -o .snakemake/slurm_logs/test/fail1/my_error.log
mkdir -p .snakemake/slurm_logs/test/fail1
srun cho "cho is no command, it should be 'echo'"
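After submitting, something like this should show whether the failed test job left a log behind (the job ID is whatever sbatch prints):
sbatch error.sh
# once the job has finished or failed:
sacct -j <jobid> -o jobid,state,exitcode
ls -l .snakemake/slurm_logs/test/fail1/
cat .snakemake/slurm_logs/test/fail1/my_error.log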
Is it possible that the directory for the SLURM stdout and stderr files does not exist? On the HPC I'm using, if a SLURM job tries to write its log file to a directory that has not been created yet, it fails with no output at all. The slurm executor plugin should account for this. (This reply may be too late; have you found the real reason?)
Thanks for continuing this thread. I've had to shift to other tasks for the time being but hope to return to this later this week or next week to continue figuring out what's going on.
The SLURM log directory should be .snakemake/slurm_logs, as I understand it, and that directory does exist.
You can use scontrol show job 1607095 to view the state of job 1607095. If the path for stdout and stderr exists, then the failure may be caused by an inconsistency in the SLURM parameters.
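The StdOut/StdErr fields in that output show exactly where SLURM intends to write the log, e.g.:
scontrol show job 1607095 | grep -E 'JobState|StdOut|StdErr'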
On a side note, just so it is not forgotten: the plugin code ensures at line 88 (as of version 0.6.1; a later version might have this at a different line) that the log directory is always created.
I would still like you to run this test job. Note that a very slow filesystem might lead to inconsistencies: if the login node creates that directory, but the information has not yet propagated to a compute node, the job might fail silently. An alternative explanation might be a missing mount point.
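A quick way to check that hypothesis, using the account and partition from your profile, might be:
# create the directory on the login node, then try to list it from a compute node
mkdir -p .snakemake/slurm_logs/test
srun -A nt246_0001 -p short ls -ld "$PWD/.snakemake/slurm_logs/test"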
Snakemake version 8.13
Describe the bug
I'm using the slurm executor plugin to have Snakemake submit jobs to the cluster on my behalf. In addition, I'm also using the fs plugin to handle copying and retrieving files from the high-speed scratch directory we are guided to use on our HPC. The profile is the one quoted earlier in this thread (with the storage-provider lines uncommented), and I submit to the cluster from the login node.
The workflow will create the DAG and copy files as expected; however, the jobs tend to fail with STDERR logs like the one quoted above (account name anonymized).
However, the SLURM log that would contain troubleshooting information, .snakemake/slurm_logs/rule_genome_faidx/1607095.log, is not present; it only gets written after a job completes successfully. This has been consistent across several workflows and different rules.