snakemake / snakemake

This is the development home of the workflow management system Snakemake. For general information, see
https://snakemake.github.io
MIT License

Waiting for files with SLURM, but file exists #775

Open eHirchaud opened 3 years ago

eHirchaud commented 3 years ago

Snakemake version

Describe the bug The job succeeds under SLURM, but Snakemake waits for the output file for a very long time and then fails. We raised the --latency-wait option up to 420 seconds. We can reproduce this bug on two different clusters and don't understand why it takes so long.

Logs Error (but files exists):

MissingOutputException in line 37 of ***/Snakefile_test:
Job Missing files after 420 seconds:
fastq_clean/20P007577/20P007577_087_SE.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 1 completed successfully, but some output files are missing. 1
  File "/nfs/conda/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 814, in handle_job_success
  File "/nfs/conda/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 247, in handle_job_success

Without --cluster-status ./status.py, Snakemake finishes fine, but the elapsed time is about 1h30! Running the same workflow locally takes about 1m30...

   "files": {
        "../../fastq_clean/20P007577/20P007577_087_SE.fastq.gz": {
            "start-time": "Fri Nov 20 14:46:06 2020",
            "stop-time": "Fri Nov 20 14:46:44 2020",
            "duration": 38.27743935585022,
            "priority": 0,
            "resources": {
                "_cores": 2,
                "_nodes": 1
            }
        },
        "fastq_clean/20P007577/20P007577_087_SE.fastq.gz": {
            "start-time": "Mon Nov 23 16:55:57 2020",
            "stop-time": "Mon Nov 23 17:25:58 2020",
            "duration": 1801.778554201126,
            "priority": 0,
            "resources": {
                "_cores": 20,
                "_nodes": 1
            }
        }

Minimal example

snakemake -s Snakefile_test --use-conda --cluster "sbatch -A plucas -p bioinfo --cpus-per-task=1 --parsable" --stats stats_saturn.json --latency-wait 420 --cluster-status ./status.py -p -j 2

The Snakefile contains one rule that runs trimmomatic on a fastq.gz file.
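For context, the --cluster-status script at the center of this report follows Snakemake's cluster-status protocol: Snakemake invokes the script with the SLURM job ID as its only argument and expects one of "success", "failed", or "running" on stdout. A minimal sketch, assuming a typical sacct setup (the exact state list may need adjusting for your site):

```python
#!/usr/bin/env python3
# Hedged sketch of a --cluster-status script for SLURM; the sacct
# invocation and RUNNING_STATES set are assumptions, not the OP's script.
import subprocess
import sys

RUNNING_STATES = {"RUNNING", "PENDING", "COMPLETING", "CONFIGURING",
                  "SUSPENDED", "REQUEUED", ""}

def slurm_state_to_status(state: str) -> str:
    """Map an sacct State string to the word Snakemake expects."""
    state = state.split(" ")[0]  # strip suffixes like "CANCELLED by 1001"
    if state.startswith("COMPLETED"):
        return "success"
    if state in RUNNING_STATES:
        return "running"
    return "failed"

if __name__ == "__main__":
    jobid = sys.argv[1]
    out = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True,
    ).stdout
    lines = out.splitlines()
    print(slurm_state_to_status(lines[0].strip() if lines else ""))
```

Note that even a correct status script cannot help here: it only tells Snakemake when the job finished, after which Snakemake still has to see the output file appear through the (possibly stale) NFS client cache.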

Additional context

kevmu commented 3 years ago

Have the exact same issue: NFS and SLURM, Snakemake failing with the latency error. Is there a fix for this?

rargelaguet commented 3 years ago

I have the same exact issue...

snajder-r commented 2 years ago

Same problem here. Also NFS and SLURM. For me it's not even the output of a rule, but the very first input files in the pipeline.

snajder-r commented 2 years ago

Ok, it looks like my issue was actually #1527 which was fixed a week ago.

epruesse commented 10 months ago

@eHirchaud You are setting actimeo to 1800, i.e. 30 minutes, and you did not change lookupcache from its default of all, so no NFS lookups for the missing files will happen until the cached attributes expire.

You should set --latency-wait to the same value as actimeo, or a bit higher to be safe. Conversely, set actimeo only as long as you are willing to wait for a failed job; I'd suggest no more than 30 seconds.
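For anyone who cannot change the mount options, the lookupcache effect described above can sometimes be worked around by re-reading the parent directory while polling. This is a hypothetical sketch (wait_for_file is not part of Snakemake), and whether the directory re-listing actually revalidates the cache depends on the NFS client and kernel:

```python
import os
import time

def wait_for_file(path, timeout=60, poll=5):
    """Poll for a file, re-listing the parent directory on each attempt.

    On NFS with lookupcache=all, a plain os.path.exists() can keep
    returning False until the negative dentry cache expires; re-reading
    the directory may prompt the client to revalidate sooner (this is
    client/kernel-dependent, not guaranteed behavior)."""
    deadline = time.time() + timeout
    parent = os.path.dirname(path) or "."
    while time.time() < deadline:
        os.listdir(parent)  # hint to the NFS client to revalidate the dir
        if os.path.exists(path):
            return True
        time.sleep(poll)
    return False
```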

epruesse commented 10 months ago

@johanneskoester You can close this. The OP's issue was NFS client settings. The only remaining improvement would be to make the exception message more verbose, short of querying the kernel for NFS timeout settings and setting the --latency-wait default accordingly.
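For reference, the NFS client settings in question can be inspected, and hypothetically adjusted, like this (the /nfs mount point is an assumption for illustration):

```shell
# Show effective NFS client options (look for actimeo/acreg*/acdir* and lookupcache)
nfsstat -m
# or, without nfs-utils installed:
grep ' nfs' /proc/mounts

# Hypothetical remount shortening the attribute cache on /nfs:
# sudo mount -o remount,actimeo=30,lookupcache=positive /nfs
```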