nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Resubmitted job with '-resume' and process.cache = 'true' does not always cache all completed jobs; it restarts fully or partially #1629

Closed: justinjj24 closed this issue 3 years ago

justinjj24 commented 4 years ago

Bug report!

It seems a Nextflow job resubmitted with '-resume' does not always cache all the completed jobs: sometimes it restarts all over again, or it caches only a certain number of tasks, even with the default setting process.cache = 'true'. This behavior is noticed frequently, with or without updating the memory or wall time for the failed task, and with no other changes made to the params/config or to the Nextflow/Java versions.
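For context, the caching mode in question is set in nextflow.config; a minimal sketch of the documented values (the commented alternatives are the ones discussed later in this thread):

```groovy
// nextflow.config -- a minimal sketch of the cache setting.
// 'true' is the default: inputs are hashed by path, size, and
// last-modified timestamp.
process.cache = true

// Alternatives discussed below in this thread:
// process.cache = 'lenient'  // hash by file path and size only
// process.cache = 'deep'     // hash the actual file content
```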

Environment

* Nextflow version: version 20.01.0 build 5264
created 12-02-2020 10:14 UTC (18:14 SGST)
cite doi:10.1038/nbt.3820
http://nextflow.io

* Java version: openjdk version "1.8.0_92"
OpenJDK Runtime Environment (Zulu 8.15.0.1-linux64) (build 1.8.0_92-b15)
OpenJDK 64-Bit Server VM (Zulu 8.15.0.1-linux64) (build 25.92-b15, mixed mode)

* Operating system: PBS scheduler

Thanks in advance for your help and input. Justin

huguesfontenelle commented 4 years ago

I'm having a similar problem right now, which is why I came here... But then I remembered a couple of blog posts that I read a while ago:

jvivian-atreca commented 3 years ago

I am running into this issue now on a local run. I have cache set to lenient, and if I Ctrl+C the run and rerun it with -resume with NO changes to input files or code, the pipeline will cache anywhere between 4 and several thousand jobs. I have to do this several times until it catches up to the correct place. This is on an EC2 machine with a mounted EBS volume, so I'm not sure what would cause this.

First run shows 2 cached jobs for mmseqs2_filtering:search:

[screenshot of run output]

Ctrl+C and run again, and it caches several thousand jobs. Again, no changes were made between executions of the same command.

[screenshot of run output]

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

chartl-rancho commented 3 years ago

Dropping a line to re-open. Just experienced this running on a server with a fixed weekly restart that kills the NF pipeline.

The first run appears to do fine with a --restart, but it then fails to cache any future work as the pipeline progresses.

The 2nd, 3rd (etc.) restarts all roll back to the one-week mark of progress (or thereabouts).

ggavelis commented 2 years ago

Has anyone found a workaround?

Nextflow caches only about 7% of the outputs from my third process (and caches even fewer from processes downstream). This isn't fixed by using the cache 'lenient' or 'deep' directives. (nextflow version 21.04.3.5560).

I also tried the -dump-hashes option suggested by the resume_troubleshooting_tips, but the output was cryptic (to me, at least).
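For anyone else stuck at the same point, one way to make the -dump-hashes output actionable is to capture two runs in separate logs and diff the hash lines; a sketch, where main.nf is a placeholder and the grep pattern is an assumption about the log format that may need adjusting to your Nextflow version:

```bash
# Keep a separate log file per run
nextflow -log run1.log run main.nf -dump-hashes
nextflow -log run2.log run main.nf -resume -dump-hashes

# Tasks whose hash lines differ between the two logs are the ones
# that will not resume; the differing component (script, inputs,
# environment) is the culprit.
grep 'cache hash' run1.log | sort > h1.txt
grep 'cache hash' run2.log | sort > h2.txt
diff h1.txt h2.txt
```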

philippbayer commented 2 years ago

I'm also encountering this with Nextflow v21.04.3 and the nf-core/mag pipeline revision e065754c46.

I think this is due to the Lustre file system. Setting cache = 'lenient' in my nextflow.config helped with this issue somewhat, but when one of the last Nextflow jobs crashes, running -resume still reruns from 1000/1200 jobs, not from 1199/1200. I have a feeling that giving the exact run name to resume helps a bit (-resume whatever_the_last_job_was_named), but the rerun job numbers are so random that it's hard to tell.
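On the exact-name point: previous runs and their session IDs can be listed rather than guessed; a sketch, where the run name shown is hypothetical:

```bash
# List previous runs with their names, timestamps, and session IDs
nextflow log

# Resume a specific earlier run by its name (or session ID)
nextflow run main.nf -resume deadly_einstein
```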

I have also set scratch = '/tmp' for the BUSCO job, as it was generating too many files for the per-user one-million-files limit. That could cause some kind of issue, but the resulting files are definitely generated and are in the results/ folder.
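For reference, a per-process scratch setting like the one described can be scoped in nextflow.config; a sketch, assuming the process is named BUSCO:

```groovy
// nextflow.config -- a sketch; the selector assumes a process named BUSCO
process {
    withName: 'BUSCO' {
        scratch = '/tmp'   // run the task in node-local /tmp, copy results back
    }
}
```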

It's a tricky problem that is probably caused by many different issues.

chartl-rancho commented 2 years ago

My specific issue was with the order of channels not being preserved. For instance, x = Channel.fromPath('foo.csv').splitCsv(header: true) keeps the line ordering of foo.csv, but y = some_process(x.map{ it[2] }) does not necessarily maintain that order.

I found that keeping a joinable key in all outputs and using .join to maintain order between runs ensures that --resume works as designed.
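To make that concrete, a minimal sketch of the keyed pattern (the process name and CSV columns are illustrative):

```groovy
// main.nf (DSL2) -- a sketch of carrying a join key through a process
nextflow.enable.dsl = 2

process some_process {
    input:
    tuple val(key), val(value)

    output:
    tuple val(key), stdout

    script:
    """
    echo -n '${value}' | tr 'a-z' 'A-Z'
    """
}

workflow {
    // Keep each row's 'id' column as a join key; 'id' and 'sample'
    // are hypothetical column names in foo.csv
    rows = Channel.fromPath('foo.csv')
                  .splitCsv(header: true)
                  .map { row -> tuple(row.id, row.sample) }

    results = some_process(rows)

    // join() re-pairs each result with its original row by key,
    // so correctness no longer depends on channel ordering
    rows.join(results).view()
}
```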