nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Resubmitted job with '-resume' and process.cache = 'true' does not always cache all completed jobs; it restarts fully or partially #1629

Closed: justinjj24 closed this issue 3 years ago

justinjj24 commented 4 years ago

Bug report!

It seems a Nextflow job resubmitted with '-resume' does not always cache all the completed jobs: sometimes it restarts all over again, or it caches only a certain number of tasks, even with the default setting process.cache = 'true'. This behavior is noticed frequently, with or without updating the memory or wall time for the failed task, and with no other changes made to the params/config or to the Nextflow/Java versions.
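For context, the caching mode in question is set in nextflow.config; a minimal sketch of the documented values (the commented alternatives are the ones discussed later in this thread):

```groovy
// nextflow.config -- a minimal sketch of the cache setting.
// 'true' is the default: inputs are hashed by path, size, and
// last-modified timestamp.
process.cache = true

// Alternatives discussed below in this thread:
// process.cache = 'lenient'  // hash by file path and size only
// process.cache = 'deep'     // hash the actual file content
```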

Environment

* Nextflow version: version 20.01.0 build 5264
created 12-02-2020 10:14 UTC (18:14 SGST)
cite doi:10.1038/nbt.3820
http://nextflow.io

* Java version: openjdk version "1.8.0_92"
OpenJDK Runtime Environment (Zulu 8.15.0.1-linux64) (build 1.8.0_92-b15)
OpenJDK 64-Bit Server VM (Zulu 8.15.0.1-linux64) (build 25.92-b15, mixed mode)

* Operating system: PBS scheduler

Thanks in advance for your help and input. Justin

huguesfontenelle commented 4 years ago

I'm having a similar problem right now, which is why I came here... But then I remembered a couple of blog posts that I read a while ago:

jvivian-atreca commented 3 years ago

I am running into this issue now on a local run. I have cache set to lenient, and if I Ctrl+C the run and rerun it with -resume with NO changes to input files or code, the pipeline will cache anywhere between 4 and several thousand jobs. I have to do this several times until it catches up to the correct place. This is on an EC2 machine with a mounted EBS volume, so I'm not sure what would cause this.

First run shows 2 cached jobs for mmseqs2_filtering:search:

[screenshot of run output]

Ctrl+C and run again, and it caches several thousand jobs. Again, no changes were made between executions of the same command.

[screenshot of run output]

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

chartl-rancho commented 3 years ago

Dropping a line to re-open. Just experienced this running on a server with a fixed weekly restart that kills the NF pipeline.

The first run appears to do fine with a --restart, but it then fails to cache any future work as the pipeline progresses.

The 2nd, 3rd (etc.) restarts all roll back to the one-week mark of progress (or thereabouts).

ggavelis commented 2 years ago

Has anyone found a workaround?

Nextflow caches only about 7% of the outputs from my third process (and caches even fewer from processes downstream). This isn't fixed by using the cache 'lenient' or 'deep' directives. (nextflow version 21.04.3.5560).

I also tried the -dump-hashes option suggested by the resume_troubleshooting_tips, but the output was cryptic (to me, at least).
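For anyone else stuck at the same point, one way to make the -dump-hashes output actionable is to capture two runs in separate logs and diff the hash lines; a sketch, where main.nf is a placeholder and the grep pattern is an assumption about the log format that may need adjusting to your Nextflow version:

```bash
# Keep a separate log file per run
nextflow -log run1.log run main.nf -dump-hashes
nextflow -log run2.log run main.nf -resume -dump-hashes

# Tasks whose hash lines differ between the two logs are the ones
# that will not resume; the differing component (script, inputs,
# environment) is the culprit.
grep 'cache hash' run1.log | sort > h1.txt
grep 'cache hash' run2.log | sort > h2.txt
diff h1.txt h2.txt
```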

philippbayer commented 2 years ago

I'm also encountering this with Nextflow v21.04.3 and the nf-core/mag pipeline revision e065754c46.

I think this is due to the Lustre file system. Setting cache = 'lenient' in my nextflow.config helped with this issue somewhat, but when one of the last Nextflow jobs crashes, running -resume still reruns from 1000/1200 jobs, not from 1199/1200. I have a feeling that giving the exact run name to resume helps a bit (-resume whatever_the_last_job_was_named), but the rerun job numbers are so random that it's hard to tell.
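On the exact-name point: previous runs and their session IDs can be listed rather than guessed; a sketch, where the run name shown is hypothetical:

```bash
# List previous runs with their names, timestamps, and session IDs
nextflow log

# Resume a specific earlier run by its name (or session ID)
nextflow run main.nf -resume deadly_einstein
```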

I have also set scratch = '/tmp' for the BUSCO job, as it was generating too many files for the per-user one-million-files limit. That could cause some kind of issue, but the resulting files are definitely generated and are in the results/ folder.
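For reference, a per-process scratch setting like the one described can be scoped in nextflow.config; a sketch, assuming the process is named BUSCO:

```groovy
// nextflow.config -- a sketch; the selector assumes a process named BUSCO
process {
    withName: 'BUSCO' {
        scratch = '/tmp'   // run the task in node-local /tmp, copy results back
    }
}
```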

It's a tricky problem that is probably caused by many different issues.

chartl-rancho commented 2 years ago

My specific issue was with the order of channels not being preserved. For instance, x = Channel.fromPath('foo.csv').splitCsv(header: true) keeps the line ordering of foo.csv, but y = some_process(x.map{ it[2] }) does not necessarily maintain that order.

I found that keeping a joinable key in all outputs and using .join to maintain order between runs ensures that --resume works as designed.
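To make that concrete, a minimal sketch of the keyed pattern (the process name and CSV columns are illustrative):

```groovy
// main.nf (DSL2) -- a sketch of carrying a join key through a process
nextflow.enable.dsl = 2

process some_process {
    input:
    tuple val(key), val(value)

    output:
    tuple val(key), stdout

    script:
    """
    echo -n '${value}' | tr 'a-z' 'A-Z'
    """
}

workflow {
    // Keep each row's 'id' column as a join key; 'id' and 'sample'
    // are hypothetical column names in foo.csv
    rows = Channel.fromPath('foo.csv')
                  .splitCsv(header: true)
                  .map { row -> tuple(row.id, row.sample) }

    results = some_process(rows)

    // join() re-pairs each result with its original row by key,
    // so correctness no longer depends on channel ordering
    rows.join(results).view()
}
```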