This pull-request includes all the changes I'm planning for a v0.4 release. It is mostly about:
fixing errors faced in production
handling large genomes
tidying up the generation of the output directories
Summary of the changes:
Upgraded all nf-core modules
Use the newer genomehubs/blobtoolkit Docker image
To address #90, I have updated the modules that contribute to the blobdir so that they no longer update their input blobdir in place. However, complete cache reusability is not achievable because the blobdir needs the list of software in a YAML file, which comes from a Nextflow collectFile call that is not reusable (each call creates a new temporary file).
Files in the blobdir are now compressed (181 MB → 225 KB on the cricket blobdir)
Only the relevant output Busco files are published, and the sequences are tar.gz-ed
The busco_diamond_blastp.nf subworkflow is completely restructured to allow the above
To handle large genomes, I brought some configuration bits from the read-mapping and genome-note pipelines. To make things unambiguous, I have removed the process_* label from all modules that have their own withName entry. That's what most of the .diff files are for. After complete optimisation (TOLIT-1931), since there won't be any withLabel process_* in conf/base.config, we'll be able to undo most of those .diff files.
This required bringing in a few steps to get the genome size and the read counts at the start of the pipeline.
The Busco settings are slightly different from the genome-note pipeline, as we're finding that interrupted runs (MEMLIMIT/RUNLIMIT) may leave a lot of temporary files which, on the Ubuntu 18.04 farms, take up disk space and RAM and prevent other jobs from running on the machines. Once the new settings are confirmed to work well, they should be backported to the genome-note pipeline.
Quick optimisation: I've patched the seqtk/subseq module to not compress the Fasta file since we had GUNZIP to uncompress it right after
More trace fields by default (the same as in the other pipelines)
More complete and accurate list of recognised file extensions for the reads
TOLIT-2021
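As an illustration of the blobdir compression mentioned in the summary, here is a minimal Python sketch. It is not the code used in the pipeline (which is not shown in this description). It assumes the blobdir fields are stored as plain `*.json` files directly inside the directory, and `compress_json_files` is a hypothetical helper name.

```python
import gzip
import shutil
from pathlib import Path


def compress_json_files(blobdir: Path) -> list[Path]:
    """Gzip every *.json file directly inside `blobdir` and remove the originals.

    Assumption: the directory holds uncompressed JSON files; nothing else is touched.
    Returns the paths of the compressed files, in sorted order.
    """
    compressed = []
    for json_path in sorted(blobdir.glob("*.json")):
        gz_path = json_path.with_name(json_path.name + ".gz")
        # Stream the content so that large files are never fully held in memory.
        with open(json_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        json_path.unlink()
        compressed.append(gz_path)
    return compressed
```

Calling `compress_json_files(Path("my_blobdir"))` (with a hypothetical directory name) would leave only `.json.gz` files behind. The size reduction depends entirely on the content of the files.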
Test runs
Initial failures:
And my runs on the same input data:
I've also tried three assemblies that had failed previously and left some files in /tmp, and the full test.
PR checklist
nf-core lint
nextflow run . -profile test,docker --outdir <OUTDIR>
nextflow run . -profile debug,test,docker --outdir <OUTDIR>
docs/usage.md is updated.
docs/output.md is updated.
CHANGELOG.md is updated.
README.md is updated (including new tool citations and authors/contributors).