nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Develop full suite of tests for manual execution #501

Open jfy133 opened 10 months ago

jfy133 commented 10 months ago

Description of feature

A major problem we currently have during development is that our CI tests are nowhere near comprehensive enough, because the pipeline uses extremely large database files that do not fit within GHA resource allocations.

We should develop and document a suite of manual tests that developers run on their own infrastructure to ensure the pipeline is indeed working as intended.
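As a rough illustration of what "manual execution" could look like, the sketch below only builds (and does not run) one `nextflow run` command line per test profile. The profile names are taken from the configs listed later in this thread; the `docker` container profile and the `manual_tests` output directory are assumptions, not something the pipeline prescribes.

```python
import shlex

# Assumptions for illustration only; adjust to your own infrastructure.
PIPELINE = "nf-core/mag"
CONTAINER_PROFILE = "docker"  # could equally be singularity, conda, ...


def build_command(test_profile: str, outdir_root: str = "manual_tests") -> list[str]:
    """Return the argv for one manual test run, without executing it."""
    return [
        "nextflow", "run", PIPELINE,
        "-profile", f"{test_profile},{CONTAINER_PROFILE}",
        "--outdir", f"{outdir_root}/{test_profile}",
    ]


def build_suite(test_profiles: list[str]) -> list[str]:
    """Render one shell-safe command line per test profile."""
    return [shlex.join(build_command(profile)) for profile in test_profiles]


if __name__ == "__main__":
    # Profile names as they appear in the tables in this thread.
    for line in build_suite(["test", "test_ancient_dna", "test_hybrid"]):
        print(line)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; a developer could paste the lines into a job script for their own cluster or workstation.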

mag missing configs and tests

For Automated CI

For manual CI

Does not need a database
Databases on AWS
Databases NOT on AWS

prototaxites commented 4 months ago

Metaeuk

For MetaEuk, specifying params.metaeuk_mmseqs_db = "UniProtKB/Swiss-Prot" only entails downloading a small database. Doing a quick check, the FASTA it's based on is only 87 MB. So it should potentially be feasible to run this in a more automated way?

jfy133 commented 4 months ago

@prototaxites

Yeah that definitely should be feasible! Is it a single file with a public URL?

prototaxites commented 4 months ago

"UniProtKB/Swiss-Prot" is the string passed to the mmseqs databases command, which downloads the latest release of the database AFAIK. Now that I think about it, I'm not sure there's a way to specify a version, unfortunately, which limits reproducibility.

An alternative would be to pass the URL of a FASTA file to --metaeuk_db. In the MetaEuk module test, I passed it the yeast .faa from the test-data repo (https://github.com/nf-core/modules/blob/master/tests/modules/nf-core/metaeuk/easypredict/main.nf), which seemed to work OK, but it might be better to find a prokaryotic file to use with the test data.
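To make the two options concrete, here is a minimal sketch that renders each as a JSON parameters file of the kind Nextflow accepts via `-params-file`. The parameter names `metaeuk_mmseqs_db` and `metaeuk_db` come from the discussion above; the FASTA URL is a hypothetical placeholder, not a real test-data location.

```python
import json

# Option 1: let mmseqs download the named database
# (latest release, so not version-pinned).
PARAMS_MMSEQS = {"metaeuk_mmseqs_db": "UniProtKB/Swiss-Prot"}

# Option 2: point at one fixed FASTA file instead.
# Hypothetical placeholder URL; replace with a real, stable file.
PARAMS_FASTA = {"metaeuk_db": "https://example.org/path/to/proteins.faa"}


def render_params(params: dict) -> str:
    """Serialise a parameter set as JSON text for use with -params-file."""
    return json.dumps(params, indent=2, sort_keys=True)


if __name__ == "__main__":
    for name, params in (("mmseqs", PARAMS_MMSEQS), ("fasta", PARAMS_FASTA)):
        print(f"# metaeuk_{name}.json")
        print(render_params(params))
```

Either file could then be supplied alongside a test profile, for example `nextflow run nf-core/mag -profile test,docker -params-file metaeuk_fasta.json --outdir results` (the `docker` profile and file name are assumptions). Only the second option fixes the exact input, which is the reproducibility concern raised above.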

jfy133 commented 1 month ago

List of tools that need to be covered somehow, and where they are currently covered:

| tool | config | comment |
|---|---|---|
| adapterremoval | test_adapterremoval | maybe could be moved into ancient-dna, as they are the people who mostly use it? |
| aria2 | NONE | used with checkm |
| bbmap/bbnorm | test_bbnorm | Short test |
| bcftools | test_ancient_dna | |
| cat | test | |
| checkm | NONE | |
| centrifuge | test | |
| concoct | test_bin_refinement / test_concoct | test_concoct: everything else deactivated due to very long run time |
| dastool | test_binrefinement / test_ancient_dna | |
| fastp | test | |
| fastqc | test | |
| freebayes | test_ancient_dna | |
| genomad | test_virus_identification | everything else turned off (necessary?) |
| gtdbtk | NONE | large db (make mini?) |
| gunc | NONE | does it have a large db? |
| gunzip | test | |
| krona | test | |
| maxbin | test | |
| metabat2 | test | |
| metaeuk | test_adapterremoval | |
| mmseqs | NONE | only if --metaeuk_mmseqs_db is supplied |
| multiqc | test | |
| prodigal | test | |
| prokka | test | |
| pydamage | test_ancient_dna | |
| samtools | test_ancient_dna | |
| seqtk | test_bbnorm | Short test |
| tiara | test_adapterremoval | note: has a special DASTOOL_FASTATOCONTIGBIN_TIARA process that doesn't actually run DASTOOL! |
| bowtie2 (phix) | test | |
| bowtie2 (host) | test_host_rm / test_hybrid_host_rm | |
| bowtie2 (assembly) | test | |
| busco | test | |
| CAT | NONE | large db (make mini?) |
| filtlong | test_hybrid / test_hybrid_rm | |
| kraken2 | test | |
| megahit | test | |
| spades | test | |
| spadeshybrid | test_hybrid / test_hybrid_rm | |
| nanolyse | test_hybrid / test_hybrid_rm | |
| nanoplot | test_hybrid / test_hybrid_rm | |
| porechop | test_hybrid / test_hybrid_rm | |
| quast | test | |
| tiara | test_adapter_removal | |
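A quick way to pull the coverage gaps out of this list is sketched below. The (tool, config) pairs are transcribed by hand from the list above, so they are only as accurate as that list; the helper name `uncovered` is made up for illustration.

```python
# (tool, config) pairs transcribed from the list above; "NONE" = no test config.
COVERAGE = [
    ("adapterremoval", "test_adapterremoval"), ("aria2", "NONE"),
    ("bbmap/bbnorm", "test_bbnorm"), ("bcftools", "test_ancient_dna"),
    ("cat", "test"), ("checkm", "NONE"), ("centrifuge", "test"),
    ("concoct", "test_bin_refinement / test_concoct"),
    ("dastool", "test_binrefinement / test_ancient_dna"),
    ("fastp", "test"), ("fastqc", "test"), ("freebayes", "test_ancient_dna"),
    ("genomad", "test_virus_identification"), ("gtdbtk", "NONE"),
    ("gunc", "NONE"), ("gunzip", "test"), ("krona", "test"),
    ("maxbin", "test"), ("metabat2", "test"),
    ("metaeuk", "test_adapterremoval"), ("mmseqs", "NONE"),
    ("multiqc", "test"), ("prodigal", "test"), ("prokka", "test"),
    ("pydamage", "test_ancient_dna"), ("samtools", "test_ancient_dna"),
    ("seqtk", "test_bbnorm"), ("tiara", "test_adapterremoval"),
    ("bowtie2 (phix)", "test"),
    ("bowtie2 (host)", "test_host_rm / test_hybrid_host_rm"),
    ("bowtie2 (assembly)", "test"), ("busco", "test"), ("CAT", "NONE"),
    ("filtlong", "test_hybrid / test_hybrid_rm"), ("kraken2", "test"),
    ("megahit", "test"), ("spades", "test"),
    ("spadeshybrid", "test_hybrid / test_hybrid_rm"),
    ("nanolyse", "test_hybrid / test_hybrid_rm"),
    ("nanoplot", "test_hybrid / test_hybrid_rm"),
    ("porechop", "test_hybrid / test_hybrid_rm"), ("quast", "test"),
    ("tiara", "test_adapter_removal"),
]


def uncovered(rows: list[tuple[str, str]]) -> list[str]:
    """Return tools whose config column is NONE, in list order."""
    return [tool for tool, config in rows if config.strip().upper() == "NONE"]


if __name__ == "__main__":
    print(uncovered(COVERAGE))
```

With the rows as transcribed, this reports aria2, checkm, gtdbtk, gunc, mmseqs and CAT as having no test config.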

Additional:

| context | config |
|---|---|
| samplesheet input | test |
| assembly input | test_bin_refinement |

jfy133 commented 1 month ago

Proposal:

| name | description | tools | done |
|---|---|---|---|
| test | default (incl. those that run pre-assembly if db supplied; skip metaeuk?), except concoct | centrifuge, kraken2, krona | yes |
| test_single_end | test but with single-end input (as current, as it only skips steps where reads are not needed) | | yes |
| test_alternatives | all alternative tools | adapterremoval, checkm, bin_domain_classification | yes |
| test_preassembly_binrefine | genomad, concoct, binning refinement (gunc), metaeuk | concoct, gunc, metaeuk | |
| test_hybrid_rm | for long reads, with host removal | | |
| test_nothing | everything off | | yes |
| test_extras | standard test but with additional opt-in functionality | keep_phix, bbnorm, host_rm, genomad, ancient_dna, tiara | |
| test_bigdb | tools with big databases (mini versions: CAT/GTDB-Tk) | | |
| test_full | as current | | |

CarsonJM commented 1 month ago

@jfy133 this is fantastic! The old structure of tests was very confusing 😅