Open jfy133 opened 1 year ago
Metaeuk
For MetaEuk, specifying params.metaeuk_mmseqs_db = "UniProtKB/Swiss-Prot"
only entails downloading a small database: from a quick check, the FASTA it's based on is only 87 MB. So that should be feasible to run in a more automated way?
@prototaxites
Yeah that definitely should be feasible! Is it a single file with a public URL?
"UniProtKB/Swiss-Prot" is the string passed to the mmseqs databases
command, which downloads the latest release of the database AFAIK. Now that I think about it, I'm not sure there's a way to specify a version, unfortunately, which limits reproducibility.
An alternative would be to pass the URL of a FASTA file to --metaeuk_db
- in the MetaEuk module test, I passed it the yeast .faa in the test-data repo: https://github.com/nf-core/modules/blob/master/tests/modules/nf-core/metaeuk/easypredict/main.nf, which seemed to work OK, but it might be better to find a prokaryotic file to use with the test data.
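Since `mmseqs databases` fetches whatever release is current, one way to at least record what was used is to log the exact command and a checksum of the downloaded file. A minimal sketch, assuming the usual `mmseqs databases <name> <outdb> <tmpdir>` argument order; the helper names, `swissprot_db` and `tmp` are hypothetical and not part of the pipeline:

```python
import hashlib
from pathlib import Path


def mmseqs_databases_cmd(name: str, out_db: str, tmp_dir: str) -> list[str]:
    """Build (but do not run) the download command for a named MMseqs2 database."""
    return ["mmseqs", "databases", name, out_db, tmp_dir]


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so a large FASTA need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical output names; only the database string comes from the thread.
cmd = mmseqs_databases_cmd("UniProtKB/Swiss-Prot", "swissprot_db", "tmp")
print(" ".join(cmd))
```

Storing the checksum next to the results would not pin a version, but it would show when two runs used different releases.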
List of tools that need to be covered somehow, and where they are currently covered:
tool | config | comment |
---|---|---|
adapterremoval | test_adapterremoval | maybe could be moved into ancient-dna, as they are the people who mostly use it? |
aria2 | NONE | used with checkm |
bbmap/bbnorm | test_bbnorm | Short test |
bcftools | test_ancient_dna | |
cat | test | |
checkm | NONE | |
centrifuge | test | |
concoct | test_binrefinement / test_concoct | test_concoct: everything else deactivated due to very long run time |
dastool | test_binrefinement / test_ancient_dna | |
fastp | test | |
fastqc | test | |
freebayes | test_ancient_dna | |
genomad | test_virus_identification | everything else turned off (necessary?) |
gtdbtk | NONE | large db (make mini?) |
gunc | NONE | does it have a large db? |
gunzip | test | |
krona | test | |
maxbin | test | |
metabat2 | test | |
metaeuk | test_adapterremoval | |
mmseqs | NONE | only if --metaeuk_mmseqs_db is supplied |
multiqc | test | |
prodigal | test | |
prokka | test | |
pydamage | test_ancient_dna | |
samtools | test_ancient_dna | |
seqtk | test_bbnorm | Short test |
tiara | test_adapterremoval | note: has a special DASTOOL_FASTATOCONTIGBIN_TIARA process that doesn't actually run DASTOOL! |
bowtie2 (phix) | test | |
bowtie2 (host) | test_host_rm / test_hybrid_host_rm | |
bowtie2 (assembly) | test | |
busco | test | |
CAT | NONE | large db (make mini?) |
filtlong | test_hybrid / test_hybrid_rm | |
kraken2 | test | |
megahit | test | |
spades | test | |
spadeshybrid | test_hybrid / test_hybrid_rm | |
nanolyse | test_hybrid / test_hybrid_rm | |
nanoplot | test_hybrid / test_hybrid_rm | |
porechop | test_hybrid / test_hybrid_rm | |
quast | test | |
tiara | test_adapterremoval | |
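
The gaps in the table above can be pulled out mechanically. A minimal sketch, using a handful of rows copied from the table (an empty list stands for NONE); the `uncovered` helper is hypothetical and only illustrates the bookkeeping:

```python
# A few rows from the table: tool -> test configs that currently exercise it.
coverage = {
    "aria2": [],
    "checkm": [],
    "gtdbtk": [],
    "gunc": [],
    "mmseqs": [],
    "CAT": [],
    "genomad": ["test_virus_identification"],
    "metaeuk": ["test_adapterremoval"],
    "pydamage": ["test_ancient_dna"],
    "fastp": ["test"],
}


def uncovered(table: dict[str, list[str]]) -> list[str]:
    """Return the tools that no test config runs, sorted case-insensitively."""
    return sorted((tool for tool, configs in table.items() if not configs),
                  key=str.lower)


print(uncovered(coverage))
```
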
Additional:
context | config |
---|---|
samplesheet input | test |
assembly input | test_binrefinement |
Proposal:
name | description | tools | done |
---|---|---|---|
test | default (incl. those that run pre-assembly if db supplied; skip metaeuk?), except concoct | centrifuge, kraken2, krona | yes |
test_single_end | as test but with single-end input (as current; only skips steps where reads are not needed) | | yes |
test_alternatives | all alternative tools | adapterremoval, checkm, bin_domain_classification | yes |
test_preassembly_binrefine | genomad, concoct, binning refinement (gunc), metaeuk | concoct, gunc, metaeuk | |
test_hybrid_rm | for long reads, with host removal | | |
test_nothing | everything off | | yes |
test_extras | standard test but with additional opt-in functionality | keep_phix, bbnorm, host_rm, genomad, ancient_dna, tiara | |
test_bigdb | tools with big databases (mini versions: CAT/GTDB-Tk) | | |
test_full | as current | | |
@jfy133 this is fantastic! The old structure of tests was very confusing 😅
Description of feature
A major problem we currently have during development is that our CI tests are nowhere near comprehensive enough, because the pipeline uses extremely large database files that do not fit within GHA resource allocations.
We should develop and document a suite of manual tests developers should run on their own infrastructure to ensure the pipeline is indeed working as intended.
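One way to document such a suite is to generate the commands developers should run locally. A rough sketch, assuming the standard `nextflow run nf-core/mag -profile <profile>,<container> --outdir <dir>` invocation; which profiles count as "manual" and the output directory names are assumptions for illustration, and `test_bigdb` is only a proposed profile name:

```python
def manual_test_cmd(profile: str, outdir: str, container: str = "docker") -> list[str]:
    """Build the command line for one manual test run (nothing is executed here)."""
    return [
        "nextflow", "run", "nf-core/mag",
        "-profile", f"{profile},{container}",
        "--outdir", outdir,
    ]


# Assumed split: profiles too large for GHA are run by hand.
manual_profiles = ["test_bigdb", "test_full"]

for profile in manual_profiles:
    print(" ".join(manual_test_cmd(profile, f"results_{profile}")))
```

Printing rather than executing keeps the list easy to paste into contributor documentation.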
mag missing configs and tests
For Automated CI
For manual CI
Does not need a database
Databases on AWS
Databases NOT on AWS