nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Add GBIF adapted CO1 database #472

Closed erikrikarddaniel closed 1 year ago

erikrikarddaniel commented 2 years ago

Description of feature

@johnne is working on making a version of the CO1 database better adapted to the GBIF backbone taxonomy. When that is available, make sure it's selectable in Ampliseq.

johnne commented 1 year ago

The database is up now at https://scilifelab.figshare.com/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192

johnne commented 1 year ago

@erikrikarddaniel How was it again with the taxonomic ranks used in ampliseq? I see the finished assignTaxonomy.fna file has e.g. Bacteria;Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus as the header, so it's using the taxlevels "Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species". The custom coidb has "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species" as taxlevels. So should we aim to create an assignTaxonomy.fna file that duplicates the Kingdom value to Domain, or instead use the database with --dada_assign_taxlevels, which I see in the documentation?

Cheers, John

erikrikarddaniel commented 1 year ago

I think it would be best to add a taxlevels entry in the conf/ref_databases.config; see e.g. the midori2-co1 entry.
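As a rough sketch of what such an entry could look like (the key names and values below are illustrative placeholders, not copied from the real conf/ref_databases.config; check the actual midori2-co1 entry for the schema used by the pipeline):

```groovy
// Hypothetical database entry for conf/ref_databases.config;
// keys modeled loosely on existing entries, values are placeholders.
'coidb' {
    title     = "COI reference sequences from BOLD DB"
    file      = [ "https://example.org/coidb/assignTaxonomy.fna.gz" ]
    fmtscript = "taxref_reformat_coidb.sh"
    taxlevels = "Kingdom,Phylum,Class,Order,Family,Genus,Species"
}
```

Adding taxlevels here means the database ships with its own rank definitions and no user-side --dada_assign_taxlevels is needed.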

johnne commented 1 year ago

@erikrikarddaniel @jtangrot I'm a little unsure how to use the fmtscript part of the database entries in the workflow. Does it have to be a shell script, or can it be a python script in the bin/ folder? And do you have a suggestion for how to test whether/how such a script works within the workflow?

erikrikarddaniel commented 1 year ago

It can, AFAIK, be any kind of script, but it will be executed by the current container, so the interpreter needs to be in that container and findable via whatever you have in the #! line. The best test is to add a test for the database you're adding: add a config file, include it in nextflow.config, and add it to .github/workflows/ci.yml.
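To make the rank question above concrete, one thing such an fmtscript could do is duplicate the first (Kingdom) rank so coidb's seven-level headers match ampliseq's eight default levels. This is only a minimal sketch of that one step; the function name and I/O handling are hypothetical, not taken from the actual script:

```python
#!/usr/bin/env python
# Sketch of one fmtscript step: prefix a 7-level ';'-separated taxonomy
# header with a copy of its first rank, so Kingdom is duplicated as Domain.
# Illustrative only; the real reformat script handles full FASTA files.

def add_domain(header: str) -> str:
    """Return the header with its first taxonomic rank duplicated."""
    taxonomy = header.lstrip(">")
    levels = taxonomy.split(";")
    return ">" + ";".join([levels[0]] + levels)

if __name__ == "__main__":
    print(add_domain(">Animalia;Arthropoda;Insecta;Diptera;Culicidae;Aedes;Aedes aegypti"))
```

Run over every header line of assignTaxonomy.fna, this would produce the eight-level headers DADA2 expects with ampliseq's default taxlevels.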

johnne commented 1 year ago

I wasn't able to use a python script because python is not included in the ubuntu:20.04 image used for the FORMAT_TAXONOMY process.

I suggest adding a container keyword to the ref_databases.config file that allows any container to be used for the formatting script. See this commit
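The idea of a per-database container key might look roughly like this (illustrative only; the actual change is in the linked commit, and other keys are as in the existing entries):

```groovy
// Hypothetical per-database container override in conf/ref_databases.config.
'coidb' {
    fmtscript = "taxref_reformat_coidb.py"
    container = "quay.io/biocontainers/python:3.8.3"
}
```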

d4straub commented 1 year ago

If using that container keyword works fine so far, it seems to me a valid option. However, it would also need to work for conda & singularity, which might complicate things? Alternatively, the ubuntu container isn't set in stone and you could look for another lightweight container that includes both python and bash. That way one container (per conda, docker & singularity) would serve it all.

erikrikarddaniel commented 1 year ago

I agree with my namesake, and was perhaps leaning towards relying on one image that includes everything, which means both python and bash, but also sed. Not working with singularity would of course be a showstopper.

johnne commented 1 year ago

It's probably not too difficult to make it work for singularity as well (it already works for conda); it's more that my groovy/nextflow skills are still limited. But having a universal image sounds better. I'll see what I can find. Is there a specific reason you're pulling singularity images from https://depot.galaxyproject.org/singularity/?

johnne commented 1 year ago

@d4straub How lightweight does the container have to be? Of course it has to contain bash, so the slimmest python containers are out. But I see nextflow also relies on the ps command, which isn't included in some of the slightly bigger (~110M) containers that do contain bash. The python:3.7 container seems to work, though, but it's 900M.

d4straub commented 1 year ago

Is there a specific reason you're pulling singularity images from https://depot.galaxyproject.org/singularity/?

I heard (never tried it myself) that pulling docker images with singularity is more resource intensive and might be less reproducible than using singularity images directly (which are at https://depot.galaxyproject.org/singularity/).
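For context, nf-core modules typically select between the two registries per engine with a pattern roughly like the following (paraphrased from the common module style; the exact expression in ampliseq's modules may differ):

```groovy
// Typical nf-core container directive: use a native Singularity image from
// the Galaxy depot under Singularity, otherwise the matching Docker image
// from quay.io biocontainers.
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
    'https://depot.galaxyproject.org/singularity/python:3.8.3' :
    'quay.io/biocontainers/python:3.8.3' }"
```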

How lightweight does the container have to be?

No specific threshold; I just wanted to make you aware that some images are really unnecessarily large for the task at hand. But just take what works. I use singularity, and the container sizes vary from 11MB (vsearch) to 2.5GB (QIIME2). For example, CUTADAPT_SUMMARY with depot.galaxyproject.org-singularity-python-3.8.3.img is only 91MB. The processes ASSIGNSH, TRUNCLEN, FILTER_STATS & FORMAT_TAXRESULTS use pandas:1.1.5 for python scripts, with the singularity container depot.galaxyproject.org-singularity-pandas-1.1.5.img at 97MB; that also seems random to me, and we probably just used a small image there. Edit: it's also an advantage to use the same container for several processes instead of a different container in each, because then only one container needs to be downloaded. In practice that might actually have low impact.

johnne commented 1 year ago

It works with quay.io/biocontainers/python:3.8.3 for docker. However, I don't feel confident about getting all the checks to work; see https://github.com/NBISweden/ampliseq/pull/2.

d4straub commented 1 year ago

In the meantime

but nextflow run nf-core/ampliseq -r dev -profile test,singularity still works locally for me with the newest nextflow version (despite warnings), and nextflow run nf-core/ampliseq -r dev -profile test_iontorrent,singularity is also fine. Just confirmed. Not sure that helps, though...

d4straub commented 1 year ago

That seems to be completed by #518 and #534?

erikrikarddaniel commented 1 year ago

I would say so, yes.

d4straub commented 1 year ago

Then I'll close it for now.