Implement CO1 taxonomic classification databases, or just custom database inputs?

tjcreedy commented 2 years ago

Description of feature

Hi (again, sorry).

We would like to test ampliseq with CO1 data as an automated generic metabarcoding workflow, however no CO1 taxonomic databases are currently implemented. MIDORI produces curated subsets of NCBI nt, formatted for use in a range of different classifiers (including DADA2, QIIME & RDP), for a range of loci. Annoyingly their download page isn't very easy to access programmatically, at least for me with limited webdev knowledge, but it can be viewed here: http://reference-midori.info/download.php# - for example, I would use the Latest_GenBankRelease249 > DADA2 > uniq > MIDORI_UNIQ_NUC_GB249_CO1_DADA2.fasta.gz file.

Is it on the roadmap to implement more databases within the ampliseq pipeline? And/or would it be feasible to add the option for users to supply their own databases for the classifier process(es)? It may be that the latter option is already possible with the use of config files, as I've recently learned.

Many thanks!

d4straub commented 2 years ago

Is it on the roadmap to implement more databases within the ampliseq pipeline?

Hi again, no trouble, its not on the roadmap but it would be good to add that! I cannot promise any timeline but will look into it eventually.

And/or would it be feasible to add the option for users to supply their own databases for the classifier process(es)?

It is possible but not simple using configs by using -c and overwriting a key for conf/ref_databases.config.

d4straub commented 2 years ago

I had a short look and only the latest release (GB249) includes DADA2 ready files (thats fine), but the download source is at http://reference-midori.info/download/Latest_GenBankRelease249/DADA2/uniq/MIDORI_UNIQ_NUC_GB249_CO1_DADA2.fasta.gz where the Latest_GenBankRelease249will turn eventually into an Archive/GenBankRelease249, at least when following the logic of the last release. That means the download link isnt stable. That means its pretty useless to add it to the pipeline, because released versions of the pipeline will have a fixed link (well, can be modified by -c, but someone who can do that can also use now -c...). Either (1) the database owners change that logic (i.e. provide stable links) or (2) store those files also at zenodo or similar, comparable to e.g. SILVA DADA2 reference. Otherwise (3) we would need to upload it to a aws bucket or such to make it reliably available (sort of last resort).

tjcreedy commented 2 years ago

Thanks for looking into this - I can see your reasoning about the stability of the links. I'll contact the MIDORI authors and ask if making stable links are on their roadmap - they seem to frequently add new classifiers and database types so that flexibility may stretch to modifying their data structure.

erikrikarddaniel commented 2 years ago

I have colleagues working on coI annotation. I think they're into creating a curated database, like we did with GTDB. IIRC, they're also looking into other tools for this than the DADA2 functions. This shouldn't be too far away in the future, and it has sounded like they were willing to contribute code with a PR.

d4straub commented 2 years ago

Sounds great!

erikrikarddaniel commented 2 years ago

I just heard from my colleague that he doesn't expect co1 annotation to be ready before June but rather in early autumn.

d4straub commented 2 years ago

Alright, still, nice to hear thats on the horizon!

The download part of the MIDORI website was changed so that there are stable links for all database versions. That looks usable now. I have the plan to try it out, not sure when I'll find time.

d4straub commented 2 years ago

@tjcreedy I have added quick & dirty the MIDORI CO1 dataset. Could you please test the branch add-midori-databases of my repo fork (d4straub/ampliseq) with:

nextflow pull d4straub/ampliseq
nextflow pull d4straub/ampliseq -r add-midori-databases
nextflow run d4straub/ampliseq -r add-midori-databases <your parameters> --dada_ref_taxonomy "midori-co1=gb249"

I think adding species isnt working yet, in that case you can also try with --skip_dada_addspecies. It might well be that taxa ranks are wrongly annotated. The question here is whether the annotation seems reasonable to you (I have no experience with CO1). _edit: sorry, forgot to mention the new key for --dada_reftaxonomy, which is quite important in that test! I added it above.

tjcreedy commented 2 years ago

Thanks a lot for your interest and discussion on this.

Working with Arthropods, taxonomic classification of CO1 sequence is an area with a lot of active interest but a high bar for accuracy. Some of the major methods used are BLAST vs NT then processing with MEGAN or R taxonomizer, using USEARCH or VSEARCH sintax against a database (we use MIDORI for this). RDP and PPLACER are also widely used, and phylogenetic based methods are being developed although these require much better reference sets.

I've tested with your above commands (without --skip_dada_addspecies) with a small CO1 dataset and the workflow completes successfully with no issues, as far as I can tell. The taxonomy assignment isn't great - nothing is classified below Phylum level, where we'd hope for around family level assignments for this dataset. Hence I can't really tell if species assignment worked or not. I need to do some further investigation and comparison against other classifiers to explore the issue further, I'll get back to you if it seems to be something about the ampliseq implementation of dada2.

erikrikarddaniel commented 2 years ago

Den tis 17 maj 2022 kl 15:40 skrev tjcreedy @.***>:

Thanks a lot for your interest and discussion on this.

Working with Arthropods, taxonomic classification of CO1 sequence is an area with a lot of active interest but a high bar for accuracy. Some of the major methods used are BLAST vs NT then processing with MEGAN or R taxonomizer, using USEARCH or VSEARCH sintax against a database (we use MIDORI for this). RDP and PPLACER are also widely used, and phylogenetic based methods are being developed although these require much better reference sets.

Pplacer is certainly something I'd like to have in the pipeline -- there's an issue someplace, I think. For lack of time and due to the complexity of gathering reference sequences and trees, the work stalled I'm afraid, but at some point I'd like to bring this up again.

I've tested with your above commands (without --skip_dada_addspecies) with a small CO1 dataset and the workflow completes successfully with no issues, as far as I can tell. The taxonomy assignment isn't great - nothing is classified below Phylum level, where we'd hope for around family level assignments for this dataset. Hence I can't really tell if species assignment worked or not. I need to do some further investigation and comparison against other classifiers to explore the issue further, I'll get back to you if it seems to be something about the ampliseq implementation of dada2.

— Reply to this email directly, view it on GitHub https://github.com/nf-core/ampliseq/issues/427#issuecomment-1128885113, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALTHQABRIBD4IEAWMS66OLVKOOWFANCNFSM5VX7536Q . You are receiving this because you commented.Message ID: @.***>

d4straub commented 2 years ago

I've tested with your above commands (without --skip_dada_addspecies) with a small CO1 dataset and the workflow completes successfully with no issues, as far as I can tell. The taxonomy assignment isn't great - nothing is classified below Phylum level, where we'd hope for around family level assignments for this dataset.

Thanks for testing! Could you please run the same command as above (probably with -resume) and with --cut_dada_ref_taxonomy, this could improve results.

tjcreedy commented 2 years ago

This does improve results, quite substantially, thank you - I hadn't seen that option yet and it really makes a difference. I've done this sort of trimming before in other projects but haven't applied it to classification before, as other classifiers seem to generate similar levels of results without it.

I can certainly say that the inclusion of this database in ampliseq is functional for CO1 and gives reasonable outputs with both the higher taxonomy and species-level classification. I'm going to compare further against other databases and classifiers and if you're interested I can feed back if I have further suggestions on databases or classifiers that might be useful for ampliseq integration.

Thanks!

d4straub commented 2 years ago

Happy to hear that --cut_dada_ref_taxonomy improved results! Will need to benchmark this one day properly with a standard/mock dataset (to decide whether it makes sense to have that set by default).

Yes please give feedback about your experiences with classifiers (or any other).

johnne commented 2 years ago

@tjcreedy, as @erikrikarddaniel mentioned we are working on a COI database put together using data at https://hosted-datasets.gbif.org/. The tool we use for creating the database is called coid. What it does is:

Download BOLD sequences, metadata and taxonomic backbone from GBIF
Filter the sequences to gene/taxa of choice (default is COI-5P gene and all taxa
Further filtering of records to e.g. remove duplicates, limit to BOLD BINs, fill in unassigned taxonomic ranks etc.

This has taken a lot of effort and we've hit several obstacles along the way, but we think it's looking pretty good at the moment. We've also performed some benchmarking of taxonomic assignment tools using train/test sequences from this database and it looks ok as well, with just the one drawback that we're seeing weird results with DADA2 (see #1511).

If you want you can try the coidb tool (installable from source or from conda (conda install -c bioconda coidb)). We're also planning to share the database publicly and work it into ampliseq in the manner described above by others.

tjcreedy commented 2 years ago

@johnne That sounds great, thank you for the information this looks like a really useful tool. I'll include this in my testing as well.

tjcreedy commented 2 years ago

@d4straub a small update on this, comparing the results of dada2 using the MIDORI database against other classifications: we used some arthropod samples, with known non-target reads present, and found that the dada2 + MIDORI classifier worked fairly well for metazoa, but poorly for non-metazoa and non-eukaryotes. As MIDORI only contains eukaryotes, we expected bacteria to simply not be classified, but unfortunately they were usually classified as insecta, which is not ideal.

We found that sintax, implemented within VSEARCH, also using the MIDORI database, was better at not classifying bacteria at all. However generally sintax was generally more conservative than dada2, all else being as equal as possible. We also used BLAST vs NCBI, with different approaches to filtering/parsing the results. We actually found that this was the most consistent way to get high-confidence taxonomic assignments to species or genus level. We're yet to get coidb working.

This wasn't in any way a comprehensive examination, our main goal is to get reasonable classifications and ensure some consistency with previous approaches. As far as ampliseq goes, implementing a different classifier is probably not worth the effort for the relatively small use cases, but I do wonder if being able to integrate custom databases in a straightforward way as a user might be widely useful - for example, this would allow me to test expanding the MIDORI database with some non-eukaryote sequences to improve classification, or allow testing of different versions of the coidb database or other datasets available as they are improved.

d4straub commented 2 years ago

the dada2 + MIDORI classifier worked fairly well for metazoa, but poorly for non-metazoa and non-eukaryotes. As MIDORI only contains eukaryotes, we expected bacteria to simply not be classified, but unfortunately they were usually classified as insecta, which is not ideal.

Good to hear that it works in general.

As far as ampliseq goes, implementing a different classifier is probably not worth the effort for the relatively small use cases

Well, I have not benchmarked classifiers specifically, and if it turns out VSEARCH is also for other cases better, it would make sense to add it. Let me know if you come across something in that direction!

but I do wonder if being able to integrate custom databases in a straightforward way as a user might be widely useful - for example, this would allow me to test expanding the MIDORI database with some non-eukaryote sequences to improve classification, or allow testing of different versions of the coidb database or other datasets available as they are improved.

I agree, a custom database for DADA2 taxonomic classification might be a good idea. I'll have a look into it, but cannot promise when. It might get tricky to provide the user with appropriate warnings/errors when doing so, but on the other hand when providing a custom database I expect a little effort from the user anyway.

d4straub commented 2 years ago

Hi there @tjcreedy , in the dev branch, custom database input is now possible with --dada_ref_tax_custom, --dada_ref_tax_custom_sp and possibly dada_assign_taxlevels.

Also, MIDORI2 CO1 database is available with --dada_ref_taxonomy "midori2-co1" or "midori2-co1=gb250".

It would be great if you have the time to test those functions and give feedback.

tjcreedy commented 2 years ago

Hi @d4straub, thanks a lot. Will be able to test these out with my data next week and will let you know how I get on.

tjcreedy commented 2 years ago

Hi @d4straub I have tested the dev branch using MIDORI2 and using the database generated by coidb. @johnne's coidb is very nice and it's so convenient that it produces dada2 format outputs. Both seem to run without issue on our test data and produce taxonomic assignments.

As a note for interest, one issue we have with CO1 is that because of the primer generality, we amplify a lot of unwanted CO1 analogues from non-Eukaryotes (particularly an issue in eDNA studies), Since both databases only comprise Eukaryotes we often have ASVs classified as arthropods that come back as bacterial or fungal from BLAST against NT. Sometimes these non-eukaryotic ASVs have lower bootstrap values in the taxonomy assignment so these could in theory be filtered by setting minBoot in assignTaxonomy then implementing an option for discarding ASVs with no taxonomy. But many non-eukaryotic ASVs have as good or better bootstrap values than eukaryotic ASVs, so this might not be too useful.

Instead for our work I'm going to build a curated database from NCBI that contains non-Eukaryotes, perhaps using RESCRIPt this would then work with your new (and much appreciated) custom database input in ampliseq and can use existing taxonomy filtering options to screen for non-targets.

Thanks a lot for your support! From my point of view the original feature request is definitely fulfilled.

d4straub commented 2 years ago

Thanks for the feedback! Glad to hear it works.

As a note for interest, one issue we have with CO1 is that because of the primer generality, we amplify a lot of unwanted CO1 analogues from non-Eukaryotes (particularly an issue in eDNA studies), Since both databases only comprise Eukaryotes we often have ASVs classified as arthropods that come back as bacterial or fungal from BLAST against NT. Sometimes these non-eukaryotic ASVs have lower bootstrap values in the taxonomy assignment so these could in theory be filtered by setting minBoot in assignTaxonomy then implementing an option for discarding ASVs with no taxonomy. But many non-eukaryotic ASVs have as good or better bootstrap values than eukaryotic ASVs, so this might not be too useful.

You should be able to freely modify the minBoot by using -c assignTaxonomy.config that contains

process {
    withName: DADA2_TAXONOMY {
        ext.seed = "${params.seed}"
        ext.args = [
            'minBoot = 50',
            params.pacbio ? "tryRC = TRUE" :
                params.iontorrent ? "tryRC = TRUE" : ""
        ].join(',').replaceAll('(,)*$', "")
        publishDir = [
            [
                path: { "${params.outdir}/dada2/args" },
                mode: params.publish_dir_mode,
                pattern: "*.args.txt"
            ],
            [
                path: { "${params.outdir}/dada2" },
                mode: params.publish_dir_mode,
                pattern: "*.tsv"
            ]
        ]
    }
}

with your choice of minBoot. Otherwise it might be interesting to add a custom DB with eukaryotic and a few non-eukaryotic sequences to catch the false positives. But that would need benchmarking I assume.

I'll close that issue, please feel free to open another one when you have more suggestions. But pull requests are also welcome!

nf-core / ampliseq

Implement CO1 taxonomic classification databases, or just custom database inputs? #427

Description of feature