nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

12S taxonomic classification databases #707

Open emmastrand opened 4 months ago

emmastrand commented 4 months ago

Description of feature

Hi there - I'm trying to use ampliseq for 12S amplicon data and am running into issues adding our own custom database because of incompatible formatting. It would be great for ampliseq to support this amplicon alongside CO1, 16S, 18S, etc. One example of a database that we would use is https://mitofish.aori.u-tokyo.ac.jp/. Thanks!

erikrikarddaniel commented 4 months ago

It's relatively easy to add a database, so maybe you could contribute this yourself? You need to provide one or two URLs for download and a formatting script that outputs files suitable for DADA2's assignTaxonomy and addSpecies functions. The URLs, together with some information about the database, go into conf/ref_databases.config, and the formatting scripts reside in the bin directory. Here's the documentation for contributing to nf-core pipelines: https://nf-co.re/docs/contributing/contributing_to_pipelines. Eternal glory as a contributor to Ampliseq awaits you! :-)

emmastrand commented 4 months ago

Thanks for sharing this! Do other contributors have advice, tips, or example scripts for writing a formatting script that outputs files suitable for DADA2? This is mostly where I'm stuck.

erikrikarddaniel commented 4 months ago

You can view all the existing formatting scripts in the bin directory of the pipeline. The two output files look like the examples below.

assignTaxonomy.fna:

>Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Rickettsiaceae;Rickettsia;Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA

I.e. each sequence's header is just the full taxonomy string, with the ranks separated by semicolons.

addSpecies.fna:

>GB_GCA_000012145.1 Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA

Here, each sequence header is an accession followed by the species name. AFAIK, the accession is not used for anything, but I guess it has to be unique.
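Since the accessions may need to be unique, a quick check on the generated file could look like the sketch below. It is only an illustration: the function name duplicate_accessions is made up here, and it assumes the header layout shown above (accession, a space, then the species name).

```python
from collections import Counter


def duplicate_accessions(fasta_text):
    """Return the accessions that occur more than once in addSpecies-style FASTA text.

    Hypothetical helper, not part of ampliseq or DADA2. The accession is taken
    to be the first whitespace-delimited token after the '>' of each header.
    """
    accessions = [
        line[1:].split(maxsplit=1)[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(accessions)
    return sorted(acc for acc, n in counts.items() if n > 1)
```

Calling it on the contents of addSpecies.fna and getting an empty list back would mean no accession is repeated.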

Your script just needs to produce these two files, with the names above, starting from whatever you can download.
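As a rough sketch of what such a script might do, the Python below turns a tab-separated table into the two files. The input layout is purely an assumption for illustration (three columns: accession, semicolon-separated taxonomy ending in the species name, sequence); a real download such as MitoFish will almost certainly need its own parsing step first. The function names are hypothetical and not taken from the pipeline.

```python
import csv
import os


def format_records(rows):
    """Build the text of both files from (accession, taxonomy, sequence) rows.

    Assumption: taxonomy is a semicolon-separated string whose last field
    is the species name, as in the assignTaxonomy.fna example above.
    """
    assign_lines, species_lines = [], []
    for accession, taxonomy, sequence in rows:
        ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
        species = ranks[-1]
        seq = sequence.strip().upper()
        # assignTaxonomy.fna: the header is the full taxonomy string
        assign_lines += [">" + ";".join(ranks), seq]
        # addSpecies.fna: the header is the accession, then the species name
        species_lines += [f">{accession.strip()} {species}", seq]
    return "\n".join(assign_lines) + "\n", "\n".join(species_lines) + "\n"


def write_reference_files(tsv_path, out_dir="."):
    """Read the hypothetical three-column TSV and write both FASTA files."""
    with open(tsv_path, newline="") as handle:
        # Skip empty lines; every remaining row must have exactly three columns
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    assign_text, species_text = format_records(rows)
    with open(os.path.join(out_dir, "assignTaxonomy.fna"), "w") as out:
        out.write(assign_text)
    with open(os.path.join(out_dir, "addSpecies.fna"), "w") as out:
        out.write(species_text)
```

Running write_reference_files("my_12s_table.tsv") (a placeholder file name) would write assignTaxonomy.fna and addSpecies.fna into the current directory. The existing scripts in bin remain the better reference for how the pipeline actually expects a formatting script to be invoked.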

You can also use nf-core's Slack (#ampliseq channel) to discuss.