Open emmastrand opened 4 months ago
It's relatively easy to add a database, so maybe you could contribute this yourself? You need to provide one or two URLs for download and a formatting script that outputs files suitable for DADA2's `assignTaxonomy` and `addSpecies` functions. The URLs, together with some information, go into `conf/ref_databases.config` and the formatting scripts reside in `bin`. Here's the documentation for contributing to nf-core pipelines: https://nf-co.re/docs/contributing/contributing_to_pipelines. Eternal glory as a contributor to Ampliseq awaits you! :-)
Thanks for sharing this! Do other contributors have advice/tips/scripts for formatting a script that outputs files suitable for DADA2? This is mostly where I'm stuck.
You can view all formatting scripts in the `bin` directory of the pipeline. The files look like the examples below.
`assignTaxonomy.fna`:
>Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Rickettsiaceae;Rickettsia;Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA
I.e. each sequence's name is just the full taxonomy string.
`addSpecies.fna`:
>GB_GCA_000012145.1 Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA
Here, each species has an accession followed by the species name. AFAIK, the accession is not used for anything, but I guess it has to be unique.
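If you want to be on the safe side about uniqueness, a quick check over the headers is cheap. This is only a sketch: `duplicate_accessions` is a made-up helper name, not something from the pipeline, and it assumes the accession is the first whitespace-separated token on each header line, as in the example above.

```python
from collections import Counter


def duplicate_accessions(fasta_lines):
    """Return the accessions that appear on more than one FASTA header line.

    Assumption: headers look like '>ACCESSION Genus species', so the
    accession is the first whitespace-separated token after '>'.
    """
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if fields:  # skip headers that are empty after '>'
                ids.append(fields[0])
    counts = Counter(ids)
    return sorted(acc for acc, n in counts.items() if n > 1)
```

You could call it with `duplicate_accessions(open("addSpecies.fna"))`; an empty list means every accession is unique.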
Your script just needs to output these two files with the above names starting from whatever you can download.
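To make that concrete, here is a rough Python sketch of what such a formatter could look like. It is not one of the scripts in `bin`. It assumes you have already turned your download into a hypothetical three-column, tab-separated table (accession, semicolon-separated taxonomy ending in the species name, sequence); that input layout and the function names are my assumptions, only the two output layouts come from the examples above.

```python
import csv
import os


def read_tsv(handle):
    """Yield (accession, ranks, sequence) from a tab-separated table.

    Assumed (hypothetical) columns: accession, taxonomy as
    'Kingdom;...;Genus;Genus species', and the sequence itself.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # ignore blank lines
            continue
        accession, taxonomy, sequence = row
        yield accession, taxonomy.split(";"), sequence.upper()


def write_dada2_refs(records, assign_out, species_out):
    """Write both reference files in the layouts shown above."""
    for accession, ranks, sequence in records:
        species = ranks[-1]  # assumes the last rank is the species name
        # assignTaxonomy: the header is the full taxonomy string
        assign_out.write(">" + ";".join(ranks) + "\n" + sequence + "\n")
        # addSpecies: the header is the accession followed by the species name
        species_out.write(">" + accession + " " + species + "\n" + sequence + "\n")


def format_database(tsv_path, out_dir="."):
    """Convert the table at tsv_path into the two files with the expected names."""
    assign_path = os.path.join(out_dir, "assignTaxonomy.fna")
    species_path = os.path.join(out_dir, "addSpecies.fna")
    with open(tsv_path, newline="") as src, \
            open(assign_path, "w") as assign_out, \
            open(species_path, "w") as species_out:
        write_dada2_refs(read_tsv(src), assign_out, species_out)
```

Calling `format_database("mitofish.tsv")` (file name made up) would then write both files into the current directory. Getting from the raw download to that table is the database-specific part you would still need to work out.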
You can also use nf-core's Slack (#ampliseq channel) to discuss.
Description of feature
Hi there - I'm trying to use ampliseq for 12S amplicon data and running into issues adding our own custom database b/c of incompatible formatting. It would be great for ampliseq to have this amplicon option along with CO1, 16S, 18S, etc. This is an example of one database that we would use. https://mitofish.aori.u-tokyo.ac.jp/. Thanks!