thomasstjerne / blast-ws

MIT License

Potential further Reference databases to include (irrespective of what kind of preparation they need) #11

Open tobiasgf opened 10 months ago

tobiasgf commented 10 months ago

SILVA "SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya)." Note: This used to be the single large rDNA reference database. It is due to be redesigned over the coming years. GBIF S is in contact with the developers.

NAMERS "NAMERS is a data portal of high quality DNA reference sequences generated for use with environmental DNA technologies. Its current taxonomic focus is freshwater fish of British Columbia, Canada" Note: This database is based on genome skimming and contains sequences for most mitochondrial marker regions for the targeted species (Canada, freshwater). The scope is meant to expand.

MIDORI2 Publ: Leray et al. 2022 "MIDORI2 is a reference database of DNA and amino acid sequences used for taxonomic assignments of Eukaryota mitochondrial DNA sequences. Currently, the databases are available for download in seven formats. Since version GB237, MIDORI2 includes not only Metazoan but also all Eukaryota sequences. Since version GB242, MIDORI2 provides two types of databases, 1) with and 2) without binomial species description, such as "cf.," "aff.," and "sp." Since version GB243, we also provide amino acid sequence databases." Notes: MIDORI2 is becoming more widely used. It covers CO1, CytB, etc., based on GenBank.
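To make the "with/without binomial species description" split concrete, here is a minimal Python sketch of how one might flag names that lack a full binomial. The qualifier list and the function name `has_binomial_description` are assumptions for illustration only; this is not MIDORI2's actual filtering rule.

```python
import re

# Qualifiers that signal an uncertain or incomplete species-level name.
# ASSUMPTION: illustrative list only, not MIDORI2's real rule set.
OPEN_NOMENCLATURE = {"cf.", "aff.", "sp.", "cf", "aff", "sp"}


def has_binomial_description(name: str) -> bool:
    """Return True if `name` looks like a plain 'Genus epithet' binomial."""
    tokens = name.split()
    if len(tokens) < 2:
        return False  # genus only, or empty
    if any(token.lower() in OPEN_NOMENCLATURE for token in tokens):
        return False  # e.g. "Salmo sp." or "Gadus cf. morhua"
    genus, epithet = tokens[0], tokens[1]
    return bool(re.fullmatch(r"[A-Z][a-z-]+", genus)) and bool(
        re.fullmatch(r"[a-z-]+", epithet)
    )
```

A name such as `Salmo trutta` passes, while `Salmo sp.` and `Gadus cf. morhua` would land in the "without binomial" set.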

CALeDNA databases "These databases were made using the CRUX Pipeline, part of the Anacapa Toolkit (Curd et al., 2019 in MEE). We update these databases annually. If you are using a different primer or locus, we encourage you to make your own CRUX database. Let us know if you want additional reference libraries or if you need help making your own." Notes: Currently includes: 16S: min size 60, max size 400 | 18S: min size 80, max size 550 | PITS: min size 100, max size 800 | CO1: min size 100, max size 700 | FITS: min size 80, max size 700 | trnL: min size 33, max size 225 | Vertebrate 12S: min size 40, max size 150.
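The per-locus size limits listed above could be captured as a small lookup table, for example to sanity-check sequences before adding them to a BLAST database. This is only a sketch: the numbers are copied from the note, while the names `LOCUS_LENGTH_WINDOWS` and `within_window`, the inclusive bounds, and the interpretation of "size" as sequence length in bases are assumptions.

```python
# (min size, max size) per locus, as listed in the CALeDNA note above.
LOCUS_LENGTH_WINDOWS = {
    "16S": (60, 400),
    "18S": (80, 550),
    "PITS": (100, 800),
    "CO1": (100, 700),
    "FITS": (80, 700),
    "trnL": (33, 225),
    "Vertebrate 12S": (40, 150),
}


def within_window(locus: str, sequence: str) -> bool:
    """Check a sequence length against its locus window.

    ASSUMPTION: bounds are inclusive and measured in bases.
    Raises KeyError for a locus that is not in the table.
    """
    min_len, max_len = LOCUS_LENGTH_WINDOWS[locus]
    return min_len <= len(sequence) <= max_len
```

For example, `within_window("trnL", seq)` accepts a 33-base sequence but rejects a 32-base one.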

Mare-MAGE "The Mare-MAGE database contains quality-checked sequences of the mitochondrial 12S ribosomal RNA and Cytochrome c Oxidase I gene. All sequences were obtained from the National Center for Biotechnology Information GenBank (NCBI-GenBank), the European Nucleotide Archive (ENA), AquaGene Database and BOLD database, and have undergone intensive processing. They were checked for false annotations and non-target anomalies, according to the Integrated Taxonomic Information System (ITIS) and FishBase. The dataset is compiled in ARB-Home, FASTA and Qiime2 formats, and is publicly available from the Mare-MAGE database website (http://mare-mage.weebly.com/)."

COInr Publ: "COInr and mkCOInr: Building and customizing a nonredundant barcoding reference database from BOLD and NCBI using a semi-automated pipeline" "The mkCOInr tool is a series of Perl scripts designed to download sequences from BOLD and NCBI, to build the COInr database and to customize it according to the users’ needs. It is possible to select or eliminate sequences for a list of taxa, select a specific gene region, select for minimum taxonomic resolution, add new custom sequences, and format the database for blast, vtam, qiime and rdp classifier."
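As a rough illustration of the kind of customization described in the quote (selecting or eliminating taxa and requiring a minimum taxonomic resolution), here is a small Python sketch. It does not use or reproduce mkCOInr, which is a set of Perl scripts; the record layout (a `lineage` dict per sequence), the rank list and the function name `customize` are all hypothetical.

```python
# Ranks ordered from coarsest to finest. ASSUMPTION: simplified rank list.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def customize(records, keep_taxa=None, drop_taxa=None, min_rank="genus"):
    """Filter records by taxon lists and by minimum taxonomic resolution.

    Each record is a dict with a 'lineage' dict mapping rank -> taxon name
    (hypothetical layout, not the COInr file format).
    """
    keep_taxa = set(keep_taxa or [])
    drop_taxa = set(drop_taxa or [])
    min_index = RANKS.index(min_rank)
    selected = []
    for record in records:
        lineage = record["lineage"]
        names = {name for name in lineage.values() if name}
        if keep_taxa and not (names & keep_taxa):
            continue  # not in any of the requested taxa
        if names & drop_taxa:
            continue  # belongs to an excluded taxon
        # Finest rank that actually carries a name for this record.
        finest = max(
            (RANKS.index(rank) for rank, name in lineage.items()
             if name and rank in RANKS),
            default=-1,
        )
        if finest < min_index:
            continue  # identified too coarsely
        selected.append(record)
    return selected
```

Calling `customize(records, keep_taxa=["Actinopterygii"], min_rank="species")` would keep only ray-finned fish records identified to species level.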