sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

add docs about what the taxonomy input file format is #1790

Open ctb opened 2 years ago

ctb commented 2 years ago

right now it's not really specified anywhere 😆

ident,superkingdom,phylum,class,order,family,genus,species
GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_000578955.1,d__Bacteria,p__Firmicutes,c__Bacilli,o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,s__Staphylococcus aureus

ctb commented 2 years ago

What I wrote over in the genome-grist docs:

This file is a CSV with at least 8 columns, with the headers ident, superkingdom, phylum, class, order, family, genus, and species.
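
For anyone who wants to sanity-check a file against that description, here is a minimal sketch using only the Python standard library. It is not sourmash's own loader: the names `REQUIRED_HEADERS` and `load_taxonomy` are hypothetical, and it assumes a comma-separated file with a header row, where extra columns are allowed and ignored.

```python
import csv
import io

# Headers taken from the description above. Extra columns are tolerated,
# which is an assumption based on the "at least 8 columns" wording.
REQUIRED_HEADERS = ["ident", "superkingdom", "phylum", "class",
                    "order", "family", "genus", "species"]


def load_taxonomy(fileobj):
    """Map each ident to a tuple of its ranks (hypothetical helper, not sourmash API)."""
    reader = csv.DictReader(fileobj)
    present = reader.fieldnames or []
    missing = [name for name in REQUIRED_HEADERS if name not in present]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    lineages = {}
    for row in reader:
        # Keep the ranks in superkingdom -> species order.
        lineages[row["ident"]] = tuple(row[rank] for rank in REQUIRED_HEADERS[1:])
    return lineages


# The two example rows from the top of this issue.
example = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,"
    "o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,"
    "s__Escherichia flexneri\n"
    "GCF_000578955.1,d__Bacteria,p__Firmicutes,c__Bacilli,"
    "o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,"
    "s__Staphylococcus aureus\n"
)
lineages = load_taxonomy(example)
```

With a real file, the same helper could be called as `load_taxonomy(open(path, newline=""))`. A file that lacks any of the eight headers raises `ValueError` instead of silently producing partial lineages.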

ctb commented 2 years ago

see also taxonkit info https://github.com/sourmash-bio/sourmash/issues/1851

ctb commented 6 months ago

this topic, plus discussions about NCBI, GTDB, LINs, and ICTV taxonomies, could usefully go in either https://sourmash.readthedocs.io/en/latest/databases-advanced.html or https://sourmash.readthedocs.io/en/latest/sourmash-internals.html#taxonomy-and-assigning-lineages. There's already some stuff in the latter location!