sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

tracking gtdb databases #802

Closed ctb closed 3 years ago

ctb commented 4 years ago

I've recently been building sourmash taxonomy databases based on the GTDB alternate taxonomy (cf. https://github.com/Ecogenomics/GtdbTk, etc.). GTDB has about 25,000 genomes, most of which are in GenBank but some of which are not (see the UBA entries, in particular).

This issue is to track that work.

Conveniently, the GTDB releases (I'm using release89) contain all of the genomes in fastani/database/. So you can just calculate signatures for those!

The taxonomy TSV under taxonomy/ is in almost the right format for us. I've written some parsing scripts to update things; see especially update-gtdb-taxonomy.py in the sourmash_databases GTDB PR.
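To illustrate the kind of conversion involved, here is a minimal sketch (not the actual update-gtdb-taxonomy.py script) that turns GTDB-style taxonomy rows into a lineage CSV. It assumes each TSV row is an accession followed by a semicolon-separated lineage with `d__`, `p__`, ... `s__` prefixes, and that the target spreadsheet uses the column names shown below; the function names and the sample accession in the usage note are hypothetical.

```python
import csv
import io

# Assumed mapping from GTDB rank prefixes to spreadsheet column names.
RANK_PREFIXES = [
    ("d__", "superkingdom"),
    ("p__", "phylum"),
    ("c__", "class"),
    ("o__", "order"),
    ("f__", "family"),
    ("g__", "genus"),
    ("s__", "species"),
]


def parse_gtdb_lineage(lineage):
    """Split 'd__X;p__Y;...' into a dict of rank name -> taxon name."""
    ranks = {}
    for field in lineage.strip().split(";"):
        field = field.strip()
        for prefix, rank in RANK_PREFIXES:
            if field.startswith(prefix):
                ranks[rank] = field[len(prefix):]
                break
    return ranks


def convert_taxonomy(tsv_lines, out_fp):
    """Write one CSV row per TSV row; return the number of rows written."""
    writer = csv.writer(out_fp)
    writer.writerow(["accession"] + [rank for _, rank in RANK_PREFIXES])
    n = 0
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        accession, lineage = row[0], row[1]
        ranks = parse_gtdb_lineage(lineage)
        writer.writerow(
            [accession] + [ranks.get(rank, "") for _, rank in RANK_PREFIXES]
        )
        n += 1
    return n
```

With a real file, this could be called as `convert_taxonomy(open("taxonomy/taxonomy.tsv"), open("lineages.csv", "w", newline=""))`, where the output filename is just a placeholder.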

I'm putting the complete GTDB LCA databases here: https://osf.io/wxf9z/.

ctb commented 4 years ago

Interesting: it turns out that the taxonomy/taxonomy.tsv file is in a kind of standard format that is used by Greengenes and others. So it's unlikely to change, and we can more or less rely on the format.

ctb commented 4 years ago

SBTs now available on OSF.

ctb commented 4 years ago

I got curious about sourmash genome classification in comparison to GTDB-Tk, so I spent some time on it yesterday and today.

Using the sourmash k=21 LCA database (https://osf.io/2jp9n/), I analyzed 336 randomly chosen GenBank genomes with both sourmash and GTDB-Tk, and wrote a script to compare the classifications.

A summary of the results is:

My test isn’t perfect, because I didn’t do any phylogenetic balancing (so there were a lot of Salmonella genomes :), but based on the way ANI and MinHash behave, the results make sense. I’ll do some more tests as I process GenBank-wide classification results.

So, it seems like sourmash lca classify is a decent prefilter for GTDB-Tk, and that if you need to classify a lot of genomes quickly, you could start with sourmash and then focus in on the ones that aren’t classified at the species level.

The commands are:

sourmash compute -k 21,31,51 --scaled=1000 *.fna.gz
sourmash lca classify --query *.sig --db gtdb-release89-k31.lca.json.gz > lca-classify-all-k31.txt
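The prefilter idea can be sketched as a small triage step over the classify output. This is only a sketch: it assumes the output is a CSV with an `ID` column and a `species` column that is empty when no species-level call was made, and the function name is hypothetical.

```python
import csv
import io


def split_by_species(csv_fp):
    """Return (resolved, unresolved) lists of query IDs.

    A query counts as resolved when its 'species' column is non-empty;
    everything else is a candidate for a slower follow-up classification.
    """
    resolved, unresolved = [], []
    for row in csv.DictReader(csv_fp):
        if (row.get("species") or "").strip():
            resolved.append(row["ID"])
        else:
            unresolved.append(row["ID"])
    return resolved, unresolved
```

For example, `split_by_species(open("lca-classify-all-k31.txt", newline=""))` would give the list of genomes that might still be worth running through GTDB-Tk.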

The script to do the comparison is here:

https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/compare-lca-gtdbtk.py
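One simple way to compare two classifications of the same genome is to find the deepest rank at which they still agree. The sketch below illustrates that idea only and is not taken from the linked script; it assumes each classification is a dict keyed by the rank names listed, and the function name is hypothetical.

```python
# Ranks ordered from broadest to most specific (assumed column names).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_shared_rank(lineage_a, lineage_b):
    """Return the most specific rank at which both lineages agree, or None.

    Comparison stops at the first rank that is missing in either lineage
    or where the two names differ.
    """
    shared = None
    for rank in RANKS:
        name_a = lineage_a.get(rank, "")
        name_b = lineage_b.get(rank, "")
        if not name_a or not name_b or name_a != name_b:
            break
        shared = rank
    return shared
```

Tallying the returned rank across all genomes gives a quick picture of how often two tools agree down to species versus only down to genus or above.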

ctb commented 3 years ago

Closed by #1581.