tracking gtdb databases

ctb commented 4 years ago

I've recently been working on building sourmash taxonomy databases based on the GTDB alternate taxonomy (cf. https://github.com/Ecogenomics/GtdbTk, etc.). This has about 25,000 genomes, most of which are in GenBank but some of which are not (see the UBA entries, in particular).

This issue is to track that work.

Conveniently, the GTDB releases (I'm using release89) contain all of the genomes in fastani/database/. So you can just calculate signatures for those!

The taxonomy TSV under taxonomy/ is in almost the right format for us. I've written some parsing scripts to update things; see especially update-gtdb-taxonomy.py in the sourmash_databases GTDB PR.
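For illustration, here is a minimal sketch of the kind of conversion involved. It is not the update-gtdb-taxonomy.py script itself. It assumes each TSV line is `accession<TAB>d__...;p__...;c__...;o__...;f__...;g__...;s__...` and that the target is a CSV with an `ident` column followed by one column per rank; the function names and the example accession are made up.

```python
import csv
import io

# Assumed target column names, one per GTDB rank prefix (an assumption,
# not taken from the actual parsing scripts).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def parse_gtdb_lineage(lineage, strip_prefix=False):
    """Split 'd__Bacteria;p__...;s__...' into exactly len(RANKS) names.

    Ranks missing from a truncated lineage are returned as empty strings.
    """
    names = []
    for prefix, part in zip(PREFIXES, lineage.strip().split(";")):
        part = part.strip()
        if not part.startswith(prefix):
            raise ValueError(f"expected prefix {prefix!r}, got {part!r}")
        names.append(part[len(prefix):] if strip_prefix else part)
    # Pad so every row has the same number of columns.
    names += [""] * (len(RANKS) - len(names))
    return names


def gtdb_tsv_to_csv(tsv_lines, out, strip_prefix=False):
    """Write one CSV row per taxonomy line; return the number of rows written."""
    writer = csv.writer(out)
    writer.writerow(["ident"] + RANKS)
    count = 0
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ident, lineage = line.split("\t", 1)
        writer.writerow([ident] + parse_gtdb_lineage(lineage, strip_prefix))
        count += 1
    return count


if __name__ == "__main__":
    # Made-up accession and lineage, purely for demonstration.
    example = ["EXAMPLE_0001\td__Bacteria;p__P;c__C;o__O;f__F;g__G;s__G sp\n"]
    buf = io.StringIO()
    gtdb_tsv_to_csv(example, buf)
    print(buf.getvalue())
```

Any accession cleanup (for example, handling database-specific prefixes on the identifiers) is deliberately left out, since the exact rules depend on the release.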

I'm putting the complete GTDB LCA databases here, https://osf.io/wxf9z/.

ctb commented 4 years ago

Interesting, turns out that the taxonomy/taxonomy.tsv file is a kind of standard format that is used by GreenGenes and others. So it's unlikely to change, and we can more or less rely on the format.

ctb commented 4 years ago

SBTs now available on OSF.

ctb commented 4 years ago

I got curious about how sourmash genome classification compares to GTDB-Tk, so I spent some time on it yesterday and today.

Using the sourmash k=21 LCA database (https://osf.io/2jp9n/), I analyzed 336 randomly chosen GenBank genomes with both sourmash and GTDB-Tk, and wrote a script to compare the classifications.

A summary of the results is:

If sourmash lca classify yields a species-level designation, it is identical to what GTDB-Tk produces. (Note that sourmash lca classify gives genus or species level classifications for about 95% of the 420,000 genomes in GenBank with the GTDB taxonomy at k=21.)
At k=21, sourmash lca classify will never disagree with GTDB-Tk. At worst it will fail to classify out to the species, genus, etc. level.
sourmash lca classify takes about 35 seconds to classify 336 signatures, vs. 2 hours for GTDB-Tk with 8 threads. (Calculating the signatures takes another 2 minutes.)
sourmash lca classify needs about 4 GB of RAM.
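
The categories in that summary (identical, less specific, disagree) can be made concrete with a small sketch. This is not the comparison script linked in this issue; it is a hypothetical illustration that assumes each classification is a list of names ordered from superkingdom down, with empty strings for unassigned ranks.

```python
def trim(lineage):
    """Drop trailing empty ranks, e.g. ['d__B', 'p__P', '', ''] -> ['d__B', 'p__P']."""
    names = list(lineage)
    while names and not names[-1]:
        names.pop()
    return names


def compare_lineages(sourmash_lineage, gtdbtk_lineage):
    """Return how the sourmash lineage relates to the GTDB-Tk lineage.

    'unclassified'  : sourmash assigned no ranks at all
    'disagree'      : the two differ at some rank both of them assigned
    'identical'     : same names, same depth
    'less_specific' : sourmash agrees but stops at a higher rank
    'more_specific' : sourmash agrees and goes deeper than GTDB-Tk
    """
    a = trim(sourmash_lineage)
    b = trim(gtdbtk_lineage)
    if not a:
        return "unclassified"
    shared = min(len(a), len(b))
    if a[:shared] != b[:shared]:
        return "disagree"
    if len(a) == len(b):
        return "identical"
    return "less_specific" if len(a) < len(b) else "more_specific"


if __name__ == "__main__":
    full = ["d__Bacteria", "p__P", "c__C", "o__O", "f__F", "g__G", "s__G sp"]
    genus_only = full[:6] + [""]
    print(compare_lineages(full, full))        # identical
    print(compare_lineages(genus_only, full))  # less_specific
```

In these terms, the result above is that the comparison produced no "disagree" cases at k=21, only "identical" and "less_specific" ones.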

My test isn’t perfect, because I didn’t balance the genomes phylogenetically (so there were a lot of Salmonella genomes :), but based on the way ANI and MinHash behave, the results make sense. I’ll do some more tests as I process GenBank-wide classification results.

So, it seems like sourmash lca classify is a decent prefilter for GTDB-Tk, and that if you need to classify a lot of genomes quickly, you could start with sourmash and then focus in on the ones that aren’t classified at the species level.
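
A rough sketch of that triage step, under assumptions: it supposes the classification output is a CSV with an `ID` column and a `species` column that is empty when no species-level call was made. Check the header of your own output file before relying on those column names; the function name is made up.

```python
import csv
import io


def needs_followup(classify_csv, id_column="ID", species_column="species"):
    """Yield the IDs of rows that have no species-level classification.

    `classify_csv` is any iterable of CSV lines (for example an open file).
    The column names are assumptions and can be overridden.
    """
    for row in csv.DictReader(classify_csv):
        if not (row.get(species_column) or "").strip():
            yield row[id_column]


if __name__ == "__main__":
    # Made-up rows, purely for demonstration.
    sample = io.StringIO(
        "ID,status,genus,species\n"
        "genomeA,found,g__G,s__G sp\n"
        "genomeB,found,g__G,\n"
        "genomeC,nomatch,,\n"
    )
    for ident in needs_followup(sample):
        print(ident)
```

The genomes listed this way would be the ones to pass on to GTDB-Tk.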

The commands are:

sourmash compute -k 21,31,51 --scaled=1000 *.fna.gz
sourmash lca classify --query *.sig --db gtdb-release89-k31.lca.json.gz > lca-classify-all-k31.txt

The script to do the comparison is here:

https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/compare-lca-gtdbtk.py

ctb commented 3 years ago

closed by #1581.

sourmash-bio / sourmash

tracking gtdb databases #802