sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

utility: kraken-style LCA classification on banded signatures #302

Closed ctb closed 6 years ago

ctb commented 7 years ago

See https://gist.github.com/ctb/9deb40a68108256ab4fd84c6b8e92e01 for implementation.

cc @brooksph @taylorreiter

ctb commented 7 years ago

Some more thoughts:

ctb commented 7 years ago

Current output:

% kraken/classify.py genbank/nodes.dmp genbank/names.dmp genbank-k31.lca sig-to-classify.sig
loading taxonomic nodes from: genbank/nodes.dmp
loading taxonomic names from: genbank/names.dmp
loading k-mer DB from: genbank-k31.lca
loading signatures from 1 signature files
loaded 1 signatures total at k=31
downsampling to scaled value: 10000
found LCA classifications for 411 of 411 hashes
percent  below  at node  code  taxid    name
100.0    411    0        -     131567   cellular organisms
100.0    411    23       -     2        Bacteria
94.4     388    15       -     1783272  Terrabacteria group
90.75    373    0        P     201174   Actinobacteria
90.75    373    7        C     1760     Actinobacteria
89.05    366    8        O     85007    Corynebacteriales
87.1     358    0        F     1762     Mycobacteriaceae
87.1     358    25       G     1763     Mycobacterium
81.02    333    329      -     77643    Mycobacterium tuberculosis complex
0.97     4      4        S     1773     Mycobacterium tuberculosis
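
To make the columns easier to read: "at node" appears to be the number of hashes whose LCA is exactly that taxon, "below" is that count plus everything assigned to its descendants, and "percent" is "below" over the total number of classified hashes. The sketch below reproduces the numbers above under that reading. It is not the code from the gist; `rollup` is a hypothetical helper, and the parent links are assumed from the row order of the report.

```python
from collections import Counter


def rollup(at_node, parents):
    """Return {taxid: number of hashes assigned at or below that taxid}."""
    below = Counter()
    for taxid, count in at_node.items():
        node = taxid
        # Walk from the assigned node up to the root, crediting each ancestor.
        while node is not None:
            below[node] += count
            node = parents.get(node)
    return below


# Lineage in report order, root first (assumed to be a single chain).
lineage = [131567, 2, 1783272, 201174, 1760, 85007, 1762, 1763, 77643, 1773]
parents = {child: parent for parent, child in zip(lineage, lineage[1:])}

# "at node" column from the report above.
at_node = {131567: 0, 2: 23, 1783272: 15, 201174: 0, 1760: 7,
           85007: 8, 1762: 0, 1763: 25, 77643: 329, 1773: 4}

below = rollup(at_node, parents)
total = sum(at_node.values())

for taxid in lineage:
    percent = round(100.0 * below[taxid] / total, 2)
    print(percent, below[taxid], at_node[taxid], taxid, sep="\t")
```

Read this way, 329 of the 411 hashes stop at the Mycobacterium tuberculosis complex and only 4 resolve to the species itself.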

LCA files for genbank-k21, k31, and k51 are available on the OSF under sourmash-lca-mark1.

They were built with the command

python gist/extract.py genbank*.csv.gz nodes.dmp --traverse-directory .sbt.genbank-k21/ --savename genbank-k21.lca -k 21 --scaled 10000

and each took approximately 2 to 3 hours and 6 GB of RAM to build on the MSU HPCC; for example, one job reported:

lca.o46877939:    resources_used.walltime = 02:51:58
lca.o46877939:    resources_used.vmem = 6294516kb
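
For context on what such a database holds: in a kraken-style scheme, each hash is typically mapped to the lowest common ancestor (LCA) of every genome that contains it. The following is a minimal sketch of that idea, not the code in `extract.py`. The function names, the hash values, and taxid 999999 (a made-up sibling species) are all assumptions for illustration.

```python
def lineage_to_root(taxid, parents):
    """Return [taxid, parent, grandparent, ...] up to the root."""
    path = [taxid]
    while taxid in parents:
        taxid = parents[taxid]
        path.append(taxid)
    return path


def find_lca(a, b, parents):
    """Return the lowest taxid that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(lineage_to_root(a, parents))
    for node in lineage_to_root(b, parents):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxids share no common ancestor")


def build_lca_db(hashes_by_taxid, parents):
    """Map each hash to the LCA of all taxids whose signatures contain it."""
    db = {}
    for taxid, hashes in hashes_by_taxid.items():
        for h in hashes:
            db[h] = find_lca(db[h], taxid, parents) if h in db else taxid
    return db


# Toy taxonomy: child -> parent. 999999 is hypothetical, placed under 1763.
parents = {1773: 77643, 77643: 1763, 1763: 1762, 999999: 1763}

# Toy signatures: taxid -> set of (made-up) hash values.
hashes_by_taxid = {1773: {10, 20, 30}, 999999: {30, 40}}

db = build_lca_db(hashes_by_taxid, parents)
print(db)
```

Hash 30 occurs in both toy genomes, so it is pushed up to their shared genus (1763); the other hashes stay at the species they came from.
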
ctb commented 7 years ago

More TODO items:

ctb commented 7 years ago

Random other thoughts:

ctb commented 6 years ago

Fixed in #367.