ctb closed this 6 years ago
Some more thoughts:
sourmash lca
Current output:
% kraken/classify.py genbank/nodes.dmp genbank/names.dmp genbank-k31.lca sig-to-classify.sig
loading taxonomic nodes from: genbank/nodes.dmp
loading taxonomic names from: genbank/names.dmp
loading k-mer DB from: genbank-k31.lca
loading signatures from 1 signature files
loaded 1 signatures total at k=31
downsampling to scaled value: 10000
found LCA classifications for 411 of 411 hashes
percent  below  at node  code  taxid    name
100.0    411    0        -     131567   cellular organisms
100.0    411    23       -     2        Bacteria
94.4     388    15       -     1783272  Terrabacteria group
90.75    373    0        P     201174   Actinobacteria
90.75    373    7        C     1760     Actinobacteria
89.05    366    8        O     85007    Corynebacteriales
87.1     358    0        F     1762     Mycobacteriaceae
87.1     358    25       G     1763     Mycobacterium
81.02    333    329      -     77643    Mycobacterium tuberculosis complex
0.97     4      4        S     1773     Mycobacterium tuberculosis
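To make the columns concrete: "at node" reads as the number of hashes whose LCA is exactly that taxon, "below" as that count plus everything assigned to its descendants, and "percent" as "below" divided by the total number of classified hashes. Here is a minimal sketch that recomputes the last two from the "at node" column. It assumes the chain of taxa shown in the report can be treated as direct parent links (the real nodes.dmp may contain intermediate nodes), and the dict layout and the function name `counts_below` are illustrative, not taken from classify.py.

```python
from collections import defaultdict

# taxid -> parent taxid, following the chain shown in the report.
# Assumption: each listed taxon is treated as the direct parent of the next.
parents = {
    2: 131567, 1783272: 2, 201174: 1783272, 1760: 201174,
    85007: 1760, 1762: 85007, 1763: 1762, 77643: 1763, 1773: 77643,
}

# "at node" column: hashes whose LCA is exactly this taxid.
at_node = {
    131567: 0, 2: 23, 1783272: 15, 201174: 0, 1760: 7,
    85007: 8, 1762: 0, 1763: 25, 77643: 329, 1773: 4,
}


def counts_below(parents, at_node):
    """Add each node's own count to itself and to every ancestor."""
    below = defaultdict(int)
    for taxid, n in at_node.items():
        node = taxid
        while node is not None:
            below[node] += n
            node = parents.get(node)  # None once we step past the root
    return dict(below)


below = counts_below(parents, at_node)
total = sum(at_node.values())

for taxid in at_node:
    percent = 100.0 * below[taxid] / total
    print(f"{percent:.2f} {below[taxid]} {at_node[taxid]} {taxid}")
```

Under these assumptions the recomputed "below" and "percent" values match the report, which is a quick sanity check that the counts are cumulative from the leaves up.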
LCA files for genbank-k21, k31, and k51 are available on the OSF under sourmash-lca-mark1.
They were built with the command
python gist/extract.py genbank*.csv.gz nodes.dmp --traverse-directory .sbt.genbank-k21/ --savename genbank-k21.lca -k 21 --scaled 10000
and each took approximately 2 to 3 hours and 6 GB of RAM to build on the MSU HPCC, e.g.:
lca.o46877939: resources_used.walltime = 02:51:58
lca.o46877939: resources_used.vmem = 6294516kb
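For readers unfamiliar with what such a database holds, here is a rough sketch of the Kraken-style idea: each hash is mapped to the lowest common ancestor of all taxa whose signatures contain it. This is not the gist's extract.py. The function names, the in-memory dict layout, the toy hash values, and taxid 999999 (a made-up second taxon under Mycobacterium) are all assumptions for illustration.

```python
def lineage(taxid, parents):
    """Return the path from the root down to taxid."""
    path = []
    while taxid is not None:
        path.append(taxid)
        taxid = parents.get(taxid)  # None once we step past the root
    return path[::-1]


def lca(taxids, parents):
    """Lowest common ancestor: the deepest node shared by all lineages."""
    lineages = [lineage(t, parents) for t in taxids]
    common = None
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break  # lineages diverge here
        common = nodes[0]
    return common


def build_lca_db(hashes_by_taxid, parents):
    """Map each hash to the LCA of every taxid whose signature contains it."""
    hash_to_taxids = {}
    for taxid, hashes in hashes_by_taxid.items():
        for h in hashes:
            hash_to_taxids.setdefault(h, set()).add(taxid)
    return {h: lca(taxids, parents) for h, taxids in hash_to_taxids.items()}


# Toy taxonomy: 1763 (Mycobacterium) is used as the root here.
# 999999 is a hypothetical taxid, not a real NCBI entry.
parents = {77643: 1763, 1773: 77643, 999999: 1763}

# Toy hash sets; real values would be the scaled MinHash hashes of each genome.
hashes_by_taxid = {1773: {10, 20}, 999999: {20, 30}}

db = build_lca_db(hashes_by_taxid, parents)
print(db)
```

In this toy case the hash shared by both taxa is pushed up to their common genus, while hashes unique to one taxon stay at that taxon. That is the behavior behind a report where most hashes land on an internal node rather than on a species.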
More TODO items:
sourmash lca search and sourmash lca index (?), w/tests etc.
Random other thoughts:
Fixed in #367.
See https://gist.github.com/ctb/9deb40a68108256ab4fd84c6b8e92e01 for implementation.
cc @brooksph @taylorreiter