sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

support exporting lineage spreadsheet from LCA databases #1080

Open ctb opened 4 years ago

ctb commented 4 years ago

would be nice to be able to export the key bits of the lineage spreadsheet (identifier names and their lineages) from an LCA database.

ctb commented 4 years ago

script export-lineage-csv-from-lca.py --

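#! /usr/bin/env python
"""Export a lineage CSV (name + taxonomic ranks) from a sourmash LCA database."""
import sys
import csv

from sourmash.lca import lca_utils
from sourmash.lca import lca_db

# only the database object is needed; the other two return values are ignored
db, _, _ = lca_db.load_single_database(sys.argv[1])
w = csv.writer(sys.stdout)
w.writerow(["name"] + list(lca_utils.taxlist()))

for ident in db.ident_to_name:
    name = db.ident_to_name[ident]
    idx = db.ident_to_idx[ident]
    lid = db.idx_to_lid.get(idx)
    if lid is None:
        # no lineage recorded for this identifier; skip it rather than
        # reusing the lineage left over from the previous iteration
        continue

    lineage = db.lid_to_lineage[lid]
    outlist = [name] + lca_utils.zip_lineage(lineage)
    w.writerow(outlist)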

use like so:

./export-lineage-csv-from-lca.py podar-ref.lca.json.gz > out.csv
sourmash lca index out.csv xyz.lca.json podar-ref.lca.json.gz

which yields

== This is sourmash version 3.3.2.dev9+g462bc387. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Building LCA database with ksize=31 scaled=10000 moltype=DNA.
examining spreadsheet headers...
** assuming column 'name' is identifiers in spreadsheet
64 distinct identities in spreadsheet out of 64 rows.
64 distinct lineages in spreadsheet out of 64 rows.
... loaded 1 signatures.
loaded 19993 hashes at ksize=31 scaled=10000
64 assigned lineages out of 64 distinct lineages in spreadsheet.
64 identifiers used out of 64 distinct identifiers in spreadsheet.
saving to LCA DB: xyz.lca.json

👍
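
For a quick sanity check of the exported CSV before running `lca index`, something like the sketch below could count the rows, distinct identifiers, and distinct lineages, which should roughly correspond to the "distinct identities" and "distinct lineages" lines in the log above. It uses only the standard library; `summarize_lineage_csv` is a hypothetical helper, not part of sourmash, and it assumes the identifier column is called `name` and that every other column is a taxonomic rank. sourmash's own counting may differ in details.

```python
import csv
import io


def summarize_lineage_csv(fileobj, ident_column="name"):
    """Return (rows, distinct identifiers, distinct lineages) for a lineage CSV.

    Hypothetical helper for illustration only; not part of sourmash.
    """
    reader = csv.DictReader(fileobj)
    if reader.fieldnames is None or ident_column not in reader.fieldnames:
        raise ValueError(f"no {ident_column!r} column in header")

    # Assumption: every column other than the identifier is a taxonomic rank.
    rank_columns = [c for c in reader.fieldnames if c != ident_column]

    n_rows = 0
    idents = set()
    lineages = set()
    for row in reader:
        n_rows += 1
        idents.add(row[ident_column])
        lineages.add(tuple(row[c] for c in rank_columns))
    return n_rows, len(idents), len(lineages)


# Small made-up example: three rows, two of which share a lineage.
example = io.StringIO(
    "name,superkingdom,phylum\n"
    "genomeA,Bacteria,Proteobacteria\n"
    "genomeB,Bacteria,Firmicutes\n"
    "genomeC,Bacteria,Firmicutes\n"
)
print(summarize_lineage_csv(example))  # prints (3, 3, 2)
```

To run it on the real export, open `out.csv` with `open("out.csv", newline="")` and pass the file object in place of `example`.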

ctb commented 4 years ago

(note that the above command builds the new xyz.lca.json using the signatures loaded directly from the old podar-ref.lca.json.gz.)

I think #969 is relevant to this issue...