Subsetting LCA database by taxonomic lineage

ccbaumler commented 1 year ago

Hello, as of #2753 I have been able to build my own LCA databases from my existing sourmash signature databases.

I was interested in subsetting the current LCA database by taxonomy (i.e. Filtering out everything above/below a specific rank). Any advice is appreciated! I think I will begin by modifying #1784 for multiple lineages...

bluegenes commented 1 year ago

Hi @ccbaumler

I think you want to add an argument for the rank (or ranks) with LCA hashes you want to keep, e.g. species.

then, here's where you would want to modify:

# find all relevant lineage IDs (lid)
    matching_lids = set()
    for lineage, lid in db.lineage_to_lid.items():
        for (rank, name) in lineage:
            # could do flexible matching here
            if name == args.lineage_str:
                print(f"found: {name} at rank {rank}")
                full_lineage = ";".join(zip_lineage(lineage))
                print(f"full lineage lid={lid}: {full_lineage}")
                matching_lids.add(lid)
                break

change name == args.lineage_str to: lineage[-1].rank == args.keep_rank

and remove print(f"found: {name} at rank {rank}")

ccbaumler commented 1 year ago

Thank you for the advice, @bluegenes ! I tried lineage[-1].rank == args.keep_rank but since my last value in lineage is strain it gave an error. I had to update some of the API indices as well. I used print(dir(db)) to see all the attributes of the database created and went with those values.

So far, I have written a script that sort of works. The problem is that it either creates a single sig file for each hash or a database of single signatures for each hash. Any advice on how to return a concatenated list of hashs that share a name would be very appreciated.

Here is the script:

#! /usr/bin/env python
import sys
import argparse
import sourmash

from sourmash.lca.lca_utils import zip_lineage
from sourmash import MinHash, SourmashSignature

def _create_minhash(db):
    is_protein = False
    is_hp = False
    is_dayhoff = False
    if db.moltype == 'protein':
        is_protein = True
    elif db.moltype == 'hp':
        is_hp = True
    elif db.moltype == 'dayhoff':
        is_dayhoff = True

    minhash = MinHash(n=0, ksize=db.ksize, scaled=db.scaled,
                      is_protein=is_protein, hp=is_hp, dayhoff=is_dayhoff)
    return minhash

def main():
    p = argparse.ArgumentParser()
    p.add_argument('lca_db')
    p.add_argument('keep_rank', help="Keep the rank specified (i.e. Species)")
    p.add_argument('-o', '--output', help="output signature containing rank-specific hash values", required=True)
    p.add_argument('-s', '--save-names', help="save list of matching signature names from that lineage to this file")
    args = p.parse_args()

    db, ksize, scaled = sourmash.lca.lca_db.load_single_database(args.lca_db)
    print(f"loaded db k={ksize} scaled={scaled} from '{args.lca_db}'")

    # find all relevant lineage IDs (lid)
    matching_lids = set()
    for lineage, lid in db._lineage_to_lid.items():
        for (rank) in lineage:
            if any(taxon.rank == args.keep_rank for taxon in lineage):
                full_lineage = ";".join(zip_lineage(lineage))
                print(f"full lineage lid={lid}: {full_lineage}")
                matching_lids.add(lid)
                break

    # find all matching idx
    matching_idx = set()
    for idx, lid in db._idx_to_lid.items():
        if lid in matching_lids:
            matching_idx.add(lid)

    print(f"found {len(matching_idx)} matching genomes.")

    individual_signatures = []  # Create an empty list to store individual signatures

    for hashval, idx_set in db._hashval_to_idx.items():
        idx_set = set(idx_set)
        if idx_set.issubset(matching_idx):
            # Find the name associated with the matching idx
            name = None
            for ident, idx in db._ident_to_idx.items():
                if idx in idx_set:
                    name = db._ident_to_name[ident]
                    break

            if name:
                #full_lineage = ";".join(zip_lineage(lineage))  # Get full lineage as a string
                mh = _create_minhash(db)
                mh.add_hash(hashval)  # Use add_hash method to add a single hash value
                ss = SourmashSignature(mh, name=f"{name}, Kept Rank: {args.keep_rank}")  # Use the name
                individual_signatures.append(ss)

#            ss = SourmashSignature(mh, name=f"{name}, {full_lineage}, Lineage: {args.keep_rank}")

    with open(args.output, "wt") as fp:
        sourmash.save_signatures(individual_signatures, fp)

    print(f"Saved signatures to '{args.output}'")

    if args.save_names:
        print(f"saving matching signature names to '{args.save_names}'")
        with open(args.save_names, "wt") as fp:
            for ident, idx in db._ident_to_idx.items():
                if idx in matching_idx:
                    name = db._ident_to_name[ident]
                    fp.write(f"{name}\n")
    else:
        print(f"did not save matching names; see --save-names")

if __name__ == '__main__':
    sys.exit(main())

ctb commented 1 year ago

Any advice on how to return a concatenated list of hashs that share a name would be very appreciated.

didn't read the code, but -

hashes_by_name = collections.defaultdict(set)

for hashval in many_hashes:
    names = get_names_for_hashval(hashval, ...)
    for name in names:
        hashes_by_name[name].add(hashval)

for name, hashvals in hashes_by_name.items():
   ...

is a pretty typical pattern for me ;)

ctb commented 1 year ago

this is sort of a side note, but: if you're running into memory or disk-loading problems, there's a sqlite-based implementation of LCA databases that should be much faster to load and query. Happy to discuss where your current approach is challenging you.

sourmash-bio / sourmash

Subsetting LCA database by taxonomic lineage #2754