Open ccbaumler opened 1 year ago
Hi @ccbaumler
I think you want to add an argument for the rank (or ranks) with LCA hashes you want to keep, e.g. species
.
then, here's where you would want to modify:
# find all relevant lineage IDs (lid)
matching_lids = set()
for lineage, lid in db.lineage_to_lid.items():
for (rank, name) in lineage:
# could do flexible matching here
if name == args.lineage_str:
print(f"found: {name} at rank {rank}")
full_lineage = ";".join(zip_lineage(lineage))
print(f"full lineage lid={lid}: {full_lineage}")
matching_lids.add(lid)
break
change name == args.lineage_str
to:
lineage[-1].rank == args.keep_rank
and remove print(f"found: {name} at rank {rank}")
Thank you for the advice, @bluegenes ! I tried lineage[-1].rank == args.keep_rank
but since my last value in lineage is strain it gave an error. I had to update some of the API indices as well. I used print(dir(db))
to see all the attributes of the database created and went with those values.
So far, I have written a script that sort of works. The problem is that it either creates a single sig file for each hash or a database of single signatures for each hash. Any advice on how to return a concatenated list of hashs that share a name
would be very appreciated.
Here is the script:
#! /usr/bin/env python
import sys
import argparse
import sourmash
from sourmash.lca.lca_utils import zip_lineage
from sourmash import MinHash, SourmashSignature
def _create_minhash(db):
is_protein = False
is_hp = False
is_dayhoff = False
if db.moltype == 'protein':
is_protein = True
elif db.moltype == 'hp':
is_hp = True
elif db.moltype == 'dayhoff':
is_dayhoff = True
minhash = MinHash(n=0, ksize=db.ksize, scaled=db.scaled,
is_protein=is_protein, hp=is_hp, dayhoff=is_dayhoff)
return minhash
def main():
p = argparse.ArgumentParser()
p.add_argument('lca_db')
p.add_argument('keep_rank', help="Keep the rank specified (i.e. Species)")
p.add_argument('-o', '--output', help="output signature containing rank-specific hash values", required=True)
p.add_argument('-s', '--save-names', help="save list of matching signature names from that lineage to this file")
args = p.parse_args()
db, ksize, scaled = sourmash.lca.lca_db.load_single_database(args.lca_db)
print(f"loaded db k={ksize} scaled={scaled} from '{args.lca_db}'")
# find all relevant lineage IDs (lid)
matching_lids = set()
for lineage, lid in db._lineage_to_lid.items():
for (rank) in lineage:
if any(taxon.rank == args.keep_rank for taxon in lineage):
full_lineage = ";".join(zip_lineage(lineage))
print(f"full lineage lid={lid}: {full_lineage}")
matching_lids.add(lid)
break
# find all matching idx
matching_idx = set()
for idx, lid in db._idx_to_lid.items():
if lid in matching_lids:
matching_idx.add(lid)
print(f"found {len(matching_idx)} matching genomes.")
individual_signatures = [] # Create an empty list to store individual signatures
for hashval, idx_set in db._hashval_to_idx.items():
idx_set = set(idx_set)
if idx_set.issubset(matching_idx):
# Find the name associated with the matching idx
name = None
for ident, idx in db._ident_to_idx.items():
if idx in idx_set:
name = db._ident_to_name[ident]
break
if name:
#full_lineage = ";".join(zip_lineage(lineage)) # Get full lineage as a string
mh = _create_minhash(db)
mh.add_hash(hashval) # Use add_hash method to add a single hash value
ss = SourmashSignature(mh, name=f"{name}, Kept Rank: {args.keep_rank}") # Use the name
individual_signatures.append(ss)
# ss = SourmashSignature(mh, name=f"{name}, {full_lineage}, Lineage: {args.keep_rank}")
with open(args.output, "wt") as fp:
sourmash.save_signatures(individual_signatures, fp)
print(f"Saved signatures to '{args.output}'")
if args.save_names:
print(f"saving matching signature names to '{args.save_names}'")
with open(args.save_names, "wt") as fp:
for ident, idx in db._ident_to_idx.items():
if idx in matching_idx:
name = db._ident_to_name[ident]
fp.write(f"{name}\n")
else:
print(f"did not save matching names; see --save-names")
if __name__ == '__main__':
sys.exit(main())
Any advice on how to return a concatenated list of hashs that share a
name
would be very appreciated.
didn't read the code, but -
hashes_by_name = collections.defaultdict(set)
for hashval in many_hashes:
names = get_names_for_hashval(hashval, ...)
for name in names:
hashes_by_name[name].add(hashval)
for name, hashvals in hashes_by_name.items():
...
is a pretty typical pattern for me ;)
this is sort of a side note, but: if you're running into memory or disk-loading problems, there's a sqlite-based implementation of LCA databases that should be much faster to load and query. Happy to discuss where your current approach is challenging you.
Hello, as of #2753 I have been able to build my own LCA databases from my existing sourmash signature databases.
I was interested in subsetting the current LCA database by taxonomy (i.e. Filtering out everything above/below a specific rank). Any advice is appreciated! I think I will begin by modifying #1784 for multiple lineages...