Open mr-eyes opened 1 year ago
Hi @mr-eyes,
For two files, list.txt
GCF_014075335.1
GCF_900013275.1
and list2.txt
:
GCF_014075335.1
GCF_900013275.1
GCF_900698905.1
you can run:
./acc-to-lca.py gtdb-rs202.taxonomy.v2.csv list.txt list2.txt
and you should get:
loaded 258406 tax entries.
keys are of the form: ['GCF_014075335.1', 'GCF_002310555.1']
---
loaded 2 accs from 'list.txt'
For 'list.txt', LCA is: rank=species, name=s__Escherichia flexneri
loaded 3 accs from 'list2.txt'
For 'list2.txt', LCA is: rank=family, name=f__Enterobacteriaceae
@bluegenes and I can help you wrangle the output into exactly the right format if you wish!
loading a CSV with 260k lines in it can take a whie; note you can run
sourmash tax prepare -t gtdb-rs202.taxonomy.v2.csv -o gtdb-rs202.taxonomy.v2.sqldb -F sql --keep-ident
and then use
./acc-to-lca.py gtdb-rs202.taxonomy.v2.csv list.txt list2.txt
instead and it will be much faster.
file acc-to-lca.py
:
#! /usr/bin/env python
import sys
import argparse
import sourmash
from sourmash.tax.tax_utils import MultiLineageDB, RankLineageInfo
def main():
p = argparse.ArgumentParser()
p.add_argument('taxdb')
p.add_argument('accs', nargs='+')
args = p.parse_args()
tax = MultiLineageDB.load([args.taxdb])
print(f"loaded {len(tax)} tax entries.", file=sys.stderr)
print(f"keys are of the form: {list(tax.keys())[:2]}", file=sys.stderr)
print('---')
for accfile in args.accs:
acc_set = set([ x.strip() for x in open(accfile, 'rt') ])
print(f"loaded {len(acc_set)} accs from '{accfile}'", file=sys.stdout)
lineage_list = []
lca = None
for acc in acc_set:
if acc not in tax:
print(f"ERROR: cannot find '{acc}' in tax db. IGNORING.", file=sys.stderr)
continue
rank = RankLineageInfo(lineage=tax[acc])
lineage_list.append(rank)
# for all but first one, find LCA
if lca is None:
lca = rank
else:
lca = lca.find_lca(rank)
print(f"For '{accfile}', LCA is: rank={lca.filled_lineage[-1].rank}, name={lca.filled_lineage[-1].name}")
if __name__ == '__main__':
main()
(ok fixed the premature comment @mr-eyes @bluegenes )
I have a list of accession numbers. Can I use sourmash to generate the equivalent LCAs?