sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
479 stars 79 forks source link

GTDB accessions to LCAs? #2589

Open mr-eyes opened 1 year ago

mr-eyes commented 1 year ago

I have a list of accession numbers. Can I use sourmash to generate the equivalent LCAs?

ctb commented 1 year ago

Hi @mr-eyes,

For two files, list.txt

GCF_014075335.1
GCF_900013275.1

and list2.txt:

GCF_014075335.1
GCF_900013275.1
GCF_900698905.1

you can run:

./acc-to-lca.py gtdb-rs202.taxonomy.v2.csv list.txt list2.txt

and you should get:

loaded 258406 tax entries.
keys are of the form: ['GCF_014075335.1', 'GCF_002310555.1']
---
loaded 2 accs from 'list.txt'
For 'list.txt', LCA is: rank=species, name=s__Escherichia flexneri
loaded 3 accs from 'list2.txt'
For 'list2.txt', LCA is: rank=family, name=f__Enterobacteriaceae

@bluegenes and I can help you wrangle the output into exactly the right format if you wish!

speeding things up

loading a CSV with 260k lines in it can take a whie; note you can run

sourmash tax prepare -t gtdb-rs202.taxonomy.v2.csv -o gtdb-rs202.taxonomy.v2.sqldb -F sql --keep-ident

and then use

./acc-to-lca.py gtdb-rs202.taxonomy.v2.csv list.txt list2.txt

instead and it will be much faster.

the code

file acc-to-lca.py:

#! /usr/bin/env python
import sys
import argparse
import sourmash
from sourmash.tax.tax_utils import MultiLineageDB, RankLineageInfo

def main():
    p = argparse.ArgumentParser()
    p.add_argument('taxdb')
    p.add_argument('accs', nargs='+')
    args = p.parse_args()

    tax = MultiLineageDB.load([args.taxdb])
    print(f"loaded {len(tax)} tax entries.", file=sys.stderr)
    print(f"keys are of the form: {list(tax.keys())[:2]}", file=sys.stderr)
    print('---')

    for accfile in args.accs:
        acc_set = set([ x.strip() for x in open(accfile, 'rt') ])
        print(f"loaded {len(acc_set)} accs from '{accfile}'", file=sys.stdout)

        lineage_list = []
        lca = None
        for acc in acc_set:
            if acc not in tax:
                print(f"ERROR: cannot find '{acc}' in tax db. IGNORING.", file=sys.stderr)
                continue

            rank = RankLineageInfo(lineage=tax[acc])
            lineage_list.append(rank)

            # for all but first one, find LCA
            if lca is None:
                lca = rank
            else:
                lca = lca.find_lca(rank)

        print(f"For '{accfile}', LCA is: rank={lca.filled_lineage[-1].rank}, name={lca.filled_lineage[-1].name}")

if __name__ == '__main__':
    main()
ctb commented 1 year ago

(ok fixed the premature comment @mr-eyes @bluegenes )