How do I build a lineage spreadsheet for GTDB taxonomy and signatures?

phiweger commented 4 years ago

I swear I tried to find out how to do this before filing an issue :)

The GTDB taxonomy has this format:

GB_GCA_002849265.1      d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
RS_GCF_000637235.1      d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus

I have sketched all genomes in a folder "sigs" and they are named like:

GB_GCA_002849265.1_genomic.fna.gz.sig
RS_GCF_000637235.1_genomic.fna.gz.sig

They were NOT "named" using sourmash compute --name-from-first ....

What I don't understand is what should be the accession in the LCA taxonomy file so that it is linked to the signatures in my "sigs" directory?

I tried this script which returns an LCA taxonomy file like

accession,filename,superkingdom,phylum,class,order,family,genus,species
GCF_900045825,GCF_900045825.1_genomic.fna.gz,d__Bacteria,p__Firmicutes,c__Bacilli,o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,s__Staphylococcus aureus

and then run

sourmash lca index -C 3 --require-taxonomy --split-identifiers lca_tax.csv out --traverse-directory sigs

but

examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet
145904 distinct identities in spreadsheet out of 145904 rows.
24706 distinct lineages in spreadsheet out of 145904 rows.
ERROR: no hash values found - are there any signatures?

Any help is greatly appreciated.

ctb commented 4 years ago

There's no good way to do this within sourmash. Suggestions welcome! I used snakemake --

https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L123

ctb commented 4 years ago

(the key is this line, which looks up the name in a spreadsheet - https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L129)

phiweger commented 4 years ago

ah, and sourmash lca index matches signatures based on the "accession" column in the csv and the "name" field in the signature?

ctb commented 4 years ago

On Sat, Jul 11, 2020 at 06:05:43AM -0700, Adrian Viehweger wrote:

ah, and sourmash lca index matches signatures based on the "accession" column in the csv and the "name" field in the signature?

yep!

...I should probably document that, eh?

phiweger commented 4 years ago

well, only if you want to :)

phiweger commented 4 years ago

I ended up just working around the problem -- I don't see a very clean API to match names from the sourmash compute step and the csv required in sourmash lca index. Probably some kind of helper script?

Code for future reference:

import json
from glob import glob
import os

from tqdm import tqdm

'''
1. Add taxonomy identifier to signature. In our case we'll use the one in
the GTDB taxonomy csv file of the form GB_GCA_002163135.1, ie just the
filename prefix.
'''

# !mkdir sigs_renamed
files = glob('sigs/*.sig')
for path in tqdm(files):
    with open(path, 'r') as file:
        sig = json.load(file)
        fn = sig[0]['filename']
        name = os.path.basename(fn).replace('_genomic.fna.gz', '')
        sig[0]['name'] = name
        with open('sigs_renamed/' + os.path.basename(path), 'w+') as out:
            json.dump(sig, out)

'''
2. Now reformat the taxonomy provided by the GTDB.
'''

with open('taxonomy.csv', 'r') as file, \
     open('taxonomy.refmt.csv', 'w+') as out:

    # header
    out.write('accession,superkingdom,phylum,class,order,family,genus,species\n')
    for line in file:
        name, tax = line.strip().split('\t')
        tax = [i.split('__')[1] for i in tax.split(';')]
        out.write(f'{name},' + ','.join(tax) + '\n')

ctb commented 4 years ago

thanks for the code!

Other thoughts -

could support md5sum and/or filename for matching.

please leave this open and I'll work on it as I'm inspired :)

ctb commented 2 years ago

this is now built into our various database release processes; see https://github.com/sourmash-bio/sourmash/issues/2015 for a guide, and Taylor put together an R script to do it, here: https://github.com/sourmash-bio/sourmash/issues/1941#issuecomment-1095291194. Still no standalone script that does it and is in version control tho :).

sourmash-bio / sourmash

How do I build a lineage spreadsheet for GTDB taxonomy and signatures? #1095