Open phiweger opened 4 years ago
There's no good way to do this within sourmash. Suggestions welcome! I used snakemake --
https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L123
(the key is this line, which looks up the name in a spreadsheet - https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L129)
ah, and sourmash lca index matches signatures based on the "accession" column in the csv and the "name" field in the signature?
On Sat, Jul 11, 2020 at 06:05:43AM -0700, Adrian Viehweger wrote:
> ah, and sourmash lca index matches signatures based on the "accession" column in the csv and the "name" field in the signature?
yep!
...I should probably document that, eh?
well, only if you want to :)
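For anyone landing here later, a minimal sketch of how one might check that pairing before running sourmash lca index. It assumes each .sig file is a JSON list of records with a top-level "name" field (as in the code further down this thread), that the CSV has an "accession" header, and that matching is plain string equality, which may not cover every sourmash option. The function names are hypothetical, not part of sourmash.

```python
import csv
import json


def signature_names(sig_paths):
    """Collect the 'name' value of every record in the given .sig JSON files."""
    names = set()
    for path in sig_paths:
        with open(path) as handle:
            # Assumption: a .sig file holds a JSON list of signature records.
            for record in json.load(handle):
                # A record without a name contributes None, which will then
                # show up as unmatched.
                names.add(record.get('name'))
    return names


def unmatched_names(sig_paths, taxonomy_csv):
    """Return signature names with no identical entry in the 'accession' column."""
    with open(taxonomy_csv, newline='') as handle:
        accessions = {row['accession'] for row in csv.DictReader(handle)}
    return signature_names(sig_paths) - accessions
```

An empty result means every signature name has a row in the taxonomy CSV; anything else lists the names that would need renaming first.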
I ended up just working around the problem -- I don't see a very clean API to match names from the sourmash compute step and the csv required in sourmash lca index. Probably some kind of helper script?
Code for future reference:
import json
from glob import glob
import os
from tqdm import tqdm

'''
1. Add taxonomy identifier to signature. In our case we'll use the one in
the GTDB taxonomy csv file of the form GB_GCA_002163135.1, ie just the
filename prefix.
'''
# Create the output directory (same as `!mkdir sigs_renamed` in a notebook)
os.makedirs('sigs_renamed', exist_ok=True)
files = glob('sigs/*.sig')
for path in tqdm(files):
    with open(path, 'r') as file:
        sig = json.load(file)
    # Derive the name from the original genome filename stored in the signature
    fn = sig[0]['filename']
    name = os.path.basename(fn).replace('_genomic.fna.gz', '')
    sig[0]['name'] = name
    with open('sigs_renamed/' + os.path.basename(path), 'w+') as out:
        json.dump(sig, out)

'''
2. Now reformat the taxonomy provided by the GTDB.
'''
with open('taxonomy.csv', 'r') as file, \
        open('taxonomy.refmt.csv', 'w+') as out:
    # header
    out.write('accession,superkingdom,phylum,class,order,family,genus,species\n')
    for line in file:
        # Each input line is "<accession>\t<rank-prefixed lineage>"
        name, tax = line.strip().split('\t')
        # Drop the rank prefix (e.g. "d__") from each lineage entry
        tax = [i.split('__')[1] for i in tax.split(';')]
        out.write(f'{name},' + ','.join(tax) + '\n')
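To make step 2 concrete, here is a small sketch of the same transformation on a single line. The accession and lineage below are invented for illustration and are not taken from the real GTDB file; gtdb_line_to_row is a hypothetical helper name. It uses the csv module for writing, which differs from the plain string join above only in that it would quote a field containing a comma.

```python
import csv
import io

RANKS = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']


def gtdb_line_to_row(line):
    """Split one tab-separated GTDB taxonomy line into [accession, rank values...]."""
    accession, lineage = line.rstrip('\n').split('\t')
    # Each lineage entry looks like "<rank letter>__<taxon>"; keep the taxon only.
    taxa = [part.split('__', 1)[1] for part in lineage.split(';')]
    if len(taxa) != len(RANKS):
        raise ValueError(f'expected {len(RANKS)} ranks, got {len(taxa)}: {line!r}')
    return [accession] + taxa


# Illustrative input only (hypothetical accession, made-up pairing with a lineage).
example = ('GB_GCA_000000000.1\t'
           'd__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;'
           'o__Enterobacterales;f__Enterobacteriaceae;'
           'g__Escherichia;s__Escherichia coli')

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['accession'] + RANKS)
writer.writerow(gtdb_line_to_row(example))
print(buffer.getvalue())
```

The length check is an addition of this sketch: it fails loudly on a lineage that does not have exactly seven ranks instead of writing a misaligned row.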
thanks for the code!
Other thoughts -
please leave this open and I'll work on it as I'm inspired :)
this is now built into our various database release processes; see https://github.com/sourmash-bio/sourmash/issues/2015 for a guide, and Taylor put together an R script to do it, here: https://github.com/sourmash-bio/sourmash/issues/1941#issuecomment-1095291194. Still no standalone script that does it and is in version control tho :).
I swear I tried to find out how to do this before filing an issue :)
The GTDB taxonomy has this format:
I have sketched all genomes in a folder "sigs" and they are named like:
They were NOT "named" using sourmash compute --name-from-first ...
What I don't understand is what should be the accession in the LCA taxonomy file so that it is linked to the signatures in my "sigs" directory?
I tried this script which returns an LCA taxonomy file like
and then run
but
Any help is greatly appreciated.