Closed brooksph closed 6 years ago
Good idea, but where should we get this information from?
Something like kraken-build --download-taxonomy --db $DBNAME
can be a start (from http://ccb.jhu.edu/software/kraken/MANUAL.html)
Downloads data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, might need to figure out what they do with it.
@luizirber suggested piggybacking off of the Karen tree information. http://ccb.jhu.edu/software/kraken/MANUAL.html
E.g. taxonomy/nodes.dmp + taxonomy/names.dmp: Taxonomy names might have good information? I can look more deeply.
My slightly more Luddite and straight forward idea was that we might just snag the tax id information from the metadata and use that as a means of pulling out phylogeny. You can get the full lineage from this file which is apparently updated regularly. So... that might be made into a sort of db to go alongside the sbt?
Info on taxon id db: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt Actual download: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.Z
Alright, so as I understand them from reading over the files quickly...
nodes.dmp: for any given tax ID it provides the information to identify the 'parental' tax-id (in addition to other things e.g. rank level for the node id) names.dmp: gives the actual name for the tax ID.
So, if we take the example of Prochlorococcus marinus (Tax ID 1041938). The look up in names.dmp provides the different strains of Prochlorococcus. Looking up in nodes.dmp provides the genus id 1218 which if you look up in names.dmp is Prochlorococcus.
One problem/difficulty that immediately jumps out at me is that there are multiple different entries for a given tax id sometimes. Because why choose just one???
1213 | "Prochlorales" Lewin 1977 | | synonym | 1213 | Prochloraceae | | scientific name | 1213 | Prochloraceae (ex Lewin 1977) Florenzano et al. 1986 | | authority | 1213 | Prochloraceae Lewin 1977 | | authority | 1213 | Prochlorales | | synonym | 1213 | Prochlorales (ex Lewin 1977) Florenzano et al. 1986 emend. Burger-Wiersma et al. 1989 | | authorit| 1213 | Prochlorococcaceae | | synonym | 1213 | prochlorophytes | | common name |
Asked Zach Foster about what kind of info and output would be most useful for packages like metacoder.
talking to @meren to find out what they need for anvi'o and to see if we can swipe some code, too.
I've been using something like this to build my own database here (just isolated this function from a larger context):
import sys
import requests
from xml.etree import ElementTree
taxonomy = ['superkingdom', 'phylum', 'order', 'class', 'family', 'genus', 'species']
def get_lineage(tax_id):
response = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=%s' % str(tax_id))
tree = ElementTree.fromstring(response.content)
lineage = dict(zip(taxonomy, [None] * len(taxonomy)))
try:
taxon = tree.findall('Taxon')[0]
except:
return lineage
for e in taxon.findall('LineageEx')[0].findall('Taxon'):
rank = e.find('Rank').text
if rank in lineage:
lineage[rank] = e.find('ScientificName').text.replace(' ', '_')
return lineage
if __name__ == '__main__':
print(get_lineage(sys.argv[1]))
Here:
$ python get_lineage.py 665953
{'superkingdom': 'Bacteria', 'family': 'Bacteroidaceae', 'class': 'Bacteroidia', 'phylum': 'Bacteroidetes', 'species': 'Bacteroides_eggerthii', 'genus': 'Bacteroides', 'order': 'Bacteroidales'}
This is just to play, of course. There are much better ways to deal with this issue in general. But requires lots of communication and logistics :/
Hi @ctb, thanks for asking.
what kind of info and output would be most useful for packages like metacoder.
Currently, metacoder is best at parsing any text-based format that is 1) sequential and 2) encodable by regular expressions. By these terms I mean:
For examples of this, I suggest looking at our parsing tutorial: https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html
However, these examples are for parsing FASTA headers, not tables of information.
I dont know what kind of information you are looking to plot or how you want your output files to be formatted yet.
If they are tables, then there is a function called parse_taxonomy_table
, that is just a wrapper for extract_taxonomy
, that will get the taxonomic information for each row from one column and save the rest of the data in the parsed object and associate it with the taxonomy. That column could have nearly any format, but the easiest to parameterize would be something like:
Fungi;Ascomycota;Leotiomycetes;Helotiales
...
or
k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales
...
I expect that I will be rewriting parse_taxonomy_table
in the near future to be simpler and more flexible. When I do that I plan on adding convenience wrappers for parse_taxonomy_table
that parse common formats. I could easily add one for sourmash once I get an idea of what output format you decide on. I expect that whatever format you choose, metacoder will be able to parse it, but some formats will be easier for users to figure out the correct parameters than others. The simplest would be a table with one column that has classifications like Fungi;Ascomycota;Leotiomycetes;Helotiales
and any number of other columns with any kind of information you like. The data from the other columns will be available in the parsed object for plotting/filtering/whatever.
Hope that helps. Feel free to ask any questions if something does not make sense. I could say more if I had a better idea of the information you want to output.
RE krona & NCBI taxonomy numbers:
I think supporting krona output is not necessarily the way to go. Perhaps instead outputting NCBI taxonomy number would be a better use of time, so that it's easy for people to feed their results in to a lot of downstream analyses, not just krona.
I like the Krona visualization, but it is glitchy with large-ish datasets (RNAseq, 4M reads).
Originally posted #174
ok given that we want to do this for 100,000 entries I'm going to mash up @halexand and @meren suggestions.
On MSU HPC under /mnt/home/ctb/research/tax
, I have code to produce a leveldb of the access2taxid files (acc2taxid-to-leveldb.py
) and under the subdir taxdump
the following script seems to roughly work --
% python -i parse.py nodes.dmp names.dmp
('Shewanella baltica OS223', '', 'scientific name')
('Shewanella baltica', '', 'scientific name')
('Shewanella', '', 'scientific name')
('Shewanellaceae', '', 'scientific name')
('Alteromonadales', '', 'scientific name')
('Gammaproteobacteria', '', 'scientific name')
('Proteobacteria', '', 'scientific name')
('Bacteria', 'Bacteria <prokaryotes>', 'scientific name')
('cellular organisms', '', 'scientific name')
My current plan is to:
finish producing the leveldb for the WGS and Genbank Nucleotide databases, so that we can quickly look up the taxid for a given acc;
make some form of db dump (maybe another leveldb?) for the much smaller nodes.dmp
and names.dmp
files;
extract all the accessions from our current genbank signatures;
write a CSV file that lets us take accession -> lineage like @meren's code above;
produce an ad hoc script that may suit our present needs for going from gather/search -> lineage;
...and then we'll at least have the code to annotate the signatures when we figure out what the right metadata approach is. Probably this will end up in the utils
submodule as a set of extra commands...
@taylorreiter what's the input format for krona again? thx!
ok, I can now get output like this:
CP001056,935198,Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium botulinum
CP001072,512562,Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae;Helicobacter;Helicobacter pylori
CP000805,455434,Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Treponema;Treponema pallidum
CP000934,498211,Bacteria;Proteobacteria;Gammaproteobacteria;Cellvibrionales;Cellvibrionaceae;Cellvibrio;Cellvibrio japonicus
CU469464,37692,Bacteria;Tenericutes;Mollicutes;Acholeplasmatales;Acholeplasmataceae;Candidatus Phytoplasma;Candidatus Phytoplasma mali
CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae
yay w00t.
too much detail but my memory is weak:
# in genbank SBT directory: extract accession info from genbank
find . \! -name \*internal\* -exec grep -h \"name\" {} \; > ../genbank-names.txt
# in /mnt/home/ctb/research/tax: connect accession info to taxid using REALLY BIG files
python acc2taxid.py ../genbank-k31-acc.txt nucl_gb.accession2taxid nucl_wgs.accession2taxid.gz
# in /mnt/home/ctb/research/tax/taxdump: output CSV file with accession, taxid, and ';'-separated lineage
python do-the-csv.py nodes.dmp names.dmp ../../genbank-k31-acc.txt.taxid
CSV file genbank-genomes-accession+lineage-20170529.csv.gz
containing 93,209 accessions + taxid + lineage in this format:
CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae
now available at:
Note that approximately 40 accessions in our genbank SBTs do not seem to appear in either nucl_wgs.accession2taxid
or nucl_gb.accession2taxid
and so are not represented in this CSV file. shrug
We'll work on utilities to extract and correlate this info with sourmash output in useful ways. In the meantime special thx to @meren for providing some useful code!
A few more thoughts on taxonomy:
in the metadata block for signatures generated from NCBI, we should definitely have a way to include accession(s) of included sequences and taxid(s) of included sequences. I do now think we should also provide an 'ncbi-tax-lineage' entry that encodes something standard like Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
because this really is unlikely to change and is quite useful to have ready to hand.
We've made quite a bit of headway on this issue, most recently with the sourmash LCA stuff (see gist and ncbi_taxdump_utils specifically. Just wanted to link that in here :)
metacodeR viz working: http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-part-1.html
a tutorial for metacodeR analysis on gather output: https://hackmd.io/EYMwTAhgHGwMwFoCmcAsqGoAwGMwIggE4ATBKHCVANiSxMpziA==
sourmash lca
and https://github.com/dib-lab/2018-ncbi-lineages now handle various bits of this. sourmash lca
is the way forward for taxonomy in sourmash, at least for now.
@halexand suggested adding taxonomic breakdowns to sbt signatures and subsequently, gather output. This could facilitate downstream analysis at different taxonomic levels. For example, kraken does
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli