sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
471 stars 80 forks source link

Metadata - Add taxonomic breakdown to sourmash output #195

Closed brooksph closed 6 years ago

brooksph commented 7 years ago

@halexand suggested adding taxonomic breakdowns to sbt signatures and subsequently, gather output. This could facilitate downstream analysis at different taxonomic levels. For example, kraken does d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli

ctb commented 7 years ago

Good idea, but where should we get this information from?

luizirber commented 7 years ago

Something like kraken-build --download-taxonomy --db $DBNAME can be a start (from http://ccb.jhu.edu/software/kraken/MANUAL.html)

Downloads data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, might need to figure out what they do with it.

halexand commented 7 years ago

@luizirber suggested piggybacking off of the Karen tree information. http://ccb.jhu.edu/software/kraken/MANUAL.html

E.g. taxonomy/nodes.dmp + taxonomy/names.dmp: Taxonomy names might have good information? I can look more deeply.

My slightly more Luddite and straight forward idea was that we might just snag the tax id information from the metadata and use that as a means of pulling out phylogeny. You can get the full lineage from this file which is apparently updated regularly. So... that might be made into a sort of db to go alongside the sbt?

Info on taxon id db: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt Actual download: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.Z

halexand commented 7 years ago

Alright, so as I understand them from reading over the files quickly...

nodes.dmp: for any given tax ID it provides the information to identify the 'parental' tax-id (in addition to other things e.g. rank level for the node id) names.dmp: gives the actual name for the tax ID.

So, if we take the example of Prochlorococcus marinus (Tax ID 1041938). The look up in names.dmp provides the different strains of Prochlorococcus. Looking up in nodes.dmp provides the genus id 1218 which if you look up in names.dmp is Prochlorococcus.

One problem/difficulty that immediately jumps out at me is that there are multiple different entries for a given tax id sometimes. Because why choose just one???

1213 | "Prochlorales" Lewin 1977 | | synonym | 1213 | Prochloraceae | | scientific name | 1213 | Prochloraceae (ex Lewin 1977) Florenzano et al. 1986 | | authority | 1213 | Prochloraceae Lewin 1977 | | authority | 1213 | Prochlorales | | synonym | 1213 | Prochlorales (ex Lewin 1977) Florenzano et al. 1986 emend. Burger-Wiersma et al. 1989 | | authorit| 1213 | Prochlorococcaceae | | synonym | 1213 | prochlorophytes | | common name |

ctb commented 7 years ago

Asked Zach Foster about what kind of info and output would be most useful for packages like metacoder.

ctb commented 7 years ago

talking to @meren to find out what they need for anvi'o and to see if we can swipe some code, too.

meren commented 7 years ago

I've been using something like this to build my own database here (just isolated this function from a larger context):

import sys
import requests

from xml.etree import ElementTree

taxonomy = ['superkingdom', 'phylum', 'order', 'class', 'family', 'genus', 'species']

def get_lineage(tax_id):
    response = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=%s' % str(tax_id))

    tree = ElementTree.fromstring(response.content)

    lineage = dict(zip(taxonomy, [None] * len(taxonomy)))

    try:
        taxon = tree.findall('Taxon')[0]
    except:
        return lineage

    for e in taxon.findall('LineageEx')[0].findall('Taxon'):
        rank = e.find('Rank').text
        if rank in lineage:
            lineage[rank] = e.find('ScientificName').text.replace(' ', '_')

    return lineage

if __name__ == '__main__':
    print(get_lineage(sys.argv[1]))

Here:

$ python get_lineage.py 665953
{'superkingdom': 'Bacteria', 'family': 'Bacteroidaceae', 'class': 'Bacteroidia', 'phylum': 'Bacteroidetes', 'species': 'Bacteroides_eggerthii', 'genus': 'Bacteroides', 'order': 'Bacteroidales'}

This is just to play, of course. There are much better ways to deal with this issue in general. But requires lots of communication and logistics :/

zachary-foster commented 7 years ago

Hi @ctb, thanks for asking.

what kind of info and output would be most useful for packages like metacoder.

Currently, metacoder is best at parsing any text-based format that is 1) sequential and 2) encodable by regular expressions. By these terms I mean:

  1. Repeating units in the same format representing taxonomic classifications and associated information.
  2. All pieces of information that should be preserved during parsing can be defined by capture groups (parentheses) in regular expressions.

For examples of this, I suggest looking at our parsing tutorial: https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html

However, these examples are for parsing FASTA headers, not tables of information. I dont know what kind of information you are looking to plot or how you want your output files to be formatted yet. If they are tables, then there is a function called parse_taxonomy_table, that is just a wrapper for extract_taxonomy, that will get the taxonomic information for each row from one column and save the rest of the data in the parsed object and associate it with the taxonomy. That column could have nearly any format, but the easiest to parameterize would be something like:

Fungi;Ascomycota;Leotiomycetes;Helotiales... or

k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales...

I expect that I will be rewriting parse_taxonomy_table in the near future to be simpler and more flexible. When I do that I plan on adding convenience wrappers for parse_taxonomy_table that parse common formats. I could easily add one for sourmash once I get an idea of what output format you decide on. I expect that whatever format you choose, metacoder will be able to parse it, but some formats will be easier for users to figure out the correct parameters than others. The simplest would be a table with one column that has classifications like Fungi;Ascomycota;Leotiomycetes;Helotiales and any number of other columns with any kind of information you like. The data from the other columns will be available in the parsed object for plotting/filtering/whatever.

Hope that helps. Feel free to ask any questions if something does not make sense. I could say more if I had a better idea of the information you want to output.

taylorreiter commented 7 years ago

RE krona & NCBI taxonomy numbers:

I think supporting krona output is not necessarily the way to go. Perhaps instead outputting NCBI taxonomy number would be a better use of time, so that it's easy for people to feed their results in to a lot of downstream analyses, not just krona.

I like the Krona visualization, but it is glitchy with large-ish datasets (RNAseq, 4M reads).

Originally posted #174

ctb commented 7 years ago

ok given that we want to do this for 100,000 entries I'm going to mash up @halexand and @meren suggestions.

On MSU HPC under /mnt/home/ctb/research/tax, I have code to produce a leveldb of the access2taxid files (acc2taxid-to-leveldb.py) and under the subdir taxdump the following script seems to roughly work --

 % python -i parse.py nodes.dmp names.dmp
('Shewanella baltica OS223', '', 'scientific name')
('Shewanella baltica', '', 'scientific name')
('Shewanella', '', 'scientific name')
('Shewanellaceae', '', 'scientific name')
('Alteromonadales', '', 'scientific name')
('Gammaproteobacteria', '', 'scientific name')
('Proteobacteria', '', 'scientific name')
('Bacteria', 'Bacteria <prokaryotes>', 'scientific name')
('cellular organisms', '', 'scientific name')

My current plan is to:

...and then we'll at least have the code to annotate the signatures when we figure out what the right metadata approach is. Probably this will end up in the utils submodule as a set of extra commands...

@taylorreiter what's the input format for krona again? thx!

ctb commented 7 years ago

ok, I can now get output like this:

CP001056,935198,Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium botulinum
CP001072,512562,Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae;Helicobacter;Helicobacter pylori
CP000805,455434,Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Treponema;Treponema pallidum
CP000934,498211,Bacteria;Proteobacteria;Gammaproteobacteria;Cellvibrionales;Cellvibrionaceae;Cellvibrio;Cellvibrio japonicus
CU469464,37692,Bacteria;Tenericutes;Mollicutes;Acholeplasmatales;Acholeplasmataceae;Candidatus Phytoplasma;Candidatus Phytoplasma mali
CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae

yay w00t.

ctb commented 7 years ago

too much detail but my memory is weak:

# in genbank SBT directory: extract accession info from genbank
find . \! -name \*internal\* -exec grep -h \"name\" {} \; > ../genbank-names.txt

# in /mnt/home/ctb/research/tax: connect accession info to taxid using REALLY BIG files
python acc2taxid.py ../genbank-k31-acc.txt nucl_gb.accession2taxid nucl_wgs.accession2taxid.gz 

# in /mnt/home/ctb/research/tax/taxdump: output CSV file with accession, taxid, and ';'-separated lineage
python do-the-csv.py nodes.dmp names.dmp ../../genbank-k31-acc.txt.taxid
ctb commented 7 years ago

CSV file genbank-genomes-accession+lineage-20170529.csv.gz containing 93,209 accessions + taxid + lineage in this format:

CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae

now available at:

https://osf.io/28ymr/

Note that approximately 40 accessions in our genbank SBTs do not seem to appear in either nucl_wgs.accession2taxid or nucl_gb.accession2taxid and so are not represented in this CSV file. shrug

ctb commented 7 years ago

We'll work on utilities to extract and correlate this info with sourmash output in useful ways. In the meantime special thx to @meren for providing some useful code!

ctb commented 7 years ago

code in https://github.com/dib-lab/2017-ncbi-taxdump

ctb commented 7 years ago

A few more thoughts on taxonomy:

in the metadata block for signatures generated from NCBI, we should definitely have a way to include accession(s) of included sequences and taxid(s) of included sequences. I do now think we should also provide an 'ncbi-tax-lineage' entry that encodes something standard like Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans because this really is unlikely to change and is quite useful to have ready to hand.

ctb commented 7 years ago

We've made quite a bit of headway on this issue, most recently with the sourmash LCA stuff (see gist and ncbi_taxdump_utils specifically. Just wanted to link that in here :)

ctb commented 7 years ago

metacodeR viz working: http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-part-1.html

ctb commented 7 years ago

a tutorial for metacodeR analysis on gather output: https://hackmd.io/EYMwTAhgHGwMwFoCmcAsqGoAwGMwIggE4ATBKHCVANiSxMpziA==

ctb commented 6 years ago

sourmash lca and https://github.com/dib-lab/2018-ncbi-lineages now handle various bits of this. sourmash lca is the way forward for taxonomy in sourmash, at least for now.