joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

How to import custom taxonomy table from MG7? #221

Closed pablopareja closed 10 years ago

pablopareja commented 11 years ago

Hi,

I would like to use phyloseq to visualize some metagenomics results from our project MG7 ( https://github.com/pablopareja/MG7 ). I've been looking at the biom format ( http://biom-format.org/documentation/format_versions/biom-1.0.html ) and I would be interested in using the sparse OTU table option. However, I'm not sure how or where I should provide the taxonomy information. Is it possible to somehow provide your own taxonomy tree as an input to the tool?

Thanks!

joey711 commented 11 years ago

@pablopareja, thanks for the feedback.

The phyloseq tutorial on importing data includes examples of importing the major types of biom format files, including sparse OTU tables that have taxonomy information included. Taxonomy is typically included as OTU metadata, as opposed to sample metadata. You may want to consult the biom format definition for more specific details.
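For reference, a minimal sketch of that kind of import. It assumes phyloseq is installed and uses the example sparse biom file that ships with the package; the variable names are only illustrative, and the default taxonomy parser is used here, which may not suit every file.

```r
library(phyloseq)

# Example sparse biom file bundled with phyloseq; it carries taxonomy
# as OTU (observation) metadata.
biom_file <- system.file("extdata", "rich_sparse_otu_table.biom",
                         package = "phyloseq")

# import_biom() reads the counts plus any OTU/sample metadata it finds.
ps <- import_biom(biom_file)

# The taxonomy read from the OTU metadata ends up in the taxonomy table.
head(tax_table(ps))
```

If the taxonomy strings in a given file follow a different convention, `import_biom()` also accepts a `parseFunction` argument for custom parsing.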

As for your final question, I'm not sure what you mean about "taxonomy tree". In phyloseq you can import a taxonomy table and/or a phylogenetic tree, but these are substantially different data structures, usually with quite different resolutions.

I will keep this issue open if you respond with a clarification.

pablopareja commented 11 years ago

Hi Joey,

Thanks for your reply. Regarding the "taxonomy tree": what I would ultimately like to do is use NCBI's taxonomy. Would that be possible?

pablopareja commented 11 years ago

Any ideas?

joey711 commented 11 years ago

Can you be more specific? You still have not clarified if you mean a phylogenetic tree, or a taxonomy table. These are very different. If you have a specific reference dataset in mind, can you post the link here?

pablopareja commented 11 years ago

This is the dataset I mean:

http://www.ncbi.nlm.nih.gov/taxonomy

joey711 commented 11 years ago

That link is not to a particular dataset. It has many links. Do you have a particular data file in mind? For example, on that same page is an FTP link to files that might be useful:

ftp://ftp.ncbi.nih.gov/pub/taxonomy

pablopareja commented 11 years ago

Yeah, that's the one I meant. You can download the files included in this tar-gz file:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

joey711 commented 11 years ago

It looks like they have included a tree-like structure with nodes and edges in a non-standard delimited format. The readme.txt includes a format definition, but that doesn't help me answer your question. What are you trying to do, exactly? Do you have a collection of organisms that you'd like to investigate, or do you really want to use this entire database? And what is it that you want to use phyloseq to do?

Tables of taxonomy, where each row is a separate organism in your dataset, are easy to import and are recognized by phyloseq. The central data type in phyloseq, however, is a contingency table with organisms (OTUs) and samples as the dimensions. The taxonomy table describes the taxonomic classification for each OTU. The quantitative evolutionary relationship between OTUs is represented by a phylogenetic tree, which in phyloseq/R is the "phylo" class defined in the ape package. phyloseq includes tools to help you import both taxonomy tables and trees.

I could describe more, but it still isn't clear which of these supported data classes you ultimately want. Can you clarify?
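To make the distinction between these data classes concrete, here is a minimal sketch with made-up data. It assumes the phyloseq and ape packages are installed; the OTU names, counts, taxonomy labels, and the random tree are all invented for illustration.

```r
library(phyloseq)
library(ape)

# Contingency table: OTUs as rows, samples as columns (toy counts).
otumat <- matrix(c(10, 0, 3,
                    5, 2, 8),
                 nrow = 3,
                 dimnames = list(paste0("OTU", 1:3), c("S1", "S2")))

# Taxonomy table: one row per OTU, one column per rank (toy labels).
taxmat <- matrix(c("Bacteria", "Bacteria", "Archaea",
                   "Firmicutes", "Proteobacteria", "Euryarchaeota"),
                 nrow = 3,
                 dimnames = list(paste0("OTU", 1:3), c("Kingdom", "Phylum")))

OTU <- otu_table(otumat, taxa_are_rows = TRUE)
TAX <- tax_table(taxmat)

# Phylogenetic tree: a "phylo" object whose tip labels match the OTU names.
# A random tree stands in for a real phylogeny here.
tree <- rtree(3, rooted = TRUE, tip.label = rownames(otumat))

# The three components are tied together by the shared OTU names.
ps <- phyloseq(OTU, TAX, tree)
ps
```

The taxonomy table is categorical (a label per rank), while the tree carries branch lengths between OTUs, which is why the two are not interchangeable.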

joey711 commented 11 years ago

After quite a bit of back and forth, I have not been able to determine what your issue is in phyloseq, or what precise file, or file-structure, you are attempting to import... or for what purpose. For the moment I am closing this issue unless we can clarify a specific, actionable goal related to this issue.

Cheers, and best of luck

Joey

rtobes commented 11 years ago

Sorry for the delay in responding to your last comment. We have our own method for taxonomic assignment, named MG7, that uses nt as the reference database to classify the reads. We do a BLAST similarity-based taxonomic assignment, but in our system each read is independently assigned to one node of the official NCBI taxonomy tree. We would like to use phyloseq for the meta-analysis of diversity after taxonomic assignment, but we need to use the NCBI taxonomy tree, and in all your available examples the taxonomy data are in the peculiar format that QIIME uses for taxonomy. We would need some help working out how to use phyloseq with the NCBI taxonomy tree and the NCBI taxonomy format to indicate the classification for each read.

joey711 commented 11 years ago

There is infrastructure included in phyloseq for using custom taxonomy tables with custom-parsing requirements. In order to know what is needed, I would need an example output file (as opposed to the entire NCBI reference file). So what are your output file(s) from this MG7 method?

Also, if you parse the output into a table with OTUs as rows, and if those row-names are the same as for the OTUs in your table of counts, then this is already a usable format for phyloseq.
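As a rough sketch of what such parsing could look like when each row is an NCBI taxonomy node: the code below walks parent pointers up to the root and fills one column per rank. It uses base R only. The `nodes` table is a toy stand-in for the parent/rank/name information in the NCBI dump (the IDs and names are invented), the `counts` matrix is hypothetical rather than real MG7 output, and `lineage_row` is an illustrative helper, not a phyloseq function.

```r
# Toy stand-in for the NCBI node/name files (invented IDs and names).
# As in the NCBI dump, the root node is its own parent.
nodes <- data.frame(
  tax_id = c(1, 10, 20, 30, 31),
  parent = c(1, 1, 10, 20, 20),
  rank   = c("no rank", "superkingdom", "phylum", "genus", "genus"),
  name   = c("root", "KingdomA", "PhylumB", "GenusC", "GenusD"),
  stringsAsFactors = FALSE
)

# Ranks to keep as columns of the taxonomy table.
ranks <- c("superkingdom", "phylum", "genus")

# Walk from a node up to the root, recording the name at each wanted rank.
lineage_row <- function(id, nodes, ranks) {
  out <- setNames(rep(NA_character_, length(ranks)), ranks)
  repeat {
    i <- match(id, nodes$tax_id)
    if (is.na(i)) break                 # unknown id: keep what we have
    if (nodes$rank[i] %in% ranks) out[nodes$rank[i]] <- nodes$name[i]
    if (nodes$parent[i] == id) break    # reached the root
    id <- nodes$parent[i]
  }
  out
}

# Hypothetical counts: one row per assigned node, one column per sample.
counts <- matrix(c(12, 0,
                    4, 7,
                    3, 9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("30", "31", "20"),
                                 c("sampleA", "sampleB")))

# One taxonomy row per count row, keyed by the same row names.
taxmat <- t(sapply(rownames(counts),
                   function(id) lineage_row(as.numeric(id), nodes, ranks)))

# The row names must match between the two tables.
stopifnot(identical(rownames(taxmat), rownames(counts)))
```

With phyloseq loaded, the two matrices could then be combined with `phyloseq(otu_table(counts, taxa_are_rows = TRUE), tax_table(taxmat))`. Reads assigned to an internal node (such as the phylum-level row here) simply get `NA` for the lower ranks.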

pablopareja commented 11 years ago

Hi Joey, I just created a gist including a short extract of a sample output file from MG7: https://gist.github.com/pablopareja/6068349#file-mg7outputfileextract Please let me know if you have any questions about it. Thanks!

joey711 commented 11 years ago

I will re-open this and take a look. If it's straightforward, I will try to add something in phyloseq. If it's complicated, or there is not a clear data mapping between the MG7 output and data structures in phyloseq, it might have to wait.

I'll let you know.

Cheers

joey

joey711 commented 10 years ago

@pablopareja

I think this is just a tab-delimited table, with the first row the column names, and the first column the row names. You can import with R's standard read.table command, and then coerce to OTU-table class with otu_table.

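library(phyloseq)
# row.names=1 uses the first column as row (OTU) names, leaving only numeric counts
rawtab = read.table("MG7file", header=TRUE, sep="\t", row.names=1)
OTU = otu_table(rawtab, taxa_are_rows=TRUE)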

Let me know if this doesn't work, and I'll re-open the issue.