joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Support for additional (non-taxonomic) data in tax_table #324

Open davidelliott opened 10 years ago

davidelliott commented 10 years ago

Hi Joey

I have been using phyloseq a lot on many projects and find it very useful.

The sample data slot is really useful for analysing relationships between the microbiome and sample metadata, and for manipulating the phyloseq object in relation to sample data. The taxonomy table is similarly useful for manipulations and analyses based on taxonomic information. In many cases, though, we have information about OTUs that is not taxonomic but that we still want to subset or analyse by.

I wonder, therefore, whether it would be useful to have an extra slot in the phyloseq object for OTU data? This could hold any kind of information about OTUs and allow the sort of functionality already possible for sample data and taxonomic information. I have been achieving something similar by keeping separate tables containing such things as source information from GenBank records, manually curated possible functions, known associations, etc. I have also tried putting this non-taxonomic information into the taxonomy table so that I can manipulate it as shown in the example below. It seems to work; however, this is clearly not the intended purpose.

library(phyloseq)
data(GlobalPatterns)

# reduce data for example
expt = subset_taxa(GlobalPatterns, Phylum == "Bacteroidetes")

# rename the first column of the tax table (Kingdom), ready to hold OTU-related metadata
colnames(tax_table(expt))[1] <- "group"

# define a functional group for each OTU (randomly, for this example)
set.seed(1)  # make the random assignment reproducible
groups <- sample(c("oligotroph", "copiotroph", "unknown"), ntaxa(expt), replace = TRUE)

# overwrite the former Kingdom column in the tax table with the functional group
tax_table(expt)[, "group"] <- groups

# merge taxa belonging to the same functional group
expt.groups <- tax_glom(expt, taxrank = "group")

# visualise distribution of functional groups in each sample
plot_bar(expt.groups, fill = "group")

audy commented 10 years ago

I was also thinking about how this could be useful. For now, I just add data about OTUs to the tax table. Maybe this feature could be "added" by renaming tax_table to otu_data.

joey711 commented 10 years ago

I could see an OTU-observations slot becoming useful as we begin to catalog more information about specific OTUs. In general, though, it seems information is pretty limited (or essentially non-existent) about the OTUs from a diverse community sample. I still see a lot of utility in keeping the taxonomy information separate from any additional OTU data. It might be there's a clever way to do both in the same table that's still clear and reliable. I'm not keen on simply renaming the tax_table slot and related functions. One major reason for this is that the taxonomy is currently stored as a character matrix, which isn't appropriate for general data storage because it doesn't support numerical data. I would consider switching tax_table to a data.table and simply keep track of which columns are taxonomy. That might be the best solution, and hopefully not too hard to implement. I will be looking into it. Happy to entertain additional suggestions about it.
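To make the character-matrix limitation concrete, here is a minimal base R sketch. The otu_info table and its id_confidence column are hypothetical and only serve as illustration; nothing here is phyloseq's own code.

```r
# Hypothetical per-OTU table mixing text and numeric columns
otu_info <- data.frame(
  Phylum = c("Bacteroidetes", "Firmicutes"),
  id_confidence = c(0.97, 0.42),
  row.names = c("OTU1", "OTU2"),
  stringsAsFactors = FALSE
)

# A matrix holds a single type, so the numeric column is coerced to text
as_matrix <- as.matrix(otu_info)
stored_as_text <- is.character(as_matrix)

# Numeric comparisons then need an explicit conversion back
confident <- as.numeric(as_matrix[, "id_confidence"]) > 0.5
```

A data.frame or data.table keeps each column's type, which is why switching the storage class would remove the need for this round trip.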

@audy The other reason I'm not keen to rename tax_table is that it is a surprising amount of work to rename all the related functions, slots, documentation, etc., but accomplish nothing. tax_table is still vague enough that I could make it clear in documentation if additional data was also supported, and focus my extra development bandwidth on adding the functionality rather than tweaking the semantics.

davidelliott commented 10 years ago

I agree taxonomy and OTU data should ideally be separate, preferably in different tables, but you could do it in one table as you suggest. If you made clear in the documentation that the tax_table can hold additional data about the taxa, that would be fine from a user perspective. Because the tax table holds hierarchical information, I thought this might not work very well. Could it also affect performance if a lot of information is stored?

As for information about OTUs being limited, yes, it certainly is for diverse community samples, but ultimately researchers often want to know about function, and phyloseq helps us to simplify the experiment to make things more manageable. In the short term I expect an OTU data slot to be sparsely filled with manually entered, hard-won information on perhaps a few OTUs of interest, e.g. from literature searches, experiments, and the ongoing community analysis (e.g. co-occurrence groups). Later it could be filled by automated means, bringing in information from the upstream analysis (e.g. the actual sequence, ID confidence, accession of ID) and from databases like GenBank or MetaCyc.

Thanks

audy commented 10 years ago

One use case for an otu_data attribute in phyloseq is incorporating the predicted metagenome (KEGG orthologies/contributions) from PICRUSt. Currently, I'm just using a pseudo taxonomy table for this.

joey711 commented 10 years ago

@davidelliott @audy Upon a night's rest and further reflection, I think a separate table probably makes more sense. There are ways to implement it as one table but have the same functionality for the user, but I can see how this would get confusing compared with the clear separation of other component types in phyloseq.

So @audy, the "pseudo taxonomy" table you're using has OTU IDs for rows, and predicted presence of different gene orthologs as columns? Definitely a good idea to have this as a separate table... I'll try to post some examples soon.

@davidelliott I like the forward looking idea, and wouldn't it be great to have automatically-populated experimental information for OTUs? Creating a helpful data representation is clearly the easy part :)

Thanks for the comments/discussion. I'll make a trial branch or two with some ideas that we can play around with.

willnotburn commented 9 years ago

Hi all, I just want to add that allowing for an additional "otu_data" slot within a phyloseq object would be super-handy. PICRUSt users are okay with throwing away most OTUs as the price for gaining information about the remaining few. Making space for otu_data would help streamline a whole different category of analyses.

audy commented 9 years ago

> PICRUSt users are okay with throwing away most OTUs as the price for gaining information about the remaining few

Care to elaborate?

willnotburn commented 9 years ago

I was referring to disadvantages of having to use closed-reference OTUs (qiime) aka phylotypes (mothur) in exotic and/or diverse environments. PICRUSt only works with phylotypes, excluding most of my de novo clustered OTUs, including some of the more abundant ones. It makes a difference in community-level ordinations using de novo clusters and a subset of those that could be run through PICRUSt.

Having said that, using PICRUSt results in a nice "otu_data" table. If ported to phyloseq, it could be used for functional analyses. Maybe phyloseq can parse KO group levels, just like it does taxonomy levels.

It would be exciting stuff in phyloseq: visualizing distribution of various KO levels as easily as phyloseq makes visualizing various Tax levels.

audy commented 9 years ago

> Having said that, using PICRUSt results in a nice "otu_data" table. If ported to phyloseq, it could be used for functional analyses. Maybe phyloseq can parse KO group levels, just like it does taxonomy levels.

You can do this. I have done this. It's not even that difficult.

  1. Run the predict_metagenomes script and wait a few hours while PICRUSt does some multiplication.
  2. Take the output biom format file and convert it to CSV so that other programs and UNIX tools can read it. This is your "OTU" table but each "OTU" is now a kegg-id.
  3. Dig into the internals of PICRUSt and find the kegg-id -> KO mapping. Turn it into a proper CSV file and make that your taxonomy table (I still have mine; it's attached).

edit: here is my kegg-id to KO mapping file. I generated this a long time ago so I don't know if it's up to date: https://www.dropbox.com/s/0h2z9lhelfjt7ha/kegg-orthologies.csv.gz?dl=0

audy commented 9 years ago

> I was referring to disadvantages of having to use closed-reference OTUs (qiime) aka phylotypes (mothur) in exotic and/or diverse environments. PICRUSt only works with phylotypes, excluding most of my de novo clustered OTUs, including some of the more abundant ones. It makes a difference in community-level ordinations using de novo clusters and a subset of those that could be run through PICRUSt.

This is unrelated to the issue but you could try doing closed reference OTU picking, then open-reference on any reads that didn't closely map to a reference (hybrid-OTU picking?). I believe QIIME and friends support this. USEARCH/UCLUST and CD-HIT definitely do as well. It's likely that a lot of your OTUs map closely to references.

DeniRibicic commented 8 years ago

Hi audy,

Would you please explain once more how to make the changes and import predicted_metagenome into phyloseq? Step 3 confuses me somewhat, probably because I can't access the provided Dropbox link. Thanks, Deni

audy commented 8 years ago

@DeniRibicic Sorry, that file is long-gone. The idea is that you can load the KOs in the same way as you load the taxonomic descriptions.

The file would look something like:

KEGG_ID,KO_1,KO_2,KO_3
KO1234,Metabolism,Breakfast,Waffle
KO2345,Metabolism,Lunch,BBQ Sandwich
...

I generated this file from one of the KO files that came with PICRUSt.

You then load the "OTU" table generated by predict_metagenomes, which gives the abundance (or number of reads) of each KO in each sample.

KEGG_ID,sample_1,sample_2
KO1234,0,1
KO2345,1,1

Treat the former as the tax_table and the latter as the otu_table and you can trick phyloseq into thinking KOs are OTUs and use all of the fancy phyloseq plotting functions on your PICRUSt output.
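The trick described above could be sketched roughly as follows. The inline CSV text stands in for the two files (the values are the made-up ones from this comment), and the object names ko_levels, ko_counts and ps are hypothetical. With real files you would pass a file path to read.csv instead of text =.

```r
library(phyloseq)

# Toy stand-ins for the two CSV files described above
ko_csv <- "KEGG_ID,KO_1,KO_2,KO_3
KO1234,Metabolism,Breakfast,Waffle
KO2345,Metabolism,Lunch,BBQ Sandwich"

abund_csv <- "KEGG_ID,sample_1,sample_2
KO1234,0,1
KO2345,1,1"

# The first column becomes the row names, which must match between both tables
ko_levels <- as.matrix(read.csv(text = ko_csv, row.names = 1, stringsAsFactors = FALSE))
ko_counts <- as.matrix(read.csv(text = abund_csv, row.names = 1))

# KO hierarchy goes in as the "taxonomy", KO abundances as the "OTU" table
ps <- phyloseq(
  otu_table(ko_counts, taxa_are_rows = TRUE),
  tax_table(ko_levels)
)

# Taxonomy-aware functions now operate on KO levels instead of ranks
ps_ko1 <- tax_glom(ps, taxrank = "KO_1")
```

From here, plot_bar(ps, fill = "KO_2") and similar calls treat the KO levels like taxonomic ranks.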

DeniRibicic commented 8 years ago

Hi audy,

I presume the PICRUSt files you are referring to are the ko_precalculated 13-5 or 18-may ones? When I look into them, they actually differ from each other: 13_5 doesn't have the KEGG descriptions or pathways that I would need for a "taxonomy table". Anyway, I'm still puzzled about how to generate the "tax_table" file from ko_18-may, probably because of my limited knowledge of R and matrix manipulation. I guess I need to turn the rows into columns in this case? And in the example you gave, the columns are named KO_1, KO_2, etc.; are those just placeholder names? Would you mind explaining how to manipulate such a matrix? I can barely load it into R.

Best, Deni

iimog commented 7 years ago

Hi @joey711, is it possible by now to import OTU metadata tables and e.g. plot ordinations on that data? Or is it still necessary to import this information as pseudo-taxonomy tables? I have biom files that contain additional metadata for my (plant) OTUs. These metadata are traits like "flower color". I would love to have an ordination of my community colored by those traits. Best, Markus

joey711 commented 7 years ago

@iimog The only hard restriction on the data is that it must be stored as character strings, which most data can be coerced to, so the example you give would already work in that regard. You can name the "taxonomic ranks" (AKA the table columns) whatever you want, so that's not at all restrictive. There is a notion of hierarchy built into the taxonomy table that affects certain taxonomy-aware functions, like tax_glom, but this is a small restriction, especially if your additional data is located to the right of the taxonomy columns.
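As a rough sketch of that layout, the snippet below appends a made-up trait column to the right of the existing ranks. The column name flower_color and its random values are purely illustrative.

```r
library(phyloseq)
data(GlobalPatterns)

# reduce the data for the example
gp <- subset_taxa(GlobalPatterns, Phylum == "Bacteroidetes")

# pull the taxonomy out as a plain character matrix
tt <- as(tax_table(gp), "matrix")

# append a hypothetical trait column to the RIGHT of the real ranks,
# so rank-based functions such as tax_glom(gp, "Family") see the usual hierarchy
set.seed(1)
tt <- cbind(tt, flower_color = sample(c("white", "yellow"), nrow(tt), replace = TRUE))
tax_table(gp) <- tax_table(tt)

# the extra column can be used like any rank for subsetting
gp_white <- subset_taxa(gp, flower_color == "white")
```

Because the column is stored as text, any numeric trait added this way would need as.numeric() before it is used in comparisons.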

This is a long-winded way of saying that yes, it is still necessary to structure the data in pseudo-taxonomy tables at the moment in phyloseq. The biomformat package should read your biom data just fine, however, and you can always use that as a convenient workaround for augmenting the plot tables embedded in your ordination ggplots (see myPlot$data).
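A minimal sketch of the plot-data workaround, under some assumptions: the traits data frame is hypothetical and kept outside the phyloseq object, the ordination method is chosen only because it provides taxa scores, and the row names of the plot's data are assumed to be the OTU IDs.

```r
library(phyloseq)
library(ggplot2)
data(GlobalPatterns)

# keep the 50 most abundant taxa so the example runs quickly
top50 <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:50]
gp <- prune_taxa(top50, GlobalPatterns)

# hypothetical per-OTU trait table, stored separately from the phyloseq object
set.seed(1)
traits <- data.frame(
  OTU = taxa_names(gp),
  flower_color = sample(c("white", "yellow", "blue"), ntaxa(gp), replace = TRUE),
  stringsAsFactors = FALSE
)

# ordinate and build a taxa plot
ord <- ordinate(gp, method = "CCA")
p <- plot_ordination(gp, ord, type = "taxa")

# p$data has one row per taxon; this assumes its row names are the OTU IDs
p$data$flower_color <- traits$flower_color[match(rownames(p$data), traits$OTU)]

# colour the points by the added trait
p_colored <- p + aes(colour = flower_color)
```

Since the trait lives only in the plot's data frame, it can be numeric or any other type, but it is not available to phyloseq functions such as subset_taxa.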

I still think this would be a cool feature for phyloseq to support, and the reason I've waited so long is that it is best solved with a more general restructuring of the phyloseq data classes that would better match the biom-format. My effort will probably occur in the biomformat package, and then be exposed in phyloseq once it is ready.

Sorry in advance for what is probably a disappointing answer (for now). I hope some of my suggestions will help you accomplish your needs in spite of it.

Cheers

joey

iimog commented 7 years ago

Thanks for your quick and detailed answer. It is not at all disappointing. I'm looking forward to your implementation in the biomformat package and in the meantime the workaround using the taxonomy table works for me. Thanks again, Markus

nitschkematthew commented 5 years ago

Hi @joey711. After a quick read through this thread, and with regard to the restriction of OTU metadata (in its current form in phyloseq, the taxonomy table) to character strings: there are certainly some useful cases for having a numeric OTU metadata table.

For example, I use DADA2 to generate ASVs and then the inbuilt RDP classifier (assignTaxonomy) to annotate the ASVs. You can pass outputBootstraps = TRUE to assignTaxonomy and get back a numeric matrix of the bootstrap confidence values for each taxonomic rank. Because I often use custom reference databases, I like to explore the minimum bootstrap thresholds to see how well the classifier is performing, filter out or set to NA the ASVs that get very low scores, and then see how this affects the data and the story it tells.

However, this currently requires filtering outside of phyloseq, remaking the phyloseq object, and then plotting, each time I want to explore a different cutoff value. It would be absolutely fantastic to have a slot for numeric OTU metadata to be able to pipe in all of the DADA2 outputs!
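The outside-of-phyloseq filtering step might look something like this base R sketch. The two toy matrices stand in for the tax and boot elements that assignTaxonomy returns with outputBootstraps = TRUE, and mask_low_confidence is a hypothetical helper, not part of DADA2 or phyloseq.

```r
# Toy stand-ins for the taxonomy and bootstrap matrices (same dimensions)
tax <- matrix(
  c("Bacteria", "Bacteria", "Firmicutes", "Proteobacteria"),
  nrow = 2,
  dimnames = list(c("ASV1", "ASV2"), c("Kingdom", "Phylum"))
)
boot <- matrix(c(100, 98, 85, 30), nrow = 2, dimnames = dimnames(tax))

# Hypothetical helper: set assignments below the bootstrap cutoff to NA
mask_low_confidence <- function(tax, boot, min_boot = 50) {
  stopifnot(identical(dim(tax), dim(boot)))
  tax[boot < min_boot] <- NA_character_
  tax
}

# Try one cutoff; repeat with other values to explore the thresholds
filtered <- mask_low_confidence(tax, boot, min_boot = 50)
```

The filtered matrix can then be passed to tax_table() and assigned back to an existing phyloseq object, which avoids rebuilding the other components for each cutoff.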

Thanks for your hard work! Matt

TJrogers86 commented 4 years ago

Hello! I was reading through this, and was wondering if it is possible to add a metabolic data table to a phyloseq object? Currently I have a phyloseq object that has an otu_table(), tax_table(), and sample_data(). I am working with metagenome-assembled genomes (MAGs) and have predicted genes/metabolisms for each MAG. It would be amazing to be able to include these in my phyloseq object. I have attached an example of what I mean by a metabolic data table. The numbers give the fraction of genes present in each pathway or metabolism, in decimal form.
example_of_metabolic_table.xlsx

cjfields commented 3 years ago

I wonder if this is something that would be best handled with the proposed MicrobiomeExperiment framework? This seems to be picking up again: https://github.com/FelixErnst/MicrobiomeExperiment