Closed sfujio closed 9 years ago
@sfujio
Thanks for the interesting and detailed question.
Although perfectly understandable in principle, in practice what you have suggested is actually quite dangerous, as there are many reasons your data might not cluster with the HMP data that are trivial/artifactual, and it will be difficult to know the difference if, in fact, your data clusters separately.
The immediate data-wrangling problem you're having is one (of several) big limitations of OTU clustering. Most OTUs are defined by arbitrary thresholds in sequence-space through an clustering procedure that is at least somewhat dependent on the input data; thus the clusters defined by the HMP project are based on its input data, whereas the ones in your dataset are based on your input data. Hence, the ID numbers (which are arbitrary anyway) and the tree are not really compatible, and it would be a fool's errand to try to force them together OTU-by-OTU. Some studies get around this problem by sticking to only a closed-reference OTU table against the same reference for all datasets. In this case the OTU tables should be merged together before the tree is added, and the tree should be the full tree of the reference database.
Cautions aside, there are several ways to do this in phyloseq. Which way you choose depends on your answers to the questions above.
Hi Joey, thanks for your quick replay. In my sequecning project I used primers for V3-V4 region of 16S rRNA and I was comparing it to the sequences of V3-V5 region of 16S rRNA from the HMP ( we used different primers). I knew that it´s not that simple to just compare sequence data from different sequencing projects considering all the different decisions one has to make in the experiment and the processing of the sequences, but I had come across some papers were they made this kind of comparison that led me think that comparing data from V3-V4 regions vs V3-V5 should work as an approximation.. but I wouldn´t know for a fact if that comparison is valid or not.
Since the data of the HMP is from 2011, the way those sequences were process seem not that adequate currently. If I compare the algorithm used in that project and in mine, they are not the same. They used "Otupipe" which it says to be based on USEARCH for the OTU picking process, which I haven´t been able to figure which of the 3 kinds of OTU picking process it uses, seems like de novo. I used open reference with UCLUST for my data.
I processed my data with Qiime and was using the Qiime processed sequence data from the HMP.
Regarding the patient cohort both studies were done in healthy patients, with the same age range (18-40) but differ some for the BMI (The HMP had a higher mean value and greater dispersion). That´s all the information I have in hand to compare.
Since I couldn´t calculate Unifrac distances for the lack of a tree I went on using distances developed in ecology analysis. I found a paper by J Kuczynski - 2010 (doi:10.1038/nmeth.1499) where they evaluated other distances aside from Unifrac to asses the similarity of different microbiota data, and used the one (Canberra distance) that gave them the best results for clustering purposes. But then found in C Lozupone 2013 paper´s supplementary (doi: 10.1101/gr.151803.112) that Canberra distance didn´t perform all that well distinguishing different human samples ( oral, skin, gut etc).. Bummer. From what I have studied it seems that maybe Hellinger transformation with PCA could perform well for sparse abundance data but I lack the knowledge confidence to know for sure. There are so many distances and they give different results that I am not sure how to choose the best one. ( This is a bit off the topic but is it valid to calculate distances over transformed data? as in, transform my data with Hellinger transformation and then calculate the distance between samples, does that make sense? or is the Hellinger transformation some sort of distance it self? I know from the formulae that is just the square root of the sample´s taxas proportion, but I am confused about it). ah! also, since the HMP sequence data ranges from 100 sequences for some samples up to 11,000 to others I rarefied the samples to 10,000 to make the comparison more meaningful.. I know you are really against rarefying but It didn´t make sense to me to compare samples with only 100 sequences against my samples that have like 70,000 sequences per sample in average. I know that rarefying is loosing information, I went from 319 samples of the HMP to 39 samples, but since my study has only 41 samples, I considered it was a good enough sample size to compare.
In the end, I get the feeling that the only proper way to make the comparison I want to do is by taking the raw data of both studies and process them together from scratch in order to have some sort of fair comparison. Only problem is that I don´t have the computer power to do that kind of analysis now...
Sorry for the long post! thank you for your time. It would be great if you could help me figure this out since I am very swamped.
Cheers.
"that comparing data from V3-V4 regions vs V3-V5 should work as an approximation.. but I wouldn´t know for a fact if that comparison is valid or not."
The data I've seen to date leads me to think the opposite. Different primers, different amplicon lengths, and so forth result in different PCR biases from one approach to another. Even if you cluster them altogether and/or use a reference only approach, my knee-jerk prediction is that the samples from your study will cluster separately from the HMP samples at a very high level (explain one of the first few axes in an ordination).
The distances and other questions related to how you should analyze your data are off-topic for this post, and not necessarily appropriate for this issue tracker.
Since the merge question is a clear phyloseq question, I'll quickly solve that one.
merge_taxa
to get around this, but this is meaningful only if the taxonomic classifications were made with the same reference.merge_phyloseq
merge_phyloseq
.Hello, I am trying to compare the results of DADA2 to QIIME on a set of samples I have analyzed. The same sequencing info for the same set of samples was fed into each pipeline for analysis. I have created 2 different phyloseq objects for each (including OTU table, sample data, and tree). I'm having a little trouble figuring out how to compare these data. Do I merge the phyloseq objects or is there a way to compare them while separate? If I merge them, I did add a column in each mapping file with the Pipeline info (i.e. DADA2 or QIIME) but would merging the sample data work since they have the same sample names? Also, sorry but I'm a little unclear about how to use merge_taxa to match the OTU IDs (we did use the same reference for classification)
Thanks for your help!
Hi Joey, In your post above you wrote:
In my case i have three objects P1 P2 P3 made with the same reference but with different OTU IDs . The three objects correspond to different human tissues but all the extractions, amplification and primers were exactly the same.
>P1
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5854 taxa and 565 samples ]
tax_table() Taxonomy Table: [ 5854 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 5854 tips and 5852 internal nodes ]
> P2
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7689 taxa and 148 samples ]
tax_table() Taxonomy Table: [ 7689 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 7689 tips and 7687 internal nodes ]
> P3
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5981 taxa and 100 samples ]
tax_table() Taxonomy Table: [ 5981 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 5981 tips and 5979 internal nodes ]
Goign through the documentation on merge_taxa it is not clear for me how to merge this three objects into one object with a tree. What would be the exact command line ?
Great if you could help on this ?
Can we bring this thread back to life please? Joey, I see above that you strongly recommend not combining datasets that may use different primers, different classifiers, or other parameters, but I have the unique situation of having 2 datasets that were created from the exact same conditions (as far as I know), just using different sequencing runs. I separately ran the qiime2 pipeline for each sequencing run, and now have 2 phyloseq objects that I'm attempting to merge in R.
Rather than re-run my qiime2 pipeline with both raw datasets combined, which takes a long time for all of these samples on my not-too-fast computer, it would be much easier to simply run some R code to merge the seemingly merge-able phyloseq objects, but I get the same error as the person at the beginning of this thread.
Can you help by giving more details on your previous suggestion for how to merge, which also isn't working for me? Thanks, I'm desperate.
Extreme Caution, Re: merging data from mismatched primers
"that comparing data from V3-V4 regions vs V3-V5 should work as an approximation.. but I wouldn´t know for a fact if that comparison is valid or not."
The data I've seen to date leads me to think the opposite. Different primers, different amplicon lengths, and so forth result in different PCR biases from one approach to another. Even if you cluster them altogether and/or use a reference only approach, my knee-jerk prediction is that the samples from your study will cluster separately from the HMP samples at a very high level (explain one of the first few axes in an ordination).
The distances and other questions related to how you should analyze your data are off-topic for this post, and not necessarily appropriate for this issue tracker.
How-To Merge
Since the merge question is a clear phyloseq question, I'll quickly solve that one.
- The OTU IDs between the studies need to match up. You can use
merge_taxa
to get around this, but this is meaningful only if the taxonomic classifications were made with the same reference.- Merge the OTU tables first, and add the full reference tree to this after the merge,
merge_phyloseq
- merge the sample data separately, then add it to the phyloseq object,
merge_phyloseq
.
Hi all, I wish to compare my sequencing results of stool samples with the data available in the Human Microbiome Project. What I want to do is to calculate Unifrac distance between all the samples and do a PCoA and see if my samples cluster apart from the ones of the HMP.
I have managed to create a phyloseq object of the HMP data, with the otu and tax table and the phylogenetic tree. The object also has the sample data so I could subset it to only keep the stool samples.
After applying some filtering criteria, I end up with 4632 taxa for my samples and 4568 taxa for the HMP´s stool samples.
The thing is, in order to calculate the unifrac distance I need to merge the 2 phlyloseq object ( mine and the HMP one) but since they were made in different experiments the OTU labels are different. I was thinking I could maybe use merge_phyloseq_pair for merging the otu table and tax table but for the phy tree, which is suggested to be merged using consensus function of ape, the merging doesn´t work because they have a different number of tips (Error in FUN(X[[i]], ...) : one tree has a different number of tips). Also I have read online (https://www.biostars.org/p/9433/) that the branch length of the consensus trees are all the same, which wouldn´t be so useful for the unifrac calculation purpose ( I suppose..).
Is there any way of doing what I want using phyloseq? Or do I have to go back to the sequence processing part and start doing the OTU picking process and the phy tree generation using my sequence and the HMP sequence so I can truly compare them?
Hope someones can help me, because I have been struggling with this for days, and need to get the result ASAP.
Thanks, cheers!