joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Merge taxonomy table and frequency table with multiple samples #1163

Closed henganl2 closed 4 years ago

henganl2 commented 5 years ago

Hi,

My data are ITS+LSU regions from an Oxford Nanopore sequencer and contain 96 samples. I ran BLAST and generated the taxonomy table (TAX) and frequency table (OTU) separately for each sample.

What should I do if I want to combine the 96 samples into a single phyloseq object and run the analysis on it?

I tried the following code on two samples, but it merges the two samples into one:

mergephyseq <- merge_phyloseq_pair(physeq15, physeq75)

I've also checked merge_samples() and merge_taxa(), but neither of them seems to be the way to go.

I would appreciate any suggestions. Thanks in advance!

mikemc commented 5 years ago

What are physeq15 and physeq75? If these are two phyloseq objects with different sets of samples, then you should look into the merge_phyloseq() function (without the _pair at the end).
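As a minimal sketch of that suggestion, the snippet below builds two toy single-sample objects and merges them. The object names, sample names, counts, and genera are made up for illustration, and the example assumes that identical taxa names in the two objects really refer to the same OTU.

```r
library(phyloseq)

# Hypothetical toy data: "OTU2" is assumed to be the same OTU in both objects.
otu_a <- otu_table(matrix(c(10, 5), ncol = 1,
                          dimnames = list(c("OTU1", "OTU2"), "sampleA")),
                   taxa_are_rows = TRUE)
tax_a <- tax_table(matrix(c("Alternaria", "Fusarium"), ncol = 1,
                          dimnames = list(c("OTU1", "OTU2"), "Genus")))
otu_b <- otu_table(matrix(c(3, 7), ncol = 1,
                          dimnames = list(c("OTU2", "OTU3"), "sampleB")),
                   taxa_are_rows = TRUE)
tax_b <- tax_table(matrix(c("Fusarium", "Mucor"), ncol = 1,
                          dimnames = list(c("OTU2", "OTU3"), "Genus")))

physeq_a <- phyloseq(otu_a, tax_a)
physeq_b <- phyloseq(otu_b, tax_b)

# merge_phyloseq() keeps both samples and takes the union of the taxa names.
merged <- merge_phyloseq(physeq_a, physeq_b)
ntaxa(merged)
nsamples(merged)
```

For many objects (such as 96 single-sample objects) held in an unnamed list, here called physeq_list as a hypothetical name, the same call could be written as do.call(merge_phyloseq, physeq_list).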

henganl2 commented 5 years ago

Hi @mikemc,

Thanks! physeq15 and physeq75 are two phyloseq objects with two different sets of samples. merge_phyloseq() works!

Thanks again!

henganl2 commented 4 years ago

Hi @mikemc,

I have a follow-up question about this issue. I've been using merge_phyloseq() to merge my phyloseq objects for a while, but I recently found a problem. When I merge multiple phyloseq objects with different tax tables and frequency tables, it seems to merge the OTUs based on the OTU number, not on the taxonomy information.

To make my question clearer, here is an example:

physeq01
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1212 taxa and 1 samples ]
tax_table()   Taxonomy Table:    [ 1212 taxa by 7 taxonomic ranks ]
refseq()      DNAStringSet:      [ 1212 reference sequences ]

physeq02
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6030 taxa and 1 samples ]
tax_table()   Taxonomy Table:    [ 6030 taxa by 7 taxonomic ranks ]
refseq()      DNAStringSet:      [ 6030 reference sequences ]

test <- merge_phyloseq(physeq01, physeq02)
test
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6498 taxa and 2 samples ]
tax_table()   Taxonomy Table:    [ 6498 taxa by 7 taxonomic ranks ]
refseq()      DNAStringSet:      [ 6498 reference sequences ]

So I checked the otu_table and tax_table of those phyloseq objects by picking out a single OTU:

In physeq01
        count     taxonomy
OTU3    113       Alternaria

In physeq02
        count     taxonomy
OTU3    109       Fusarium

In test
        count     taxonomy
OTU3    113 109   Alternaria

So it is actually merging two different OTUs into the same one. I think the problem is similar to #574.

Any suggestions for solving this problem?

Thanks!!

mikemc commented 4 years ago

Phyloseq uses the otu/taxa names (as given by taxa_names(physeq)) as the fundamental identifier of an otu/taxon. If you want to merge phyloseq objects, then it is very important to either make the taxon names consistent, or make them completely distinct. That is, if you have taxa named "OTU3" in physeq01 and physeq02, these must mean the same OTU if you are going to merge them. If OTU3 means different things in each, then you should change the names before merging. If you plan to do taxonomy-based rather than OTU-based analysis, then you could just make the OTU names in each phyloseq object unique by, for example, adding physeq01_ to the beginning of the OTU names from physeq01,

taxa_names(physeq01) <- paste0("physeq01_", taxa_names(physeq01))

and similarly for physeq02, before merging. That way the OTUs from the two phyloseq objects will be kept separate, but you can still merge them by taxonomy using tax_glom(), e.g. to the genus level.
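The workflow described above can be sketched end to end as follows. The helper make_ps() and all counts and genera are hypothetical toy data, chosen so that "OTU3" is Alternaria in one object and Fusarium in the other, as in the question; this is an illustration under those assumptions, not a recipe for real data.

```r
library(phyloseq)

# Hypothetical helper: builds a single-sample object with ranks Kingdom and Genus.
make_ps <- function(sample, counts, genera) {
  ids <- paste0("OTU", seq_along(counts))
  otu <- otu_table(matrix(counts, ncol = 1, dimnames = list(ids, sample)),
                   taxa_are_rows = TRUE)
  tax <- tax_table(matrix(c(rep("Fungi", length(ids)), genera), ncol = 2,
                          dimnames = list(ids, c("Kingdom", "Genus"))))
  phyloseq(otu, tax)
}

# "OTU3" means a different genus in each object.
physeq01 <- make_ps("s01", c(20, 8, 113), c("Mucor", "Fusarium", "Alternaria"))
physeq02 <- make_ps("s02", c(15, 4, 109), c("Mucor", "Alternaria", "Fusarium"))

# Make the taxa names distinct so that unrelated OTUs are never combined.
taxa_names(physeq01) <- paste0("physeq01_", taxa_names(physeq01))
taxa_names(physeq02) <- paste0("physeq02_", taxa_names(physeq02))

# All six OTUs are kept separate after merging.
merged <- merge_phyloseq(physeq01, physeq02)

# Agglomerate by taxonomy: OTUs sharing Kingdom and Genus are summed.
genus_level <- tax_glom(merged, taxrank = "Genus")
ntaxa(genus_level)
```

After tax_glom(), each remaining taxon keeps the name of one of its member OTUs, so the Genus column of tax_table(genus_level) is the place to read the labels.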

If this still doesn't make sense, it might help to think back on how you created your OTUs and taxonomy assignment, and read up on the different types of OTUs (e.g. closed vs. open reference), and the challenges with using OTUs instead of ASVs when merging amplicon datasets (for the latter, see http://www.nature.com/doifinder/10.1038/ismej.2017.119)

joey711 commented 4 years ago

@mikemc thanks, great answer.

@henganl2 if you actually want to compare the samples using Mike's suggestion of agglomerating to the species or genus (or higher) level, you would use tax_glom() first on each object, and then merge_phyloseq() on the two tax_glommed objects. This approach requires that your taxonomy assignments used the same method and reference database, and the caveats alluded to by @mikemc are also important. It can work for some biological questions, but I would not recommend it as a general practice.

The best approach applies if your sequences come from the same target locus (e.g. V4): then you also have the option to set the "OTU ID" based on the ASV sequences themselves, or on a short identifier that is consistent across ASV sequences in the two datasets. This requires that the sequences were trimmed to the same positions of the locus; if this is not the case, you can just re-run denoising (e.g. dada2) after fixing the trimming to be consistent between the two datasets. It's a little extra work, but being able to track the same biological sequence across all your data is a pretty large gain in interpretation, especially if the taxonomy database does not provide sufficient coverage or resolution for your research problem.
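A rough sketch of the glom-then-merge route is below. One step is an assumption added here rather than something stated in the thread: because tax_glom() leaves each taxon named after one of its member OTUs, the hypothetical helper glom_to_rank() renames the glommed taxa by their lineage so that the names are consistent between objects before merging. The helper make_ps() and the toy counts are also made up for illustration.

```r
library(phyloseq)

# Hypothetical helper: builds a single-sample object with ranks Kingdom and Genus.
make_ps <- function(sample, counts, genera) {
  ids <- paste0("OTU", seq_along(counts))
  otu <- otu_table(matrix(counts, ncol = 1, dimnames = list(ids, sample)),
                   taxa_are_rows = TRUE)
  tax <- tax_table(matrix(c(rep("Fungi", length(ids)), genera), ncol = 2,
                          dimnames = list(ids, c("Kingdom", "Genus"))))
  phyloseq(otu, tax)
}

# Hypothetical helper: agglomerate, then name each taxon by its lineage up to `rank`.
# Assumes both objects were classified with the same method and reference database.
glom_to_rank <- function(ps, rank = "Genus") {
  glommed <- tax_glom(ps, taxrank = rank)
  tax <- as(tax_table(glommed), "matrix")
  upto <- seq_len(match(rank, colnames(tax)))
  taxa_names(glommed) <- unname(apply(tax[, upto, drop = FALSE], 1,
                                      paste, collapse = ";"))
  glommed
}

physeq01 <- make_ps("s01", c(20, 8, 113, 5),
                    c("Mucor", "Fusarium", "Alternaria", "Alternaria"))
physeq02 <- make_ps("s02", c(15, 4, 109),
                    c("Mucor", "Alternaria", "Fusarium"))

# Names such as "Fungi;Alternaria" now mean the same thing in both objects.
merged <- merge_phyloseq(glom_to_rank(physeq01), glom_to_rank(physeq02))
taxa_names(merged)

# For the sequence-based alternative, if both objects store consistently trimmed
# ASV sequences in refseq(), the names could be set from the sequences instead
# (untested sketch, `ps` is a placeholder):
# taxa_names(ps) <- unname(as.character(refseq(ps)))
```

The lineage-based names only line up across objects when the taxonomy strings match exactly, which is one reason the same classifier and reference database are needed on both sides.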

Hope that helps!

joey711 commented 4 years ago

I will close for now, but feel free to re-open/comment as needed.