joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

Add external reference sequences to trees #1150

Open RJ333 opened 5 years ago

RJ333 commented 5 years ago

Hello everyone,

I came repeatedly across the situation that a tree built from my own sequence data is interesting, but would be more robust and meaningful, if I would add external reference sequences. Assume you study biodegradation and can include full 16S rRNA gene sequences from known degraders.

Using the ape or phangorn package, this is most likely not a problem. However, after importing data to phyloseq we lose all items which are not present in all the object slots. This means I would have to create an artificial sample, add the taxonomy manually and make up some sample data.

Although this is totally possible, I was wondering if there could be a more convenient way to do this? Basically I would wish that I could just add some sequences to the refseq slot within phyloseq and then build a tree with it. What do you think about this?

mikemc commented 5 years ago

For this sort of thing I would just work with the DNAStringSet and tree contained in the refseq(ps) and phy_tree(ps) slots individually, and then there is no problem. However, if there is a reason why you'd like to have everything in a phyloseq object (perhaps to use phyloseq's plot_tree function), then adding these taxa to the OTU table and setting their abundances to 0 will be necessary and seems like a perfectly fine thing to do. There is no need to create an artificial sample. You would also need to add them to the tax table, though you could set the taxonomy to NA's if you don't know it.

RJ333 commented 5 years ago

Hey @mikemc, thanks for the quick response.

My idea was to create an artificial reference library, which therefore would require entries for sample_data, otu_table, tax_table, and refseq, right? That way it would be easy to remove the added sequences by removing the sample if required.

But your solution of just amending the Otu, Taxonomy and Refseq entries and setting the abundances to 0 also sounds like a good idea, they could be as easily removed by subsetting if they are not needed anymore.

But I'm actually not sure how to accomplish this, how would you add e.g. a single OTU with taxonomy and sequence to an existing phyloseq object?

mikemc commented 5 years ago

When you say "artificial reference library" do you mean with just the external sequences, or with all the sequences?

Which route you go seems to me to depend on what type of downstream visualization and analysis you are trying to do. Just to make sure we're on the same page, for building a tree using e.g. phangorn, the starting point is just an alignment in (if I recall) a DNAStringSet object, so you just need to combine your external sequences with your own sequences into a DNAStringSet object, there is no point in making a phyloseq object first. What is it you want to do after that?

Some points that might be helpful: As currently implemented, Phyloseq objects always require an otu_table, but everything else (including sample data) is optional. The merge_phyloseq() function can be used to merge phyloseq objects with different taxa or samples, but it can't merge trees with different taxa.

RJ333 commented 5 years ago

Yes I also looked at the merge_phyloseq option, but I'm not sure if I can use it that way.

So my goal would be (optimally) to be able to add external sequences to an existing phyloseq object. Say I've performed my experiment, I'm analysing my data using phyloseq, I built a tree using e.g. phangorn and when my analysis detected important OTUs and I look them up in the plot_tree result I'm wondering how close these OTUs from my experiment might be to a known sequence of e.g. an cultivated organism described in literature.

I'm sure that it would be no problem to do this outside of the phyloseq environment (and the tree calculation itself is, as I understand, independent of phyloseq anyway) but I would like to make use of the phyloseq functionality of linked entries and therefore have the additional sequences within the object. This would allow e.g. convenient subsetting and tree plotting. I actually use phyloseq as "formal standardization" to guarantee data integrity and I try to stay within the environment as long as possible. And if I would calculate the tree outside of phyloseq, wouldn't it be cut down to only those OTUs that are present within the phyloseq object if I try to merge it?

It would probably not be complicated to manipulate a given OTU table etc. before importing them to phyloseq, but I assume at that point I wouldn't even know which reference sequences could be of interest and should therefore be added later on.

Speaking of a reference library I was thinking of adding a synthetic sample, which only contains the external reference sequences that I want to add to the tree calculation. This synthetic sample would get respective linked entries to otu_table, sample_data and tax_table before merging. This sample then could easily be removed if hindering other analysis and selectively marked in the plot_tree based on the sample_data entry. But if this is too much work I will try this outside of phyloseq.

Besides that I would still be interested in how to add e.g. an OTU to an existing otu_table or tax_table in phyloseq? Is this possible?

Thanks for your effort!

mikemc commented 5 years ago

That all makes sense and it is certainly possible to create a single phyloseq object with the experimental data as well as the reference taxa. Whether you create an artificial sample that has just reference taxa is entirely up to you. It seems like the main reason to do this is to be able to easily select the reference taxa. Other ways to do this are to prefix their OTU names with e.g. "ref", or add a final column to the tax table that indicates whether the taxon is a reference taxon or not. The reason why I'm spending so much time on this is that working out the end point is essential before working out what specific steps to take to add the reference taxa to the starting phyloseq object.

Supposing that you do want an artificial sample, then I think the following should work fine. Take ps to be your initial phyloseq object for your experimental data with an empty tree slot. Then, create a new phyloseq object, ps.ref that has just the reference taxa and a single sample called "Reference" or equivalent. You will need to add a sample_data object for the artificial sample to be added when you call merge_phyloseq in the next step; otherwise, the taxa will be added and the sample will be dropped, which as I noted above is also fine. Add taxonomy and reference sequences to ps.ref as well if you want to have those in the final phyloseq object. Then, run ps.merged <- merge_phyloseq(ps, ps.ref). Then, create a phylogenetic tree with all the taxa, and add this to ps.merged with ps.merged <- phyloseq(ps.merged, tree).

I suggest first working through this with a very small subset of your data

ps0 <- prune_taxa(taxa_names(ps)[1:5], ps)
ps0 <- prune_samples(sample_names(ps0)[1:5], ps0)

so that you can visually make sure that the result of the phyloseq components after calling merge_phyloseq() is as you expect. The reference taxa should all have abundances of 0 in the otu_table() in the real samples, for example.

Once you have created a DNA string set with the sequences for the reference taxa (I'll call this rs), you can make the sample_data and otu_table for ps.ref by

otumat <- matrix(1, nrow = 1, ncol = length(rs))
colnames(otumat) <- names(rs)
rownames(otumat) <- "Reference"
OTU <- otu_table(otumat, taxa_are_rows = FALSE)

SAM <- sample_data(ps)[1,]
SAM[,] <- NA
sample_names(SAM) <- "Reference"
# or : rownames(SAM) <- "Reference"
ps.ref <- phyloseq(OTU, SAM, rs)
RJ333 commented 5 years ago

So, first of all I would really like to thank you for the energy you spent on this topic. I followed your instructions and it worked out exactly as I wanted, just make sure to use ps.merged <- merge_phyloseq(ps.merged, tree) at the end instead of <- phyloseq().

For testing purposes, I used the last 5 entries of my existing fasta file of OTU representatives, replaced the header with Ref and read it as new DNA string set

# command line
tail OTU_reps_fasta_002.fasta | sed 's/Otu/Ref/g' > example_ref.fasta
# this in R
library(Biostrings)
reference_seqs <- readDNAStringSet(file = "example_ref.fasta",
  format = "fasta", nrec = -1L, skip = 0L, seek.first.rec = FALSE, use.names = TRUE)
rs <- reference_seqs

I added the taxonomy using this small chunk of code (for 5 reference sequences with 6 taxonomic levels, which is appropriate for my mothur-imported data), which might be helpful to others:

# required libraries
library(dplyr)
library(tidyr)

# generate taxonomy levels from strings
tax_df <- data.frame(wholetax = c("level1.level2.level3.level4.level5.level6",
  "peter.paul.mary.joe.chandler.rachel",
  "one.two.three.four.five.six",
  "and.another.line.and.another.line",
  "now.it.is.enough.with.examples"))
rownames(tax_df) <- names(rs)  
tax_df <- tax_df %>% separate(wholetax, c("Rank1", "Rank2", "Rank3", "Rank4", "Rank5", "Rank6"))
ref_taxa <- tax_table(as.matrix(tax_df))
> ref_taxa
Taxonomy Table:     [5 taxa by 6 taxonomic ranks]:
          Rank1    Rank2     Rank3    Rank4    Rank5      Rank6
Ref116222 "level1" "level2"  "level3" "level4" "level5"   "level6"
Ref116223 "peter"  "paul"    "mary"   "joe"    "chandler" "rachel"
Ref116224 "one"    "two"     "three"  "four"   "five"     "six"
Ref116225 "and"    "another" "line"   "and"    "another"  "line"
Ref116226 "now"    "it"      "is"     "enough" "with"     "examples"

ps.ref <- phyloseq(OTU, SAM, ref_taxa, rs)

the tree calculation was then performed as described in https://f1000research.com/articles/5-1492/v2:

#load libraries
library(DECIPHER)
library(phangorn)
# calculate tree
seqs <- refseq(ps.merged)
alignment <- AlignSeqs(seqs, anchor = NA)
phang.align <- as.phyDat(alignment, type = "DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data = phang.align)
fitGTR <- update(fit, k = 4, inv = 0.2)
fitGTR <- optim.pml(fitGTR, model = "GTR", optInv = TRUE, optGamma = TRUE,
                       rearrangement = "stochastic", control = pml.control(trace = 0))
detach("package:phangorn", unload = TRUE)
# merge ps object with tree
ps.merged <- merge_phyloseq(ps.merged, phy_tree(fitGTR$tree))

and so, starting with a phyloseq object without tree or reference sample/sequences

> ps0
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 10 taxa and 10 samples ]
sample_data() Sample Data:       [ 10 samples by 13 sample variables ]
tax_table()   Taxonomy Table:    [ 10 taxa by 6 taxonomic ranks ]
refseq()      DNAStringSet:      [ 10 reference sequences ]

I now added a sample that includes sequences based on my requirements ready for tree plotting etc

> ps.merged
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 15 taxa and 11 samples ]
sample_data() Sample Data:       [ 11 samples by 13 sample variables ]
tax_table()   Taxonomy Table:    [ 15 taxa by 6 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 15 tips and 13 internal nodes ]
refseq()      DNAStringSet:      [ 15 reference sequences ]

Thank you very much!

jimenaortiz20 commented 3 years ago

Hello everyone,

I want to do the same, add a reference sequence to my phylogenetic tree made in phyloseq. I downloaded a sequence from the gene bank (NCBI) and followed the steps they indicated, but it didn't come out.