liamrevell / phytools

GNU General Public License v3.0
207 stars 56 forks source link

SCM extracting values at internal nodes (or an approximation of it) #94

Closed HedvigS closed 3 years ago

HedvigS commented 3 years ago

I'm doing ancestral state reconstruction of discrete traits using cultural data. I really like phytools and I've been using many of its functions. I'm comparing ASR done by just plain max parsimony to results from max likelihood (using corHMM::corHMM) and SCM (phytools::make.simmap()). Then I'm also comparing those three sets of results to reconstruction done "by hand" by distinguished scholars in my field (comparative linguistics). These scholars have reconstructed specific states at specific internal nodes, in my case proto-languages. For example, "proto-Polynesian had an absence of trait 24" and so on. For max parsimony and max likelihood I can extract the state of a particular internal node and compare this to the "manual" reconstructions. However, from make.simmap() I get out time spent in state along branches - not the state of internal nodes. I've read Bollback (2006) and, though I'm mostly self-taught in tree-thinking, I think I understand why this is. In a sense, and please forgive me if I'm wrong, nodes don't really exist in a meaningful way for SCM since it's all about the edges.

Is that a correct reading of what SCM does? And is there something I can do to get something comparable to state at specific internal nodes? Take the time spent in state right before and right after a particular internal node? Hm, sounds fishy.

Grateful for any and all advice.

liamrevell commented 3 years ago

If what you're interested in is get the posterior probability that each node is in each state from stochastic mapping, you should do a set of stochastic maps (e.g., 100 or 1,000) and then run the function summary on the "multiSimmap" object.

Here's an example using one of the datasets that comes with phytools:

library(phytools)
data(sunfish.tree)
data(sunfish.data)   
fmode<-setNames(sunfish.data$feeding.mode,rownames(sunfish.data))   
sunfish.maps<-make.simmap(sunfish.tree,fmode,model="ARD",nsim=100)   
sunfish.result<-summary(sunfish.maps)   
plot(sunfish.result,ftype="i",fsize=0.8)   
legend("topleft",c("non-piscivorous","piscivorous"),   
    pch=21,pt.bg=palette()[1:2],pt.cex=2)   
sunfish.result$ace

The last object is the set of posterior probabilities at all the nodes from stochastic mapping. Using the default method, these will eventually converge on the ML marginal likelihoods; however, stochastic mapping also allows you to incorporate uncertainty about the evolutionary process for the discrete state.

E.g.:

sunfishMCMC.maps<-make.simmap(sunfish.tree,fmode,model="ARD",
    Q="mcmc",nsim=200)
sunfishMCMC.result<-summary(sunfishMCMC.maps)
plot(sunfishMCMC.result,ftype="i",fsize=0.8)
legend("topleft",c("non-piscivorous","piscivorous"),
    pch=21,pt.bg=palette()[1:2],pt.cex=2)
sunfishMCMC.result$ace
HedvigS commented 3 years ago

Oh thank you! That's very neat, thanks.

Then, is it correct to think about SCM and marginal ML this way: SCM mainly differs from marginal ML in terms of extracting internal node states in that you also get information about uncertainty.

HedvigS commented 3 years ago

Would it be alright if I just I ask some questions to make sure I understand the values of this new multisimmap object I've learned to make :)?

For thesimmapobject I get when running make.simmap(nsim = 1), I'm making use of Q and loglik, besides the internal states themselves. These values are similar to what I get out of corHMM::corHMM() and what I remember from ape::ace(). For multisimmap, I've got count, timesand ace. ace is what I'll be using for getting the specific internal nodes I'm after, that should be clear to implement I think given how I do it for corHMM.

However, I'm feeling a bit uncertain of how to calculate the summary Q for a multisimmap object.

Forgive me if this is a stupid question.

HedvigS commented 3 years ago

With the help in this issue thread I am now able to extract the ancestral states of internal nodes computed by SCM. The ace object contains both the nodes and tip states, which surprised me a bit but is possible to work with. I'm still unsure of how to grab the uncertainty information, but I'll figure it out hopefully.

HedvigS commented 3 years ago

Oh, maybe this:

ace - the posterior probabilities of each node being in each state, if tree is an object of class "multiSimmap".

in the phytools documentation (page 53) should be changed to:

ace - the posterior probabilities of each node and tips being in each state, if tree is an object of class "multiSimmap".

Unless I've misunderstood what's going on?