rmcminds / stan_models

collection of models written in Stan, with data wrangling code to run the models on real data
1 stars 2 forks source link

implement clade-by-clade variance modifiers for codivergence estimation #6

Closed rmcminds closed 6 years ago

rmcminds commented 6 years ago

A model with clade-by-clade effects (as implemented in my hierarchical model or with multivariate normal and S^-1 as in Hadfield) will test whether there is a general association between clades, but doesn't necessarily test codivergence. For instance, if all bacteria are included in a model, the MRCA of all Endozoicomonas might be significantly associated with the ancestor of all Scleractinians, but the distribution of the various Endo types among Scleractinian species could still be random or uniform. The variance associated with the Hadfield cophylogeny term is a better measure than the effects themselves because it implies that there are multiple clade-by-clade associations that lead to more specificity, not just one such association. However, since it is applied globally to the entire dataset, it doesn't allow us to identify the clades in which such specificity is greater (GCMP Oz paper, for instance, suggests that cophylogeny is actually restricted to one sub-clade of Endozoicomonaceae not the whole family, even though the whole family was tested all at once and there was 'significant' cophylogenetic variance). A more general model might allow not just 'effects' to vary clade-by-clade, but the variance itself. The variance term should be identifiable relative to the 'effect' term because it would be applied to all descendant nodes, allowing some to have 'effects' that are strongly counter to the higher-level ones.

I imagine giving the root variance a positive-bounded prior such as exponential(1), and then giving each node a multiplier defined by a lognormal(0,sigma) prior. The actual variance used for the effect of a given node would be the product of its own multiplier, all the ancestral multipliers, the root variance, and the leading branch length.

Current issue is that my models have been partitioning variance with a grand mean and a dirichlet, but that wouldn't really be compatible (or at least it wouldn't be as interpretable) when combined with the above scheme. Would need to allow each node to have unconstrained variance, rather than lying on the simplex. Or, I suppose I could use a softmax transform of the sum of normal accumulators, and instead of having an exponential at the root just have nothing there at all.

rmcminds commented 6 years ago

implemented fully with commit 92f8819794bac5cf21324b8465ebf77c7d4d7c63