revbayes / revbayes

Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language
http://revbayes.com
GNU General Public License v3.0
56 stars 25 forks source link

[Feature Request] Inference under a conditional #492

Open ms609 opened 1 month ago

ms609 commented 1 month ago

The problem

I am developing a model in which I observe data x under the model M(θ) with parameter θ, and know that the data x satisfy condition C.

I wish to perform inference under the conditional, which requires the calculation of

π(x | θ, C) = Pr(x | θ) / Pr({X satisfies the conditional} | θ)

Pr(x | θ) can be computed with

m_morph ~ dnPhyloCTMC(tree = phylogeny, ...)
m_morph.clamp(x1)
m_morph.lnProbability()

Pr({X satisfies C} | θ) is given by

conditional ~ dnPhyloCTMC(tree = phylogeny, ...)
conditional.clamp(x2)
conditional.lnProbability()

Hence π(x | θ, C) is given by

likelihood = m_morph.lnProbability() - conditional.lnProbability()

So far, so good. Now I wish to run an MCMC(MC) using the conditioned likelihood.

mymodel = model(phylogeny)
mymcmc = mcmc(mymodel, ...)
mymcmc.run() # will assume likelihood = m_morph.lnProb() + conditional.lnProb()

The solution

The key function is getModelLnProbability in src/core/analysis/mcmc/Mcmc.cpp (and its counterpart in Mcmcmc.cpp).

double Mcmc::getModelLnProbability(bool likelihood_only)
{
    double pp = 0.0;

    const std::vector<DagNode*> &n = model->getDagNodes();
    for (std::vector<DagNode*>::const_iterator it = n.begin(); it != n.end(); ++it)
    {

        DagNode *the_node = *it;
        if (likelihood_only == false)
        {
            pp += the_node->getLnProbability();
        }
        else if (the_node->isClamped() == true )
        {
            pp += the_node->getLnProbability();
        }

    }

    return pp;
}

The key question here is how to identify a distribution as a conditional. We need to either an additional if statement to replace pp += with pp -= when a node is identified as representing a conditional distribution, or to design a conditional node such that its getLnProbability() method returns a negative value.

Possibilities include:

  1. Specify something in the model() object that identifies nodes as conditional. This seems like a non-starter – it is too late in the workflow, and would require a dramatic change to how the user interacts with that function; specifying an arbitrary node from the model graph would no longer suffice.
  2. Create a counterpart function to dnPhyloCTMC (and, ultimately, all distribution functions) that returns the inverse probability. This feels cumbersome and would introduce significant duplication to the code base.
  3. Add an inverse argument to all distribution functions, with a default value of FALSE, whereby the value returned by the getLnProbability() method is negated if inverse = TRUE.

The latter approach seems the most promising; it does not bloat the number of functions and allows the user to define the probability distribution as representing a conditional at its point of definition. It also avoids any requirement to modify other functions relying on probability calculation (e.g. move proposals; monitors).

Next steps

I've spent a while trying to install RevBayes from source, to no avail (yet).

As my familiarity with the RevBayes code structure, and with C++ more generally, is somewhat limited, it would be great to be able to get support from people better acquainted with RevBayes! Specifically, is the proposed implementation reasonable, or would another approach be better?

I don't know whether the task is trivial enough that someone in the development team might be able to help with the implementation? If not, I've seen (but not yet digested) the documentation on creating new functions, and best practice, but haven't quite got my head around how this fits in to the structure of specific code; an overview of what would be involved would be helpful.

bjoelle commented 1 month ago

can you clarify what you mean by conditional ? if I follow your proposal right, this would only work for conditions that can be expressed as a specific outcome (x2 in your example), is that right ? for the specific question here, it seems that this could be done through a function that would be applied to the distribution ? so you would have conditional ~ fnInverse(dnPhyloCTMC(....)) and then fnInverse would work in a generic way

ms609 commented 1 month ago

Yes, an fnInverse function sounds promising.

A simple example would be calculating the probability of an observation from a normal distribution with mean zero, conditioned on only observing positive outcomes. Unconditioned outcomes are equally likely to be positive or negative; so the likelihood of a specific positive value of x (say 0.42) is twice as high if we condition on x > 0 as it would be without conditioning.

In this simple case, we'd see

xObs <- 0.42
normal ~ dnNormal(mean = 0, sd = 1)
normal.clamp(xObs)

pXPositive = 0.5 # Because we can ignore the 50% of draws from dnNormal that are not positive

likelihood = normal.probability() / pXPositive

In this case our condition is constant; perhaps more analogous to my example would be a case where the condition depends on a variable: perhaps we have a second variable xMin that provides a minimum observable value for xObs, so x is known to be greater than xMin. In this case, our pXPositive could be replaced by p(x > xMin).

bjoelle commented 1 month ago

This seems to be a slightly use case than what you presented earlier ? you are not clamping the condition to anything here In this situation the easiest may be to reject outcomes which do not fulfill the condition ? this is how we deal with fossil age uncertainty for instance, by adding another distribution which will give likelihood=0 if the sampled fossil age is outside the range

ms609 commented 1 month ago

Yes, that's a somewhat simplified example; the actual use case is somewhat more complicated to explain, and I'm finding it difficult to construe an artificial example that has all the features of my real use case, without disclosing details that aren't mine to put in the public domain.

I think the initial example captures the key features: the model likelihood is obtained by dividing the probability of observing morphological characters x1 under dnPhyloCTMC1 by the probability of observing 'control' characters x2 under dnPhyloCTMC2. Perhaps I should articulate that whilst the characters x1 and x2 do not change, the rate matrices passed to dnPhyloCTMC1 dnPhyloCTMC2 rely on a separate parameter under inference.

I hope this isn't too obscure to capture what I'm trying to do – sorry not to be able to give a complete picture of the problem!

bjoelle commented 1 month ago

no problem, just trying to get a clear picture of what you're trying to achieve :) in that case I think nesting dnPhyloCTMC2 in an inverse function as I proposed above is probably the simplest/most portable and it would allow this to be done with any other distribution if needed - we do this type of structure already to add topology constraints to any tree priors for instance

ms609 commented 1 month ago

Great, thanks! I will work through the implementation docs and see how I get on; are there any existing functions that spring to mind as an obvious template I could use?

bjoelle commented 1 month ago

I think PR #466 is the closest example to what you want to do

bredelings commented 1 month ago

So, specifically, we might want to write something like

x ~ dnNormal(0,1) | x > 0

In general, we might have something like

x ~ distribution | condition

The probability would then be

Pr(x and condition)/Pr(condition) = Pr(x) * Pr(condition|x) / Pr(condition).

If condition is a boolean function of x, such as x > 0, then Pr(condition|x) reduces to 1 if condition(x) else 0.

The tricky part is Pr(condition(x)). This depends on the distribution. That is Pr(x > 0) equals 1/2 under Normal(0,1), but equals 1 under Gamma(2,2). So, whether or not we can handle a particular condition depends on whether the distribution "understands" that condition.

Previously, the only case we have handled conditions is INSIDE the phyloctmc function. That is, we have some conditions on whether specific characters are variable. And maybe some more complicated conditions for fossil morphology that @wrightaprilm might know about?

bredelings commented 1 month ago

Also, @ms609, you might want to have a look at src/core/distributions/phylogenetics/substitution/PhyloCTMCSiteHomogeneousConditional.h

bredelings commented 1 month ago

And src/revlanguage/distributions/phylogenetics/character/Dist_phyloCTMC.cpp. The coding option to phyloCTMC specifies ascertainment conditions. For binary options or restriction enzymes, the options are "noabsencesites", "nopresencesites", "informative", "variable", "nosingletonpresence", "nosingletonabsence", "nosingletons", and "all". For other alphabets, the options are just "informative", "variable", and "all".

ms609 commented 1 month ago

I've made some headway here. I have implemented dnInverse such that

x ~ dnExp(lambda = 1)
x.clamp(42)
x.lnProbability() # = -42

# Now calculate the inverse
invX ~ dnInverse(dnExp(lambda = 1))
# Clamp the value of the draw from the exponential distribution
invX.clamp(42)
invX.lnProbability() # = 42

However, when I attempt dnInverse(dnPhyloCTMC(...)), I hit an issue. I was expecting dnPhyloCTMC to return an object of class Distribution__AbstractHomologousDiscreteCharacterData, per its documented domain type, analogous to dnExp returning Distribution__RealPos. Instead I see:

   Error:       No overloaded function 'dnInverse' matches for arguments ( Dist_phyloCTMC )
   Potentially matching functions are:
   dnInverse (Distribution__Integer<any> distribution)
   dnInverse (Distribution__Natural<any> distribution)
   dnInverse (Distribution__Probability<any> distribution)
   dnInverse (Distribution__Real<any> distribution)
   dnInverse (Distribution__RealPos<any> distribution)
[...]
   dnInverse (Distribution__AbstractHomologousDiscreteCharacterData<any> distribution)

What function do I need to implement in order to pass a phyloCTMC object to dnInverse? Distribution__AbstractHomologousDiscreteCharacterData[], perhaps?

Sorry if I'm not explaining myself very clearly, as a non-C++ native, I've faced a bit of a learning curve to get my head around the code. If it helps, you can see my code here.

bjoelle commented 1 month ago

type(x) should give you the type of object x I believe ? also you're using addDistribution instead of AddDistribution in RbRegister_Dist.cpp for the phylo-specific function, is it possible that these are not doing the same thing ?

ms609 commented 1 month ago

Thanks! I see:

type(dnExp(lambda = 1))
# > Dist_exponential
x ~ dnExp(lambda = 1)
type(x)
# > RealPos

type(dnInverse(dnExp(lambda = 1)))
# > Dist_Inverse
invX ~ dnInverse(dnExp(lambda = 1))
type(invX)
# > RealPos

type(dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard"))
# > Dist_phyloCTMC

myPhyloCTMC ~ dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard")
type(myPhyloCTMC) 
# > AbstractHomologousDiscreteCharacterData

dnInverse(myPhyloCTMC)
#    Error:       No overloaded function 'dnInverse' matches for arguments ( AbstractHomologousDiscreteCharacterData )
#   Potentially matching functions are:
#   dnInverse (Distribution__Integer<any> distribution)
# [...]
# dnInverse (Distribution__AbstractHomologousDiscreteCharacterData<any> distribution)

This makes me wonder:

  1. is Distribution__AbstractHomologousDiscreteCharacterData equivalent to AbstractHomologousDiscreteCharacterData, or does it denote a different object? (Is there perhaps some distinction between RevBayesCore::AbstractHomologousDiscreteCharacterData and RevLanguage:AbstractHomologousDiscreteCharacterData?)
  2. How does the meaning and behaviour of dnInverse(dnPhyloCTMC(...)) differ from x ~ dnPhyloCTMC(...); dnInverse(x); and which is preferable?
  3. Am I failing to override a function in InversePhyloDistribution or Dist_InversePhylo that's stopping RevBayes recognizing the types of argument that can be handled?

addDistribution vs AddDistribution is a good shout, I couldn't get AddDistribution to work but will take another look.

bredelings commented 1 month ago

Note that with distributions "=" and "~" result in different types:

d = dnNormal(0,1)
x ~ dnNormal(0,1)
type(d)
type(x)
bredelings commented 1 month ago
        // AddDistribution< ModelVector<AbstractHomologousDiscreteCharacterData>  >( new Dist_Inverse< ModelVector<AbstractHomologousDiscreteCharacterData> >());

Possibly this line would work if you remove the ModelVector< >?

ms609 commented 1 month ago
AddDistribution< AbstractHomologousDiscreteCharacterData >( new Dist_InversePhylo() );

requires a couple of tweaks to the underlying class, but gives the same outcome.

AddDistribution< AbstractHomologousDiscreteCharacterData  >( new Dist_Inverse<AbstractHomologousDiscreteCharacterData >());

gives

 error: invalid new-expression of abstract class type 'RevBayesCore::AbstractHomologousDiscreteCharacterData'
[build]    24 |             : TypedDistribution<valType>( new valType() ),
[...]
because the following virtual functions are pure within 'RevBayesCore::AbstractHomologousDiscreteCharacterData':
 class AbstractHomologousDiscreteCharacterData : public HomologousCharacterData {
ms609 commented 1 month ago

Note that with distributions "=" and "~" result in different types:

d = dnNormal(0,1)
x ~ dnNormal(0,1)
type(d)
type(x)

Thanks, this is clarifying – so then dnInverse(dnNormal(0, 1)) is equivalent to dnInverse(d), rather than dnInverse(x). Which would be the more natural configuration for what I'm doing here? Neither quite seem to be a good fit to the syntax.

bredelings commented 1 month ago

I added a few tweaks and pushed the result to the main repo using the same branch name.

The following now works:

chars <- readDiscreteCharacterData("/home/bredelings/Work/5d-muscle.fasta")
taxa = chars.taxa()
tree ~ dnUniformTopologyBranchLength( taxa, branchLengthDistribution=dnExp(10.0) )
q_matrix <- fnJC(4)
x ~ dnInverse(dnPhyloCTMC(tree = tree, Q = q_matrix))
bredelings commented 1 month ago

Would you like to be added to the main repo?

ms609 commented 1 month ago

This sounds amazing, thank you! It sounds like it could be helpful to be added to the main repo – I'm finding it a bit fiddly to pull your changes through to my fork. Thank you.

bredelings commented 1 month ago

I think I've send you an invitation now.

ms609 commented 1 month ago

Thanks! I've accepted the invitation. And, excitingly, the dnInverse(dnPhyloCTMC()) is running for me too! I'll look forwards to exploring it in action tomorrow. Thanks for your help with this.

ms609 commented 1 month ago

I've had a bit more of a play with this now, and it's looking promising! Inference isn't yet behaving as I'd like it to (with some very low acceptance probabilities), but I suspect that this is an issue with the model / mcmc setup rather than the function.

One thing I've noticed is that print is not seeing "inside" the dnInverse function. Presumably I need to override a function to pass these parameters from base_distribution rather from the outer distribution, in which they are unset. Pointers to the relevant function would be welcome. I presume that this is only affecting the print function – but perhaps it has other consequences?

m_morph[2] ~ dnInverse( dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard"))
m_morph[2].clamp(condChars)
print(m_morph[2])
 character matrix with 16 taxa and 0 characters
===============================================
Origination:                   ""
Number of taxa:                16
Number of included taxa:       16
Number of characters:          0
Number of included characters: 0
Datatype:
m_morph[2] ~ dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard")
m_morph[2].clamp(condChars)
print(m_morph[2])

   Standard character matrix with 16 taxa and 474 characters
   =========================================================
   Origination:                   "k5obs4-cond.nex"
   Number of taxa:                16
   Number of included taxa:       16
   Number of characters:          474
   Number of included characters: 474
   Datatype:                      Standard
bjoelle commented 1 month ago

m_morph[2] is a draw from the distribution (from the operator ~) so it's of type CharacterData so it's normal that you're seeing the same thing regardless of what distribution is used

bjoelle commented 1 month ago

for printing the distribution itself I guess Dist_phyloCTMC::printValue may be what you're looking for ?

ms609 commented 1 month ago

Ah, but I'm not quite seeing the same thing: when passed through the inverse, the print function does not see the properties of the distribution, e.g. Number of characters: 0 (not 474); Origination: "" (not filename.nex).

Yes, the printValue method sounds like an obvious place to look, doesn't it! I'd been thinking of the problem from a different direction, so had missed that. Thanks.

bjoelle commented 1 month ago

that message appears to come directly from operator<< in AbstractCharacterData and it takes the number of characters, etc from the object so it's possible that dnInverse is not setting these properties correctly when constructing the object ?

ms609 commented 1 month ago

On closer inspection, I see that printValue constructs a parenthetical statement that I suspect is used in e.g. checkpoint files, whereas operator<< produces the summary table printed to screen.

printValue and operator<< each call functions such as x.getNumberOfTaxa(), which return the value of taxa.size() on the underlying AbstractCharacterData object. However, I don't see a way to access the raw taxa member variable when constructing an InverseDistribution, and I'm a little confused by the relationship between TypedDistribution and AbstractCharacterData – I can't quite see where the constructor should be setting these properties, or how. In fact, I'm not quite clear which constructor is responsible for this – I haven't quite got my head around the relationship between Dist_Inverse and InverseDistribution.