Open ms609 opened 1 month ago
can you clarify what you mean by conditional
? if I follow your proposal right, this would only work for conditions that can be expressed as a specific outcome (x2
in your example), is that right ?
for the specific question here, it seems that this could be done through a function that would be applied to the distribution ? so you would have conditional ~ fnInverse(dnPhyloCTMC(....))
and then fnInverse
would work in a generic way
Yes, an fnInverse
function sounds promising.
A simple example would be calculating the probability of an observation from a normal distribution with mean zero, conditioned on only observing positive outcomes. Unconditioned outcomes are equally likely to be positive or negative; so the likelihood of a specific positive value of x (say 0.42) is twice as high if we condition on x > 0 as it would be without conditioning.
In this simple case, we'd see
xObs <- 0.42
normal ~ dnNormal(mean = 0, sd = 1)
normal.clamp(xObs)
pXPositive = 0.5 # Because we can ignore the 50% of draws from dnNormal that are not positive
likelihood = normal.probability() / pXPositive
In this case our condition is constant; perhaps more analogous to my example would be a case where the condition depends on a variable: perhaps we have a second variable xMin
that provides a minimum observable value for xObs
, so x
is known to be greater than xMin
. In this case, our pXPositive could be replaced by p(x > xMin)
.
This seems to be a slightly use case than what you presented earlier ? you are not clamping the condition to anything here
In this situation the easiest may be to reject outcomes which do not fulfill the condition ? this is how we deal with fossil age uncertainty for instance, by adding another distribution which will give likelihood=0
if the sampled fossil age is outside the range
Yes, that's a somewhat simplified example; the actual use case is somewhat more complicated to explain, and I'm finding it difficult to construe an artificial example that has all the features of my real use case, without disclosing details that aren't mine to put in the public domain.
I think the initial example captures the key features: the model likelihood is obtained by dividing the probability of observing morphological characters x1
under dnPhyloCTMC1 by the probability of observing 'control' characters x2
under dnPhyloCTMC2. Perhaps I should articulate that whilst the characters x1
and x2
do not change, the rate matrices passed to dnPhyloCTMC1 dnPhyloCTMC2 rely on a separate parameter under inference.
I hope this isn't too obscure to capture what I'm trying to do – sorry not to be able to give a complete picture of the problem!
no problem, just trying to get a clear picture of what you're trying to achieve :)
in that case I think nesting dnPhyloCTMC2
in an inverse function as I proposed above is probably the simplest/most portable and it would allow this to be done with any other distribution if needed - we do this type of structure already to add topology constraints to any tree priors for instance
Great, thanks! I will work through the implementation docs and see how I get on; are there any existing functions that spring to mind as an obvious template I could use?
I think PR #466 is the closest example to what you want to do
So, specifically, we might want to write something like
x ~ dnNormal(0,1) | x > 0
In general, we might have something like
x ~ distribution | condition
The probability would then be
Pr(x and condition)/Pr(condition) = Pr(x) * Pr(condition|x) / Pr(condition).
If condition is a boolean function of x, such as x > 0
, then Pr(condition|x)
reduces to 1 if condition(x) else 0
.
The tricky part is Pr(condition(x))
. This depends on the distribution. That is Pr(x > 0)
equals 1/2
under Normal(0,1)
, but equals 1
under Gamma(2,2)
. So, whether or not we can handle a particular condition depends on whether the distribution "understands" that condition.
Previously, the only case we have handled conditions is INSIDE the phyloctmc function. That is, we have some conditions on whether specific characters are variable. And maybe some more complicated conditions for fossil morphology that @wrightaprilm might know about?
Also, @ms609, you might want to have a look at src/core/distributions/phylogenetics/substitution/PhyloCTMCSiteHomogeneousConditional.h
And src/revlanguage/distributions/phylogenetics/character/Dist_phyloCTMC.cpp
. The coding
option to phyloCTMC
specifies ascertainment conditions. For binary options or restriction enzymes, the options are "noabsencesites", "nopresencesites", "informative", "variable", "nosingletonpresence", "nosingletonabsence", "nosingletons", and "all". For other alphabets, the options are just "informative", "variable", and "all".
I've made some headway here. I have implemented dnInverse
such that
x ~ dnExp(lambda = 1)
x.clamp(42)
x.lnProbability() # = -42
# Now calculate the inverse
invX ~ dnInverse(dnExp(lambda = 1))
# Clamp the value of the draw from the exponential distribution
invX.clamp(42)
invX.lnProbability() # = 42
However, when I attempt dnInverse(dnPhyloCTMC(...))
, I hit an issue. I was expecting dnPhyloCTMC
to return an object of class Distribution__AbstractHomologousDiscreteCharacterData
, per its documented domain type, analogous to dnExp
returning Distribution__RealPos
. Instead I see:
Error: No overloaded function 'dnInverse' matches for arguments ( Dist_phyloCTMC )
Potentially matching functions are:
dnInverse (Distribution__Integer<any> distribution)
dnInverse (Distribution__Natural<any> distribution)
dnInverse (Distribution__Probability<any> distribution)
dnInverse (Distribution__Real<any> distribution)
dnInverse (Distribution__RealPos<any> distribution)
[...]
dnInverse (Distribution__AbstractHomologousDiscreteCharacterData<any> distribution)
What function do I need to implement in order to pass a phyloCTMC object to dnInverse? Distribution__AbstractHomologousDiscreteCharacterData[]
, perhaps?
Sorry if I'm not explaining myself very clearly, as a non-C++ native, I've faced a bit of a learning curve to get my head around the code. If it helps, you can see my code here.
type(x)
should give you the type of object x
I believe ?
also you're using addDistribution
instead of AddDistribution
in RbRegister_Dist.cpp
for the phylo-specific function, is it possible that these are not doing the same thing ?
Thanks! I see:
type(dnExp(lambda = 1))
# > Dist_exponential
x ~ dnExp(lambda = 1)
type(x)
# > RealPos
type(dnInverse(dnExp(lambda = 1)))
# > Dist_Inverse
invX ~ dnInverse(dnExp(lambda = 1))
type(invX)
# > RealPos
type(dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard"))
# > Dist_phyloCTMC
myPhyloCTMC ~ dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard")
type(myPhyloCTMC)
# > AbstractHomologousDiscreteCharacterData
dnInverse(myPhyloCTMC)
# Error: No overloaded function 'dnInverse' matches for arguments ( AbstractHomologousDiscreteCharacterData )
# Potentially matching functions are:
# dnInverse (Distribution__Integer<any> distribution)
# [...]
# dnInverse (Distribution__AbstractHomologousDiscreteCharacterData<any> distribution)
This makes me wonder:
RevBayesCore::AbstractHomologousDiscreteCharacterData
and RevLanguage:AbstractHomologousDiscreteCharacterData
?)dnInverse(dnPhyloCTMC(...))
differ from x ~ dnPhyloCTMC(...); dnInverse(x)
; and which is preferable?addDistribution
vs AddDistribution
is a good shout, I couldn't get AddDistribution to work but will take another look.
Note that with distributions "=" and "~" result in different types:
d = dnNormal(0,1)
x ~ dnNormal(0,1)
type(d)
type(x)
// AddDistribution< ModelVector<AbstractHomologousDiscreteCharacterData> >( new Dist_Inverse< ModelVector<AbstractHomologousDiscreteCharacterData> >());
Possibly this line would work if you remove the ModelVector< >
?
AddDistribution< AbstractHomologousDiscreteCharacterData >( new Dist_InversePhylo() );
requires a couple of tweaks to the underlying class, but gives the same outcome.
AddDistribution< AbstractHomologousDiscreteCharacterData >( new Dist_Inverse<AbstractHomologousDiscreteCharacterData >());
gives
error: invalid new-expression of abstract class type 'RevBayesCore::AbstractHomologousDiscreteCharacterData'
[build] 24 | : TypedDistribution<valType>( new valType() ),
[...]
because the following virtual functions are pure within 'RevBayesCore::AbstractHomologousDiscreteCharacterData':
class AbstractHomologousDiscreteCharacterData : public HomologousCharacterData {
Note that with distributions "=" and "~" result in different types:
d = dnNormal(0,1) x ~ dnNormal(0,1) type(d) type(x)
Thanks, this is clarifying – so then dnInverse(dnNormal(0, 1))
is equivalent to dnInverse(d)
, rather than dnInverse(x)
. Which would be the more natural configuration for what I'm doing here? Neither quite seem to be a good fit to the syntax.
I added a few tweaks and pushed the result to the main repo using the same branch name.
type( )
so that type(x,TRUE)
shows you the typespec. This would have shown what was wrong with dnPhyloCTMCThe following now works:
chars <- readDiscreteCharacterData("/home/bredelings/Work/5d-muscle.fasta")
taxa = chars.taxa()
tree ~ dnUniformTopologyBranchLength( taxa, branchLengthDistribution=dnExp(10.0) )
q_matrix <- fnJC(4)
x ~ dnInverse(dnPhyloCTMC(tree = tree, Q = q_matrix))
Would you like to be added to the main repo?
This sounds amazing, thank you! It sounds like it could be helpful to be added to the main repo – I'm finding it a bit fiddly to pull your changes through to my fork. Thank you.
I think I've send you an invitation now.
Thanks! I've accepted the invitation. And, excitingly, the dnInverse(dnPhyloCTMC()) is running for me too! I'll look forwards to exploring it in action tomorrow. Thanks for your help with this.
I've had a bit more of a play with this now, and it's looking promising! Inference isn't yet behaving as I'd like it to (with some very low acceptance probabilities), but I suspect that this is an issue with the model / mcmc setup rather than the function.
One thing I've noticed is that print
is not seeing "inside" the dnInverse
function. Presumably I need to override a function to pass these parameters from base_distribution
rather from the outer distribution, in which they are unset. Pointers to the relevant function would be welcome. I presume that this is only affecting the print function – but perhaps it has other consequences?
m_morph[2] ~ dnInverse( dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard"))
m_morph[2].clamp(condChars)
print(m_morph[2])
character matrix with 16 taxa and 0 characters
===============================================
Origination: ""
Number of taxa: 16
Number of included taxa: 16
Number of characters: 0
Number of included characters: 0
Datatype:
m_morph[2] ~ dnPhyloCTMC(tree = phylogeny, Q = Q4cond, type = "Standard")
m_morph[2].clamp(condChars)
print(m_morph[2])
Standard character matrix with 16 taxa and 474 characters
=========================================================
Origination: "k5obs4-cond.nex"
Number of taxa: 16
Number of included taxa: 16
Number of characters: 474
Number of included characters: 474
Datatype: Standard
m_morph[2]
is a draw from the distribution (from the operator ~
) so it's of type CharacterData
so it's normal that you're seeing the same thing regardless of what distribution is used
for printing the distribution itself I guess Dist_phyloCTMC::printValue
may be what you're looking for ?
Ah, but I'm not quite seeing the same thing: when passed through the inverse, the print function does not see the properties of the distribution, e.g. Number of characters: 0
(not 474
); Origination: ""
(not filename.nex
).
Yes, the printValue
method sounds like an obvious place to look, doesn't it! I'd been thinking of the problem from a different direction, so had missed that. Thanks.
that message appears to come directly from operator<<
in AbstractCharacterData
and it takes the number of characters, etc from the object so it's possible that dnInverse
is not setting these properties correctly when constructing the object ?
On closer inspection, I see that printValue
constructs a parenthetical statement that I suspect is used in e.g. checkpoint files, whereas operator<<
produces the summary table printed to screen.
printValue
and operator<<
each call functions such as x.getNumberOfTaxa()
, which return the value of taxa.size()
on the underlying AbstractCharacterData
object. However, I don't see a way to access the raw taxa
member variable when constructing an InverseDistribution, and I'm a little confused by the relationship between TypedDistribution and AbstractCharacterData – I can't quite see where the constructor should be setting these properties, or how. In fact, I'm not quite clear which constructor is responsible for this – I haven't quite got my head around the relationship between Dist_Inverse
and InverseDistribution
.
The problem
I am developing a model in which I observe data x under the model M(θ) with parameter θ, and know that the data x satisfy condition C.
I wish to perform inference under the conditional, which requires the calculation of
π(x | θ, C) = Pr(x | θ) / Pr({X satisfies the conditional} | θ)
Pr(x | θ) can be computed with
Pr({X satisfies C} | θ) is given by
Hence π(x | θ, C) is given by
So far, so good. Now I wish to run an MCMC(MC) using the conditioned likelihood.
The solution
The key function is
getModelLnProbability
insrc/core/analysis/mcmc/Mcmc.cpp
(and its counterpart inMcmcmc.cpp
).The key question here is how to identify a distribution as a conditional. We need to either an additional
if
statement to replacepp +=
withpp -=
when a node is identified as representing a conditional distribution, or to design a conditional node such that itsgetLnProbability()
method returns a negative value.Possibilities include:
model()
object that identifies nodes as conditional. This seems like a non-starter – it is too late in the workflow, and would require a dramatic change to how the user interacts with that function; specifying an arbitrary node from the model graph would no longer suffice.dnPhyloCTMC
(and, ultimately, all distribution functions) that returns the inverse probability. This feels cumbersome and would introduce significant duplication to the code base.inverse
argument to all distribution functions, with a default value ofFALSE
, whereby the value returned by thegetLnProbability()
method is negated ifinverse = TRUE
.The latter approach seems the most promising; it does not bloat the number of functions and allows the user to define the probability distribution as representing a conditional at its point of definition. It also avoids any requirement to modify other functions relying on probability calculation (e.g. move proposals; monitors).
Next steps
I've spent a while trying to install RevBayes from source, to no avail (yet).
As my familiarity with the RevBayes code structure, and with C++ more generally, is somewhat limited, it would be great to be able to get support from people better acquainted with RevBayes! Specifically, is the proposed implementation reasonable, or would another approach be better?
I don't know whether the task is trivial enough that someone in the development team might be able to help with the implementation? If not, I've seen (but not yet digested) the documentation on creating new functions, and best practice, but haven't quite got my head around how this fits in to the structure of specific code; an overview of what would be involved would be helpful.