Closed david-broman closed 1 year ago
I have quite many thoughts about this, which we can best discuss at a meeting. In summary, I strongly believe we should use sample
(possibly renamed to assume
, or something else more appropriate) and weight
unless we have very good reasons not to. This is the formalism used in the leading theoretical PPL papers and calculi, and captures the essence of encoding inference problems in PPLs in a very elegant way.
Maybe we require other formalisms for capturing static delayed sampling and collapsing, but I would then want to see an example which clearly demonstrates the limitations of sample
/assume
and weight
.
Some comments/questions:
in standard PPLs, it is only implicitly declared what we in the end would like to output, and what to marginalize out.
On the contrary, I would say this is very explicit in PPLs. Whatever is returned from our model is what we like to output, and any internal random variables are the ones marginalized out.
Also, I noted that in the above,
weight
is not a primitive, because it is specific to certain approximative algorithms, whereas it is not needed in exact algorithms (when this is possible to use).
I don't understand this. Why would weight be specific to certain approximative algorithms? It is a general-purpose construct that can be used for likelihood updating, subsuming, for instance, the more inflexible observe
.
Yes, we need to discuss this. We know that we can express observe
using weight
, but what I am looking at is the core fragment. When we use an approximate inference method, we can for sure include weight
, but I do not see why we should limit ourself to that when we design the different language fragments.
What I mean by implicit definitions of output variables are exactly what you say: if you want to know if a certain variable is an output variable, you need to statically analyze the program and figure it out. This can be easy or very hard, depending on the relationship between where the random variable is declared and where it says that it is an output (in our case, the infer
statement).
Anyway, let's discuss this tomorrow. It would be good if you could explain how to implement e.g. the sum-product algorithm using only weights, or how to do static analysis of conditional independency using only weights.
We can use this issue to document any insights and rationales for decisions.
Note also that the basic constructs in delayed sampling (the dynamic version) is using assume
, observe
, and value
as the source language constructs, and not sample
and weight
. Let's discuss tomorrow what this would mean if only weight
was available, and what implications of reconstructing observe
means (e.g., if it is possible in all cases).
I strongly agree on having different kinds of random variables. From my side, this will make the analysis of LDA like models much simpler but I also think that it will make the corePPL more generalisable (but mostly the former reason). If we were to make a hierarchical separation (probably lacking many components):
assume
?)value
)assume
)For both weight
and sample
, I think these terms more belong to the Monte Carlo methods but if the core of the language will be based on the Monte Carlo, then it makes sense just to include them in the core.
I think it is dangerous to start mixing graphical model concepts with general-purpose PPLs. Some models (e.g. LDA) can be encoded using graphical models, which allows for many powerful (but narrow) inference techniques over graphical models. But for the models expressed in a core PPL, this is in general not possible. Maybe you are looking for a separate language fragment restricted to graphical models?
It would be good if you could explain how to implement e.g. the sum-product algorithm using only weights, or how to do static analysis of conditional independency using only weights.
The sum-product algorithm only works on polytree graphical models where everything is conjugate (in practice, discrete and normal distributions). Such models can of course be encoded using weight
to model the likelihood updating using pdfs/pmfs (aka observe
). So, I guess my answer for how to implement it with weight
is: the usual algorithm. Maybe the model can be transformed into a simpler graphical model AST before applying inference, if needed.
With static analysis of conditional independences, do you mean applying d-separation to the model? Doing this statically in general for any program is impossible (even if we only use observe
) unless your model corresponds to a graphical model, since the dependence relationships can change between runs. Doing it dynamically is possible, like we did with the runtime graph for dynamic delayed sampling.
Note also that the basic constructs in delayed sampling (the dynamic version) is using
assume
,observe
, andvalue
as the source language constructs, and not sample and weight.
We could equally have used sample
/assume
and weight
here instead, but we had not discovered weight
yet when working on this. If I would have written this paper today, I would have used sample
and weight
as the core constructs. value
is an inference annotation (like we use resample
for SMC), indicating that a delayed variable in the delayed sampling runtime graph must be realized immediately.
For both
weight
andsample
, I think these terms more belong to the Monte Carlo methods but if the core of the language will be based on the Monte Carlo, then it makes sense just to include them in the core.
As I mentioned before, I disagree with this. sample
/assume
and weight
do not correspond to any particular inference algorithm, but are general-purpose modeling constructs. It would be great if you could show an example where the limitations of sample
and weight
are demonstrated (maybe LDA).
I think it is dangerous to start mixing graphical model concepts with general-purpose PPLs. Some models (e.g. LDA) can be encoded using graphical models, which allows for many powerful (but narrow) inference techniques over graphical models. But for the models expressed in a core PPL, this is in general not possible. Maybe you are looking for a separate language fragment restricted to graphical models?
Right, but I think in any case having different types of random variables is more general concept and do not depend on an existence of a graphical model.
As I mentioned before, I disagree with this. sample/assume and weight do not correspond to any particular inference algorithm, but are general-purpose modelling constructs. It would be great if you could show an example where the limitations of sample and weight are demonstrated (maybe LDA).
I do not think there would be a limitation including these constructs (thinking this is a design choice); however, I doubt these concepts are widely used in variational inference+MAP or even MCMC. Especially, the term weight
is a bit ambiguous between inference algorithms because in SMC weight
is associated with particles, however in more general sense it may mean assigning a value to a random variable in a factor. But, I agree that weight
is more general than the observe
.
A key design objective of the whole DPPL work is to be able to have specialized algorithms for special model, and then to backtrack to more general approaches if needed. Hence, if it is possible to derive that it is a graphical model from the PPL, we should be able to utilize this (e.g. in the case of LDA). Hence, I think we should design this into several language fragments, where some fragments are needed for certain algorithms.
Yes, it sounds reasonable to have different language fragments.
@gizemcaylak
Right, but I think in any case having different types of random variables is more general concept and do not depend on an existence of a graphical model.
Yes, maybe it is needed for what you're working on. I'm not quite sure what is meant by "different types" of random variables yet. Do you mean different types at the type level (statically), or different runtime representations?
I guess we have now settled on the above?
Correct me if I'm wrong, but the decision was:
assume
as the fundamental randomness construct. I still prefer the name sample
for consistency with existing PPL literature, but I believe both @david and @gizem prefer assume
.weight
, observe
, condition
, etc., so that any combination can be used when building a PPL in Miking.Should I close this issue?
Ping @david-broman @gizemcaylak
Closing this issue now, I believe it is resolved.
I have been thinking, and I am wondering if
sample
andweight
really should be the core terms of CorePPL. If we look at one simulation run, when performing inference using SMC or some variant of MCMC, thesample
term might be a good name and intuition, but looking at it from a Bayesian statistical sense, it is all about random variables that are at different stages.I think it makes sense to distinguish between three kinds of random variables (each associated with a specific distribution).
orv
. These are the random variables that are observed with a fixed value (sometimes called evidence nodes in Bayesian networks).lrv
. These are random variables that we would like to compute the conditional probability for, given certain observed random variables.mrv
that we would like to marginalize out.Using the notation by M. Jorden (see chapter https://people.eecs.berkeley.edu/~jordan/prelims/), F is the set of latent variables,
E
is the set of observed random variables (evidence nodes), andR = V \ (E \cup F)
is the set of variables that should be marginalized out. Note thatV
is the set of all random variables.The observer random variables
orv
corresponds to theobserve
statement in standard PPLs, whereas thesample
statement corresponds to the union of all latent random variables and marginalized random variables. I would like to discuss this, because in standard PPLs, it is only implicitly declared what we in the end would like to output, and what to marginalize out.To make static delayed sampling correct (and collapsing in e.g. LDA), I think it is important to be clear about this. Also, I noted that in the above,
weight
is not a primitive, because it is specific to certain approximative algorithms, whereas it is not needed in exact algorithms (when this is possible to use). This is another thing that I would like to discuss.