miking-lang / miking-dppl

MIT License
15 stars 14 forks source link

Basic terms in CorePPL. #18

Closed david-broman closed 1 year ago

david-broman commented 3 years ago

I have been thinking, and I am wondering if sample and weight really should be the core terms of CorePPL. If we look at one simulation run, when performing inference using SMC or some variant of MCMC, the sample term might be a good name and intuition, but looking at it from a Bayesian statistical sense, it is all about random variables that are at different stages.

I think it makes sense to distinguish between three kinds of random variables (each associated with a specific distribution).

  1. An Observed Random Variables, orv. These are the random variables that are observed with a fixed value (sometimes called evidence nodes in Bayesian networks).
  2. A Latent Random Variables lrv. These are random variables that we would like to compute the conditional probability for, given certain observed random variables.
  3. A Marginalized Random Variable mrv that we would like to marginalize out.

Using the notation by M. Jorden (see chapter https://people.eecs.berkeley.edu/~jordan/prelims/), F is the set of latent variables, E is the set of observed random variables (evidence nodes), and R = V \ (E \cup F) is the set of variables that should be marginalized out. Note that V is the set of all random variables.

The observer random variables orv corresponds to the observe statement in standard PPLs, whereas the sample statement corresponds to the union of all latent random variables and marginalized random variables. I would like to discuss this, because in standard PPLs, it is only implicitly declared what we in the end would like to output, and what to marginalize out.

To make static delayed sampling correct (and collapsing in e.g. LDA), I think it is important to be clear about this. Also, I noted that in the above, weight is not a primitive, because it is specific to certain approximative algorithms, whereas it is not needed in exact algorithms (when this is possible to use). This is another thing that I would like to discuss.

dlunde commented 3 years ago

I have quite many thoughts about this, which we can best discuss at a meeting. In summary, I strongly believe we should use sample (possibly renamed to assume, or something else more appropriate) and weight unless we have very good reasons not to. This is the formalism used in the leading theoretical PPL papers and calculi, and captures the essence of encoding inference problems in PPLs in a very elegant way.

Maybe we require other formalisms for capturing static delayed sampling and collapsing, but I would then want to see an example which clearly demonstrates the limitations of sample/assume and weight.

Some comments/questions:

in standard PPLs, it is only implicitly declared what we in the end would like to output, and what to marginalize out.

On the contrary, I would say this is very explicit in PPLs. Whatever is returned from our model is what we like to output, and any internal random variables are the ones marginalized out.

Also, I noted that in the above, weight is not a primitive, because it is specific to certain approximative algorithms, whereas it is not needed in exact algorithms (when this is possible to use).

I don't understand this. Why would weight be specific to certain approximative algorithms? It is a general-purpose construct that can be used for likelihood updating, subsuming, for instance, the more inflexible observe.

david-broman commented 3 years ago

Yes, we need to discuss this. We know that we can express observe using weight, but what I am looking at is the core fragment. When we use an approximate inference method, we can for sure include weight, but I do not see why we should limit ourself to that when we design the different language fragments.

What I mean by implicit definitions of output variables are exactly what you say: if you want to know if a certain variable is an output variable, you need to statically analyze the program and figure it out. This can be easy or very hard, depending on the relationship between where the random variable is declared and where it says that it is an output (in our case, the infer statement).

Anyway, let's discuss this tomorrow. It would be good if you could explain how to implement e.g. the sum-product algorithm using only weights, or how to do static analysis of conditional independency using only weights.

We can use this issue to document any insights and rationales for decisions.

david-broman commented 3 years ago

Note also that the basic constructs in delayed sampling (the dynamic version) is using assume, observe, and value as the source language constructs, and not sample and weight. Let's discuss tomorrow what this would mean if only weight was available, and what implications of reconstructing observe means (e.g., if it is possible in all cases).

gizemcaylak commented 3 years ago

I strongly agree on having different kinds of random variables. From my side, this will make the analysis of LDA like models much simpler but I also think that it will make the corePPL more generalisable (but mostly the former reason). If we were to make a hierarchical separation (probably lacking many components):

For both weight and sample, I think these terms more belong to the Monte Carlo methods but if the core of the language will be based on the Monte Carlo, then it makes sense just to include them in the core.

dlunde commented 3 years ago

I think it is dangerous to start mixing graphical model concepts with general-purpose PPLs. Some models (e.g. LDA) can be encoded using graphical models, which allows for many powerful (but narrow) inference techniques over graphical models. But for the models expressed in a core PPL, this is in general not possible. Maybe you are looking for a separate language fragment restricted to graphical models?

It would be good if you could explain how to implement e.g. the sum-product algorithm using only weights, or how to do static analysis of conditional independency using only weights.

The sum-product algorithm only works on polytree graphical models where everything is conjugate (in practice, discrete and normal distributions). Such models can of course be encoded using weight to model the likelihood updating using pdfs/pmfs (aka observe). So, I guess my answer for how to implement it with weight is: the usual algorithm. Maybe the model can be transformed into a simpler graphical model AST before applying inference, if needed.

With static analysis of conditional independences, do you mean applying d-separation to the model? Doing this statically in general for any program is impossible (even if we only use observe) unless your model corresponds to a graphical model, since the dependence relationships can change between runs. Doing it dynamically is possible, like we did with the runtime graph for dynamic delayed sampling.

Note also that the basic constructs in delayed sampling (the dynamic version) is using assume, observe, and value as the source language constructs, and not sample and weight.

We could equally have used sample/assume and weight here instead, but we had not discovered weight yet when working on this. If I would have written this paper today, I would have used sample and weight as the core constructs. value is an inference annotation (like we use resample for SMC), indicating that a delayed variable in the delayed sampling runtime graph must be realized immediately.

For both weight and sample, I think these terms more belong to the Monte Carlo methods but if the core of the language will be based on the Monte Carlo, then it makes sense just to include them in the core.

As I mentioned before, I disagree with this. sample/assume and weight do not correspond to any particular inference algorithm, but are general-purpose modeling constructs. It would be great if you could show an example where the limitations of sample and weight are demonstrated (maybe LDA).

gizemcaylak commented 3 years ago

I think it is dangerous to start mixing graphical model concepts with general-purpose PPLs. Some models (e.g. LDA) can be encoded using graphical models, which allows for many powerful (but narrow) inference techniques over graphical models. But for the models expressed in a core PPL, this is in general not possible. Maybe you are looking for a separate language fragment restricted to graphical models?

Right, but I think in any case having different types of random variables is more general concept and do not depend on an existence of a graphical model.

As I mentioned before, I disagree with this. sample/assume and weight do not correspond to any particular inference algorithm, but are general-purpose modelling constructs. It would be great if you could show an example where the limitations of sample and weight are demonstrated (maybe LDA).

I do not think there would be a limitation including these constructs (thinking this is a design choice); however, I doubt these concepts are widely used in variational inference+MAP or even MCMC. Especially, the term weight is a bit ambiguous between inference algorithms because in SMC weight is associated with particles, however in more general sense it may mean assigning a value to a random variable in a factor. But, I agree that weight is more general than the observe.

david-broman commented 3 years ago

A key design objective of the whole DPPL work is to be able to have specialized algorithms for special model, and then to backtrack to more general approaches if needed. Hence, if it is possible to derive that it is a graphical model from the PPL, we should be able to utilize this (e.g. in the case of LDA). Hence, I think we should design this into several language fragments, where some fragments are needed for certain algorithms.

dlunde commented 3 years ago

Yes, it sounds reasonable to have different language fragments.

@gizemcaylak

Right, but I think in any case having different types of random variables is more general concept and do not depend on an existence of a graphical model.

Yes, maybe it is needed for what you're working on. I'm not quite sure what is meant by "different types" of random variables yet. Do you mean different types at the type level (statically), or different runtime representations?

dlunde commented 3 years ago

I guess we have now settled on the above?

Correct me if I'm wrong, but the decision was:

Should I close this issue?

dlunde commented 3 years ago

Ping @david-broman @gizemcaylak

dlunde commented 1 year ago

Closing this issue now, I believe it is resolved.