probcomp / Gen.jl

A general-purpose probabilistic programming system with programmable inference
https://gen.dev

Weighting Different Components of a Trace's Score? #268

Open · collinskatie opened this issue 4 years ago

collinskatie commented 4 years ago

Hi, is there a way to weight the contributions of different components of a trace differently in the score?

Specifically, I am sampling the number of (unseen) blocks, n, from a distribution parameterized by a bias towards sparse solutions. Based on these sampled blocks, the physics of the world is run forward to produce a trajectory (for which we have observations). However, the hit to the score from adding an unneeded block appears to be very small compared to anything that changes the trajectory, since there are hundreds of individual trajectory points that are each modeled as samples from a multivariate (x, y) distribution. This seems to hamper the removal of unnecessary blocks during inference, and it doesn't match the intuition of what the score should capture for our particular generative model. Is there a way in Gen to more strongly "weight" the block-sampling choice (n) relative to the trajectory, without using a very extreme sparsity parameter (like 0.001) to strongly penalize adding extra blocks?
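
For concreteness, here is a minimal sketch of the shape of our model (simplified, with placeholder names and a stand-in for the physics engine):

```julia
using Gen

# Stand-in for the real physics rollout (hypothetical; returns one 2D
# trajectory point per timestep).
simulate_physics(blocks, T) = [Float64[t, float(length(blocks))] for t in 1:T]

@gen function scene_model(T::Int, sparsity::Float64)
    # Sparsity-biased prior over the number of unseen blocks.
    n ~ geometric(sparsity)

    # Continuous latents for each block.
    blocks = Vector{Tuple{Float64,Float64}}(undef, n)
    for i in 1:n
        x = {(:block_x, i)} ~ uniform(0.0, 10.0)
        y = {(:block_y, i)} ~ uniform(0.0, 10.0)
        blocks[i] = (x, y)
    end

    # Hundreds of per-timestep observations: each contributes its own
    # log-density term to the score, so collectively they dwarf the single
    # log p(n) term from the sparsity prior.
    traj = simulate_physics(blocks, T)
    for t in 1:T
        {(:obs, t)} ~ mvnormal(traj[t], [0.01 0.0; 0.0 0.01])
    end
end
```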

Thanks for any help!!

bzinberg commented 4 years ago

Hi @collinskatie - good question, here are a few thoughts:

The meaning of the score is a log-probability-density, so there are a couple of broader questions that seem to underlie what you're encountering:

  1. What are the correct intuitions, from a Bayesian standpoint, about what it means to compare log-probability-densities?
  2. How can inference be done over a transdimensional hypothesis space (i.e., one where different regions of the hypothesis space have different numbers of continuous parameters, depending on the value of n)?

For (1), i.m.o. it's pretty tricky. If we try to compare the scores directly, there's an unreasonable dependence of one of the densities on a likelihood term for a block that doesn't even exist in the other hypothesis. The fact that the comparison is not invariant under, e.g., a change of units in the continuous block coordinates (which rescales the two densities by different Jacobian factors, since the two hypotheses have different dimension) is a red flag that direct comparison of the scores is somewhat ad hoc, or at least not strictly Bayesian. There are meaningful relations between the scores, but they are more complicated. For example, it makes sense to compare the marginal probability of two different values of n: the two integrals

Integral of exp(score) over all possible latent scene configurations with 1 block

and

Integral of exp(score) over all possible latent scene configurations with 3 blocks

represent probabilities of having 1 block vs. 3 blocks in the scene, and comparing them directly has a straightforward meaning. It's less clear how to compare densities in the transdimensional case, since they are defined over different base measures.
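
In Gen, these integrals can at least be estimated, e.g. with importance sampling: constrain :n to each value alongside the observations, and compare the resulting marginal-likelihood estimates. A rough sketch, reusing the illustrative `scene_model` from above and assuming the observations live in a choice map `obs`:

```julia
using Gen

# Sketch: estimate log ∫ exp(score) over configurations with exactly k
# blocks, by constraining :n = k and importance sampling over the rest.
function log_marginal_given_n(k::Int, obs::ChoiceMap, T::Int, sparsity::Float64;
                              num_particles::Int=1000)
    constraints = merge(obs, choicemap((:n, k)))
    (_, _, lml_est) =
        importance_sampling(scene_model, (T, sparsity), constraints, num_particles)
    return lml_est
end

# Estimated posterior log-odds of 1 block vs. 3 blocks:
# log_marginal_given_n(1, obs, T, sparsity) - log_marginal_given_n(3, obs, T, sparsity)
```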

So I guess my question for you would be, what qualitative inference behavior are you hoping that a change of weight would lead to?

For (2), I recommend checking out Gen's Involution MCMC capabilities, which are documented in Gen's inference library documentation.
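
To give a flavor of what that looks like, here is a rough sketch of a birth/death move written against the involution-function form of `mh`, reusing the illustrative addresses from the model sketch above (the exact API details may differ between Gen versions; see the docs for the authoritative version):

```julia
using Gen

# Forward proposal: flip a coin between adding ("birth") and removing
# ("death") a block; if adding, propose coordinates for the new block.
@gen function birth_death_proposal(trace)
    n = trace[:n]
    # A birth is forced when there is nothing to remove.
    add ~ bernoulli(n == 0 ? 1.0 : 0.5)
    if add
        new_x ~ uniform(0.0, 10.0)
        new_y ~ uniform(0.0, 10.0)
    end
end

# Involution: maps (model trace, forward proposal choices) to
# (new model trace, backward proposal choices), returning the model
# weight from `update`.
function birth_death_involution(trace, fwd_choices, fwd_ret, proposal_args)
    n = trace[:n]
    constraints = choicemap()
    bwd_choices = choicemap()
    if fwd_choices[:add]
        # Birth: append block n+1 at the proposed coordinates.
        constraints[:n] = n + 1
        constraints[(:block_x, n + 1)] = fwd_choices[:new_x]
        constraints[(:block_y, n + 1)] = fwd_choices[:new_y]
        bwd_choices[:add] = false
    else
        # Death: drop block n, stashing its coordinates so the reverse
        # (birth) move can be scored.
        constraints[:n] = n - 1
        bwd_choices[:add] = true
        bwd_choices[:new_x] = trace[(:block_x, n)]
        bwd_choices[:new_y] = trace[(:block_y, n)]
    end
    args = get_args(trace)
    (new_trace, weight, _, _) =
        update(trace, args, map(_ -> NoChange(), args), constraints)
    # The move only copies values between addresses, so the Jacobian
    # correction is log|det J| = 0.
    return (new_trace, bwd_choices, weight)
end

# One transdimensional MCMC step:
# (trace, accepted) = mh(trace, birth_death_proposal, (), birth_death_involution)
```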

collinskatie commented 4 years ago

Okay awesome! Thank you so much - that is all really helpful.

For the change in behavior: ideally, the model would favor (or more strongly endorse) configurations with fewer blocks, trading off between matching the observations in the trajectory and using as few blocks as possible. Right now, we are seeing that the model sometimes adds unnecessary blocks and then doesn't reliably prune them, seemingly ignoring the sparsity prior and focusing more strongly on matching the trajectory (so it seems not to be balancing these two components). After seeing this behavior, I looked more closely at the individual scores and saw that the change from extra blocks was tiny compared to changes in the trajectory - so that's why we thought the scoring itself might be problematic. But it sounds like you're saying that those relative scores make sense (i.e., we shouldn't try to reason at the individual score level?). We are also currently using kernels modified from the kernel_dsl.jl example in the repo: https://github.com/probcomp/Gen.jl/blob/master/examples/kernel_dsl.jl

Thanks again for your help!

bzinberg commented 4 years ago

One of the great things about the Bayesian approach is that the trade-off you describe, between model complexity and degree of data fit, is handled automatically by the so-called Bayesian Occam's razor (see, e.g., Murray & Ghahramani 2005). This "automatically" is in contrast to explicit regularization methods, which play the analogous role in discriminative models such as neural networks.

Now, in order to get the right Occam's razor behavior, you need to have the right model. But in a way, that is just another diagnostic that catches when the model needs to be changed.

> we shouldn't try to reason at the individual score level?

My rough intuition is that looking at the raw value of a single density (for a continuous or mixed discrete-continuous distribution) is usually not a good idea: there is no good intuitive explanation of it, and there are lots of bad but tempting-to-believe intuitions. Certain quantities derived from densities are meaningful, though:

  1. Ratios of densities, if the densities are with respect to the same base measure. For example, if you fix the number of blocks and then consider the ratio of probability densities (or, in log space, the difference of scores) between two different settings of the latent continuous state of the blocks, that ratio can be meaningfully interpreted as a comparison of how much the posterior favors one configuration versus the other. (It does not, however, indicate how probable either of the two configurations is in absolute terms, just which one is more probable and by how much.)

  2. The integral of a density over a region of the hypothesis space that has positive measure (e.g., an interval if the hypothesis space is 1D, or a rectangle or ball if it is 2D, etc.). These are marginal probabilities. However, such integrals are often intractable, so we usually cannot compute these numbers directly.

  3. The acceptance ratio of Reversible Jump (in Gen terms, Involution MCMC). This acceptance ratio (see Green 1995) is carefully constructed so that the comparison is meaningful, despite all the above challenges. Two key pieces of that construction are "dimension matching" (so that the densities being compared are over spaces of the same dimension) and a Jacobian correction (an adjustment needed to account for the fact that the base measures are different, even once they are of the same dimension). Concretely, this means that rather than comparing the densities directly, it is more meaningful to look at the percentage of reversible-jump moves that are accepted. If "remove" moves are very rarely accepted, that's a problem that should be diagnosed; e.g., it may be that the proposal is too far from the posterior to give good convergence time and needs to be changed. BTW, https://github.com/probcomp/Gen.jl/issues/250 and https://github.com/probcomp/Gen.jl/issues/115 will help with this debugging once they're implemented: they'll let the user look directly at the acceptance ratio rather than having to estimate it indirectly by tallying up the results of a bunch of moves (a sketch of such tallying follows this list).
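
For concreteness, a minimal sketch of that tallying, assuming the hypothetical birth/death kernel sketched earlier in this thread:

```julia
using Gen

# Run the birth/death kernel for many iterations and tally how often the
# transdimensional move is accepted (illustrative; `birth_death_proposal`
# and `birth_death_involution` are the hypothetical kernel from above).
function run_chain(trace; iters::Int=1000)
    accept_count = 0
    for _ in 1:iters
        trace, accepted = mh(trace, birth_death_proposal, (),
                             birth_death_involution)
        accept_count += accepted  # Bool promotes to Int
    end
    println("birth/death acceptance rate: ", accept_count / iters)
    return trace
end
```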

collinskatie commented 4 years ago

That makes a lot of sense - I'll look through those resources! Thank you so much @bzinberg for your help!!