tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

Not understanding the Unconstrained Representation in MCMC #584

Open zhulingchen opened 4 years ago

zhulingchen commented 4 years ago

Recently I have been learning how to do MCMC with TFP using your Bayesian Gaussian Mixture Model example at https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Bayesian_Gaussian_Mixture_Model.ipynb.

One thing that has confused me for a long time is the purpose of having

unconstraining_bijectors = [
    tfb.SoftmaxCentered(),
    tfb.Identity(),
    tfb.Chain([
        tfb.TransformDiagonal(tfb.Softplus()),
        tfb.FillTriangular(),
    ])]

According to the descriptions mentioned above, "Hamiltonian Monte Carlo (HMC) requires the target log-probability function be differentiable with respect to its arguments. Furthermore, HMC can exhibit dramatically higher statistical efficiency if the state-space is unconstrained."

What I am not sure about is this: does each element in unconstraining_bijectors refer to the bijector used for the corresponding element in initial_state?

i.e., tfb.SoftmaxCentered() is used to transform the component weights tf.fill([components], value=np.array(1. / components, dtype), name='mix_probs'),

tfb.Identity() is used to transform the means tf.constant(np.array([[-2, -2], [0, 0], [2, 2]], dtype)), as there is no need to transform them,

and tfb.Chain([tfb.TransformDiagonal(tfb.Softplus()), tfb.FillTriangular()]) is used to transform the Cholesky decomposition of the precision matrix (the inverse covariance matrix) tf.eye(dims, batch_shape=[components], dtype=dtype, name='chol_precision')?

Please help me clarify. Thank you!

junpenglao commented 4 years ago

What I am not sure about is this: does each element in unconstraining_bijectors refer to the bijector used for the corresponding element in initial_state?

Yes, your understanding is correct. As you can see, the first and the last parameters are in a constrained space and need to be mapped to an unconstrained space (i.e., so their domain is the real line).
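For concreteness, here is a small sketch assembled from the snippets quoted above (assuming components = 3 and dims = 2 as in the notebook; not the notebook's exact code) showing how each bijector pairs with its state part:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfb = tfp.bijectors

dtype, components, dims = np.float32, 3, 2
initial_state = [
    tf.fill([components], value=np.array(1. / components, dtype), name='mix_probs'),
    tf.constant(np.array([[-2, -2], [0, 0], [2, 2]], dtype), name='loc'),
    tf.eye(dims, batch_shape=[components], dtype=dtype, name='chol_precision'),
]
unconstraining_bijectors = [
    tfb.SoftmaxCentered(),
    tfb.Identity(),
    tfb.Chain([tfb.TransformDiagonal(tfb.Softplus()), tfb.FillTriangular()]),
]
# The i-th bijector transforms the i-th state part: inverse() maps the constrained
# initial value into the unconstrained space HMC samples in, and forward() maps
# unconstrained samples back to the constrained space.
unconstrained = [b.inverse(s) for b, s in zip(unconstraining_bijectors, initial_state)]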

brianwa84 commented 4 years ago

We could replace that Chain with tfb.ScaleTriL().

The bijectors also remove (or enforce, depending which way you're thinking about it) the constraint that the entries above the diagonal of a Cholesky factor must be zero and the diagonal must be positive (in this case by mapping from free vectors in R^{n(n+1)/2} to Cholesky factors).

If some latent values must live on a manifold, the bijectors provide a smooth mapping from unconstrained reals to said manifold.
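To make the direction of that mapping concrete, here is a small sketch (not from the notebook) using the Chain above with n = 2, so n(n+1)/2 = 3 free reals:

import numpy as np
import tensorflow_probability as tfp
tfb = tfp.bijectors

chol_bijector = tfb.Chain([
    tfb.TransformDiagonal(tfb.Softplus()),  # forces the diagonal to be positive
    tfb.FillTriangular(),                   # packs 3 free reals into a 2x2 lower triangle
])
chol = chol_bijector.forward(np.float32([-1., .5, 2.]))  # a valid Cholesky factor
free = chol_bijector.inverse(chol)                       # back to unconstrained R^3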


zhulingchen commented 4 years ago

We could replace that Chain with tfb.ScaleTriL(). The bijectors also remove (or enforce, depending which way you're thinking about it) the constraint that the entries above the diagonal of a Cholesky factor must be zero and the diagonal must be positive (in this case by mapping from free vectors in R^{n(n+1)/2} to Cholesky factors). If some latent values must live on a manifold, the bijectors provide a smooth mapping from unconstrained reals to said manifold.

Thanks Brian!

I thought of unconstraining_bijectors as an enforcement. I am not sure whether that view is correct, and I don't see how it could be thought of as some kind of removal of constraints.

What I thought was: the random variables to be sampled are unconstrained (i.e., can take any value representable by the dtype) in the sample chain, and after sampling they get constrained by the bijectors. Am I thinking about this correctly? How should I think about it in the "remove the constraint" way?

junpenglao commented 4 years ago

It is better to think of it as a transformation instead of an enforcement IMO, as it transforms parameters that are unconstrained into a set of parameters that satisfy the constraints.
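As a small illustration (a sketch, not code from the notebook), tfb.SoftmaxCentered() is exactly this kind of transformation for the mixture weights: it sends any real vector to a point on the probability simplex, and its inverse goes back.

import numpy as np
import tensorflow_probability as tfp
tfb = tfp.bijectors

b = tfb.SoftmaxCentered()
unconstrained = np.float32([0.3, -1.2])  # any two real numbers
weights = b.forward(unconstrained)       # three nonnegative values that sum to 1
back = b.inverse(weights)                # recovers the unconstrained vector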

zhulingchen commented 4 years ago

Thanks guys!

Another question that bothers me: why do we need unnormalized_posterior_log_prob = functools.partial(joint_log_prob, observations) to get a partial derivative of the joint log-probability with respect to the observations as the target_log_prob_fn for the MCMC kernel in the sample chain?

brianwa84 commented 4 years ago

partial is a "partial closure" of a function in that case, not a partial derivative. It's binding the first argument of joint_log_prob so that whoever calls the resulting callable doesn't have to keep track of observations.
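A toy sketch of what that means (the names and signature here are illustrative, not the notebook's actual joint_log_prob):

import functools

def joint_log_prob(observations, state):
    # A stand-in log-probability; note that partial() does not differentiate anything.
    return -sum((x - state) ** 2 for x in observations)

observations = [1.0, 2.0, 3.0]
target_log_prob_fn = functools.partial(joint_log_prob, observations)
# target_log_prob_fn(state) is now exactly joint_log_prob(observations, state).
print(target_log_prob_fn(2.0))  # -2.0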


zhulingchen commented 4 years ago

We could replace that Chain with tfb.ScaleTriL().

I found an issue with tfb.ScaleTriL(): it seems to only support tf.float32, not tf.float64.

zhulingchen commented 4 years ago

partial is a "partial closure" of a function in that case, not a partial derivative. It's binding the first argument of joint_log_prob so that whoever calls the resulting callable doesn't have to keep track of observations.

Speaking of the "partial closure" of a function, i.e., unnormalized_posterior_log_prob = functools.partial(joint_log_prob, observations) in the Bayesian Gaussian Mixture Model example: why don't we directly define unnormalized_posterior_log_prob without the observations argument and just use the outer variable observations (like a global variable) inside the function when computing rv_observations.log_prob(observations)? Is that doable?
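To be concrete, something like this rough toy sketch (not the notebook's actual code) is what I have in mind:

import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

observations = np.float32([0.1, -0.4, 0.3])  # captured from the enclosing scope

def unnormalized_posterior_log_prob(loc):
    # Uses the outer `observations` directly instead of taking it as an argument.
    rv_observations = tfd.Sample(tfd.Normal(loc, 1.), len(observations))
    return rv_observations.log_prob(observations)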

wcanetti commented 4 years ago

Hey guys. I am reading and also trying to understand the Bayesian Methods for Hackers book. I wanted to see if you can help me understand what this unconstrained space thing is. I am not sure whether it is a math thing, a computational thing, or something else.

In the following link, you can check the code. My doubt is: why does the alpha parameter need to be multiplied by 100? Is it a math thing? If so, what is the formula? How does he know you have to multiply by 100 and not by 10? And why does it need to be multiplied at all?

https://colab.research.google.com/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_TFP.ipynb#scrollTo=gp0QmuZvIA0L

The specific section: "Since HMC operates over unconstrained space, we need to transform the samples so they live in real-space. Alpha is 100x of beta approximately, so apply Affine scalar bijector to multiply the unconstrained alpha by 100 to get back to the Challenger problem space."

Many thanks in advance. Cheers, Walter.

brianwa84 commented 4 years ago

The simplest example would be, say, doing inference on the scale parameter of a normal distribution.

The scale parameter must be positive. But gradient-based methods want to operate on real, unconstrained values.

Suppose we write:

%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors  # added: needed for the TransformedTransitionKernel line below
import numpy as np
import matplotlib.pyplot as plt

true_scale = .2
observations = np.float32(np.random.randn(100) * true_scale)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    lambda scale: tfd.Sample(
        tfd.Normal(0., scale, validate_args=True), 100).log_prob(observations),
    num_leapfrog_steps=2,
    step_size=.05)
chain_states = tf.constant(np.float32(np.random.randn(13)**2 + 1))
samples = tf.function(tfp.mcmc.sample_chain)(
    100,
    current_state=chain_states,
    kernel=kernel,
    num_burnin_steps=200,
    num_steps_between_results=10,
    trace_fn=None)
plt.hist(samples.numpy().reshape(-1), bins=20, density=True);

We pretty quickly get an exception: InvalidArgumentError: assertion failed: [Argument scale must be positive.]

Adding a single line between the kernel = ... and chain_states = ... lines:

kernel = tfp.mcmc.TransformedTransitionKernel(kernel, tfb.Softplus())

gets us the ability to do inference over the manifold of positive floats, pulled back through the softplus bijection, with the instantaneous change in volume properly accounted for.

This gets more interesting when you have parameters like a lower triangular scale matrix (tfb.FillScaleTriL()) or a correlation matrix (tfb.CorrelationCholesky()), each of which sits on a particular type of N-dimensional manifold in M-dimensional space. The bijector pulls back from this manifold to an unconstrained N-d vector of reals.
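A quick sketch of that pullback (assuming a recent TFP where these bijectors are available; illustrative only):

import numpy as np
import tensorflow_probability as tfp
tfb = tfp.bijectors

# n = 2: a lower-triangular scale matrix has n(n+1)/2 = 3 free parameters...
free_scale = np.float32([0.2, -1.0, 0.7])
scale_tril = tfb.FillScaleTriL().forward(free_scale)  # positive diagonal, zeros above it

# ...while a correlation Cholesky factor has only n(n-1)/2 = 1 free parameter.
free_corr = np.float32([0.5])
corr_chol = tfb.CorrelationCholesky().forward(free_corr)  # unit-norm rows, positive diagonal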
