pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

Wishart fails with valid arguments #538

Closed fonnesbeck closed 8 years ago

fonnesbeck commented 10 years ago

Models that use the Wishart as a prior currently fail with valid parameters. For example,

s = pm.Wishart('s', 4, 3, np.eye(3))

Fails with the following:

TypeError: Wrong number of dimensions: expected 0, got 2 with shape (3, 3)

I've tried a variety of 3x3 matrices, but all yield similar errors.

PS - I have a pull request that removed the redundant second parameter.

jsalvatier commented 10 years ago

I think the issue is that you didn't specify 'shape'. That's definitely something we should be able to infer, though.

fonnesbeck commented 10 years ago

Yeah, I tried that, and got:

---------------------------------------------------------------------------
AsTensorError                             Traceback (most recent call last)
<ipython-input-73-6f02cca3962c> in <module>()
      1 with pm.Model():
----> 2     s = pm.Wishart('s', 4, 3, np.eye(3), shape=(3,3))

/Users/fonnescj/Code/pymc/pymc/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
     17             data = kwargs.pop('observed', None)
     18             dist = cls.dist(*args, **kwargs)
---> 19             return model.Var(name, dist, data)
     20         elif name is None:
     21             return object.__new__(cls) #for pickle

/Users/fonnescj/Code/pymc/pymc/model.pyc in Var(self, name, dist, data)
    142         """
    143         if data is None:
--> 144             var = FreeRV(name=name, distribution=dist, model=self)
    145             self.free_RVs.append(var)
    146         else:

/Users/fonnescj/Code/pymc/pymc/model.pyc in __init__(self, type, owner, index, name, distribution, model)
    321             self.tag.test_value = np.ones(
    322                 distribution.shape, distribution.dtype) * distribution.default()
--> 323             self.logp_elemwiset = distribution.logp(self)
    324             self.model = model
    325 

/Users/fonnescj/Code/pymc/pymc/distributions/multivariate.pyc in logp(self, X)
    179              2) - n * log(IVI) - 2 * multigammaln(p, n / 2)) / 2,
    180 
--> 181             all(n > p - 1))

/Library/Python/2.7/site-packages/theano/tensor/basic.pyc in all(x, axis, keepdims)
   5078 
   5079 def all(x, axis=None, keepdims=False):
-> 5080     out = elemwise.All(axis)(x)
   5081 
   5082     if keepdims:

/Library/Python/2.7/site-packages/theano/gof/op.pyc in __call__(self, *inputs, **kwargs)
    409         """
    410         return_list = kwargs.pop('return_list', False)
--> 411         node = self.make_node(*inputs, **kwargs)
    412         if self.add_stack_trace_on_call:
    413             self.add_tag_trace(node)

/Library/Python/2.7/site-packages/theano/tensor/elemwise.pyc in make_node(self, input)
   1639 
   1640     def make_node(self, input):
-> 1641         input = as_tensor_variable(input)
   1642         if input.dtype not in ["int8", "uint8"]:
   1643             input = theano.tensor.neq(input, 0)

/Library/Python/2.7/site-packages/theano/tensor/basic.pyc in as_tensor_variable(x, name, ndim)
    181     if isinstance(x, bool):
    182         raise AsTensorError(
--> 183             "Cannot cast True or False as a tensor variable. Please use 1 or "
    184             "0. This error might be caused by using the == operator on "
    185             "Variables. v == w does not do what you think it does, "

AsTensorError: Cannot cast True or False as a tensor variable. Please use 1 or 0. This error might be caused by using the == operator on Variables. v == w does not do what you think it does, use theano.tensor.eq(v, w) instead.

It does not seem to like the boundary check.
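
For reference, here is a minimal sketch of what seems to be going on (my own illustration in plain Theano; the variable names are hypothetical): when n and p are ordinary Python numbers, the expression n > p - 1 evaluates eagerly to a Python bool, which Theano's all cannot cast to a tensor. Building the comparison symbolically avoids the error.

import theano.tensor as tt

n, p = 4, 3

# tt.all(n > p - 1)              # n > p - 1 is already a Python bool -> AsTensorError
check = tt.all(tt.gt(n, p - 1))  # build the comparison symbolically instead
print(check.eval())              # 1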

jsalvatier commented 10 years ago

Hmm, that seems bad. Do we have tests for it?

fonnesbeck commented 10 years ago

We have one, which it passes. Does my example fail for you?

jsalvatier commented 10 years ago

Yup, it fails for me too.

twiecki commented 10 years ago

#547 has a bunch of changes to the Wishart. Might that fix it? @kadeng?

kadeng commented 10 years ago

@twiecki: You should definitely take a look at the changes I made there. I think that the current Wishart implementation does not really work. I found discrepancies between the likelihood formula in the code and the formulas in the literature (for example, in Gelman's Bayesian Data Analysis and in Murphy's Machine Learning: A Probabilistic Perspective). What's problematic is that different authors use different parametrizations and name the parameters inconsistently, so in the end I was pretty confused.

But I haven't spent a lot of time validating that my fix is in turn correct and works, so please don't rely on it; use it as a suggestion :) The following paper is also an interesting read when it comes to actually using the Wishart or inverse Wishart as a prior for the precision or covariance matrix of an MVN: http://ba.stat.cmu.edu/journal/2013/vol08/issue02/huang.pdf

wgreenr commented 9 years ago

I recently tried to sample from a Wishart prior using:

import pymc3 as pm
import numpy as np

prec_prior = np.array([[ 25.3968254,   -1.58730159],
                       [ -1.58730159,   6.34920635]])
with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))
    step = pm.Metropolis()
    trace = pm.sample(10000, step)

Instead I obtained a clearly diverging sampling process, with some entries of the sampled matrices moving towards negative infinity and not in any way resembling the prior precision matrix:

[trace plot showing the diverging Wishart samples]

Then I decided to check whether my notion of the Wishart distribution was fundamentally flawed, and using the code from http://www.mit.edu/~mattjj/released-code/hsmm/stats_util.py, I obtained samples that look as one might expect given a quite narrow prior centered at the precision matrix prec_prior:

    sample_wishart(prec_prior / 100.0, 100.0)
    array([[ 26.37928925,  -3.88792305],
           [ -3.88792305,   7.6372514 ]])

The question is: is this behavior of the MCMC sampler expected, or is it a bug? And how does one properly sample from the Wishart?

kiudee commented 9 years ago

Yes, this is a bug and the same behavior I observed with the Wishart distribution (see issue #672). If you can find the source of the error in the current code, we would greatly appreciate it.

I recommend taking a look at the LKJ distribution, which is a prior distribution for the correlation matrix instead of the covariance/precision matrix. This has the advantage that you can specify separate priors for the standard deviations and the correlations.
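
For example, a rough sketch of that approach (an illustration only; the LKJCorr signature with n as the shape parameter and p as the dimension, and the Uniform priors on the standard deviations, are assumptions on my part rather than a recommended setup):

import pymc3 as pm
import theano.tensor as T

with pm.Model() as model:
    # separate priors for the standard deviations and the correlation
    sd = pm.Uniform('sd', 0, 10, shape=2)
    corr = pm.LKJCorr('corr', n=1, p=2)   # vector of the p*(p-1)/2 off-diagonal correlations

    # assemble the 2x2 correlation matrix and the implied covariance diag(sd) * R * diag(sd)
    R = T.stacklists([[1., corr[0]],
                      [corr[0], 1.]])
    cov = pm.Deterministic('cov', T.diag(sd).dot(R).dot(T.diag(sd)))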

wgreenr commented 9 years ago

Will run some tests; the Wishart logp seems to look good.

twiecki commented 9 years ago

Seems like we're not enforcing the tau > 0 constraint.

kiudee commented 9 years ago

Indeed, this is the case. (See below)

I added the constraint and tried it with the given example:

import numpy as np
import pymc3 as pm
import scipy.optimize as opt
prec_prior = np.array([[ 25.3968254,   -1.58730159],
                       [ -1.58730159,   6.34920635]])
with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))

    start = pm.find_MAP(fmin=opt.fmin_powell)
    step = pm.Metropolis()
    trace = pm.sample(10000, step, start)

Here is the resulting trace (without any thinning):

[trace plot of the sampled precision matrix entries]

PS: I’m sorry, it is too late here. I just realized that this is not true. The matrices need to be positive definite.

twiecki commented 9 years ago

Oh, right. I experimented a bit with all(gt(eig(X), 0)) as a condition but didn't get it to work yet.

twiecki commented 9 years ago

It should actually read all(gt(eigh(X)[0], 0)) which at least does not raise an exception. However, I'm still seeing the divergent sampling... maybe @karlnapf can advise on sampling from the Wishart distribution.

twiecki commented 9 years ago

The other thing we have to enforce is the symmetry:

        return bound(
            ((n - p - 1) * log(IXI) - trace(matrix_inverse(V).dot(X)) -
                n * p * log(2) - n * log(IVI) - 2 * multigammaln(n / 2., p)) / 2,
            gt(n, (p - 1)),
            all(gt(eigh(X)[0], 0)),
            eq(X, X.T)
        )

Unfortunately, it now becomes quite obvious that trying to sample this matrix naively under these constraints is impossible, so we should probably sample only the upper triangle and use the Cholesky decomposition.
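
To see why, a tiny numpy illustration (my own, not PyMC code): a single elementwise proposal immediately breaks symmetry, and even a symmetric perturbation can leave the positive-definite cone, so a naive random walk on the matrix entries essentially never stays in the support.

import numpy as np

X = np.array([[1.0, 0.9],
              [0.9, 1.0]])                  # symmetric positive definite
print(np.linalg.eigvalsh(X))                # both eigenvalues > 0

X_prop = X.copy()
X_prop[0, 1] += 0.3                         # one elementwise proposal
print(np.allclose(X_prop, X_prop.T))        # False: symmetry is lost

X_sym = X + 0.2 * np.array([[-1.,  1.],     # a symmetric perturbation
                            [ 1., -1.]])
print(np.linalg.eigvalsh(X_sym))            # smallest eigenvalue is now negative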

kadeng commented 9 years ago

Sampling from the Wishart and inverse Wishart seems to be something that has received some rather sophisticated analysis to date. Make sure you take at least a quick look at these:

http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf
http://www.gwu.edu/~forcpgm/YuChengKu-030510final-WishartYu-ChengKu.pdf

best,

Kai

twiecki commented 9 years ago

Thanks @kadeng. I'm starting to wonder if we shouldn't just ditch the Wishart. @betanalpha suggests here: https://www.youtube.com/watch?v=xWQpEAyI5s8 that it's a terrible choice of prior to begin with and recommends the LKJ, which we already have working.

betanalpha commented 9 years ago

Referencing a YouTube video -- is that a first? There are two problems going on with the Wishart.

Firstly, because of the positive-definite and symmetric constraints, the probability of generating a valid sample by varying every element is zero (technically nonzero in floating point, but small enough to make it infeasible). Really the only way to sample these objects is to transform the target distribution to the N * (N + 1) / 2-dimensional unconstrained space, sample there, and then map the samples back to the constrained space. For details see the "Transformations of Constrained Variables" chapter in the Stan manual.

Secondly, the Wishart distribution is very heavy-tailed, which causes all kinds of problems for naive samplers. Formally, heavy tails are obstructions to geometric ergodicity, which is essentially the minimal condition needed to trust MCMC expectations in practice. More intuitively, when the tails of the target distribution are heavy the sampler drifts around in the tails and can't find its way back into the bulk of the distribution. NUTS will technically work okay, but only if you let the depth of the tree grow very, very large and incur the resulting computational cost. LKJ has much nicer tails and consequently pairs much better with most MCMC samplers. We'd also argue that the shape of LKJ priors is more in line with modelers' intuitions than the Wishart (in particular, there is a really nice interpretation in terms of the Cholesky factor), hence why we recommend them as defaults.
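
For intuition about the first point, here is a rough numpy sketch (an illustration only, not the Stan or PyMC implementation) of such a transformation: N * (N + 1) / 2 unconstrained numbers are mapped to a Cholesky factor with a positive diagonal, whose product is automatically symmetric and positive definite.

import numpy as np

def unconstrained_to_spd(theta, N):
    """Map N*(N+1)/2 unconstrained values to a symmetric positive-definite matrix."""
    L = np.zeros((N, N))
    L[np.tril_indices(N)] = theta        # fill the lower triangle
    diag = np.diag_indices(N)
    L[diag] = np.exp(L[diag])            # exponentiate the diagonal so it stays positive
    return np.dot(L, L.T)                # L L^T is symmetric and positive definite

N = 2
theta = np.random.randn(N * (N + 1) // 2)                   # sample freely in R^3
print(np.linalg.eigvalsh(unconstrained_to_spd(theta, N)))   # eigenvalues are always > 0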

twiecki commented 9 years ago

Many thanks for chiming in @betanalpha. I actually meant to connect with you to get some feedback on pymc3, if you don't mind. One thing that Stan does very well is the automatic transformations; I think it wouldn't be too hard to implement something similar here.

I now added a warning not to use the Wishart unless we fix it using the transform method.

Here's an example of how one can use the LKJ prior in pymc3: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/LKJ_correlation.py

wgreenr commented 9 years ago

Thanks everyone, especially @betanalpha and @twiecki, for the insightful comments! It seems that LKJ is the way to go.

twiecki commented 9 years ago

I tried the transform approach as suggested by @betanalpha and tried to follow the Stan docs. This is the model I came up with. @betanalpha, if you have a sec it'd be great to get your input on whether you think this is correctly set up; the results make sense, but this is the first time I've worked with multivariate transformations.

import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5],
                        [.5, 1]])

prec = np.linalg.inv(covariance)

mean = [.5, 1]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)

plt.scatter(data[:, 0], data[:, 1])

S = np.matrix([[1, .5],
               [.5, 1]])
L = scipy.linalg.cholesky(S)
nu = 5
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=2)
    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 4), shape=2))
    z = pm.Normal('z', 0, 1)
    A = T.stacklists([[c[0], 0], 
                      [z, c[1]]])

    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    wishart = pm.Deterministic('wishart', T.dot(T.dot(T.dot(L, A), A.T), L.T))
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(wishart))
    lp = pm.MvNormal('likelihood', mu=mu, tau=wishart, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:]);

print np.mean(trace['wishart'], axis=0)
print prec

print np.mean(trace['cov'], axis=0)
print covariance

Output:

[[ 0.57725294 -0.29173294]
 [-0.29173294  1.16447192]]
[[ 0.57142857 -0.28571429]
 [-0.28571429  1.14285714]]

[[ 1.9955087   0.50381087]
 [ 0.50381087  0.9889754 ]]
[[ 2.   0.5]
 [ 0.5  1. ]]

[traceplot of the sampled parameters]

twiecki commented 9 years ago

I refactored this now into a function with the same syntax as a distribution. However, it's a bit odd as it adds RVs to the model itself. I think it's pretty convenient, though. @jsalvatier @fonnesbeck thoughts on adding this?

import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5, -.5],
                        [.5, 1.,  0.],
                        [-.5, 0., 0.5]])

prec = np.linalg.inv(covariance)

mean = [.5, 1, .2]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)

plt.scatter(data[:, 0], data[:, 1])

S = np.eye(3)
L = scipy.linalg.cholesky(S)
nu = 5

def WishartBartlett(name, S, nu, is_cholesky=False, return_cholesky=False):
    """
    Bartlett decomposition of the Wishart distribution. As the Wishart
    distribution requires the matrix to be symmetric positive semi-definite
    it is impossible for MCMC to ever propose acceptable matrices.

    Instead, we can use the Bartlett decomposition, which samples a lower
    triangular matrix. Specifically:

    If A ~ [[sqrt(c_1), 0, ...],
            [z_21, sqrt(c_2), 0, ...],
            [z_31, z_32, sqrt(c_3), ...]]
    with c_i ~ Chi²(n - i + 1) and z_ij ~ N(0, 1), then
    L * A * A.T * L.T ~ Wishart(L * L.T, nu)

    See http://en.wikipedia.org/wiki/Wishart_distribution#Bartlett_decomposition
    for more information.

    :Parameters:
      S : ndarray
        p x p positive definite matrix
      nu : int
        Degrees of freedom, > dim(S).
      is_cholesky : bool (default=False)
        Input matrix S is already Cholesky decomposed as S.T * S
      return_cholesky : bool (default=False)
        Only return the Cholesky decomposed matrix.

    :Note:
      This is not a standard Distribution class but follows a similar
      interface. Besides the Wishart distribution, it will add RVs
      c and z to your model, which make up the matrix.
    """

    L = S if is_cholesky else scipy.linalg.cholesky(S)

    diag_idx = np.diag_indices_from(S)
    tril_idx = np.tril_indices_from(S, k=-1)
    n_diag = len(diag_idx[0])
    n_tril = len(tril_idx[0])
    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 2+n_diag), shape=n_diag))
    z = pm.Normal('z', 0, 1, shape=n_tril)
    # Construct A matrix
    A = T.zeros(S.shape, dtype=np.float32)
    A = T.set_subtensor(A[diag_idx], c)
    A = T.set_subtensor(A[tril_idx], z)

    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    if return_cholesky:
        return pm.Deterministic(name, T.dot(L, A))
    else:
        return pm.Deterministic(name, T.dot(T.dot(T.dot(L, A), A.T), L.T))

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=3)

    prec = WishartBartlett('prec', S, nu)
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(prec))

    lp = pm.MvNormal('likelihood', mu=mu, tau=prec, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:]);

betanalpha commented 9 years ago

Sorry for the late reply -- seems reasonable, although I can't comment on the integration of the transformation into your sampler.