I think the issue is that you didn't specify 'shape'. That's definitely something we should be able to infer, though.
On Fri, May 2, 2014 at 12:27 PM, Chris Fonnesbeck notifications@github.com wrote:
Models that use the Wishart as a prior currently fail with valid parameters. For example,
s = pm.Wishart('s', 4, 3, np.eye(3))
Fails with the following:
TypeError: Wrong number of dimensions: expected 0, got 2 with shape (3, 3)
I've tried a variety of 3x3 matrices, but all yield similar errors.
Yeah, I tried that, and got:
---------------------------------------------------------------------------
AsTensorError Traceback (most recent call last)
<ipython-input-73-6f02cca3962c> in <module>()
1 with pm.Model():
----> 2 s = pm.Wishart('s', 4, 3, np.eye(3), shape=(3,3))
/Users/fonnescj/Code/pymc/pymc/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
17 data = kwargs.pop('observed', None)
18 dist = cls.dist(*args, **kwargs)
---> 19 return model.Var(name, dist, data)
20 elif name is None:
21 return object.__new__(cls) #for pickle
/Users/fonnescj/Code/pymc/pymc/model.pyc in Var(self, name, dist, data)
142 """
143 if data is None:
--> 144 var = FreeRV(name=name, distribution=dist, model=self)
145 self.free_RVs.append(var)
146 else:
/Users/fonnescj/Code/pymc/pymc/model.pyc in __init__(self, type, owner, index, name, distribution, model)
321 self.tag.test_value = np.ones(
322 distribution.shape, distribution.dtype) * distribution.default()
--> 323 self.logp_elemwiset = distribution.logp(self)
324 self.model = model
325
/Users/fonnescj/Code/pymc/pymc/distributions/multivariate.pyc in logp(self, X)
179 2) - n * log(IVI) - 2 * multigammaln(p, n / 2)) / 2,
180
--> 181 all(n > p - 1))
/Library/Python/2.7/site-packages/theano/tensor/basic.pyc in all(x, axis, keepdims)
5078
5079 def all(x, axis=None, keepdims=False):
-> 5080 out = elemwise.All(axis)(x)
5081
5082 if keepdims:
/Library/Python/2.7/site-packages/theano/gof/op.pyc in __call__(self, *inputs, **kwargs)
409 """
410 return_list = kwargs.pop('return_list', False)
--> 411 node = self.make_node(*inputs, **kwargs)
412 if self.add_stack_trace_on_call:
413 self.add_tag_trace(node)
/Library/Python/2.7/site-packages/theano/tensor/elemwise.pyc in make_node(self, input)
1639
1640 def make_node(self, input):
-> 1641 input = as_tensor_variable(input)
1642 if input.dtype not in ["int8", "uint8"]:
1643 input = theano.tensor.neq(input, 0)
/Library/Python/2.7/site-packages/theano/tensor/basic.pyc in as_tensor_variable(x, name, ndim)
181 if isinstance(x, bool):
182 raise AsTensorError(
--> 183 "Cannot cast True or False as a tensor variable. Please use 1 or "
184 "0. This error might be caused by using the == operator on "
185 "Variables. v == w does not do what you think it does, "
AsTensorError: Cannot cast True or False as a tensor variable. Please use 1 or 0. This error might be caused by using the == operator on Variables. v == w does not do what you think it does, use theano.tensor.eq(v, w) instead.
It does not seem to like the boundary check.
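As an aside on what appears to be failing: when n and p are plain Python numbers, n > p - 1 is evaluated eagerly to a Python bool before Theano ever sees it, and theano.tensor.all refuses to cast a bool. Building the comparison symbolically, as the fix later in this thread does with gt(n, p - 1), avoids the error. A minimal sketch with illustrative values:

import theano.tensor as tt

n, p = 4, 3
# tt.all(n > p - 1)              # n > p - 1 is a plain Python bool -> AsTensorError
check = tt.all(tt.gt(n, p - 1))  # build the comparison as a Theano expression
print(check.eval())              # 1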
Hmm that seems bad. Do we have tests for it?
We have one, which it passes. Does my example fail for you?
Yup, it fails for me too.
@kadeng recently made some changes to the Wishart. Might that fix it, @kadeng?
@twiecki: You should definitely take a look at the changes I made there. I think the current Wishart implementation does not really work. I found discrepancies between the likelihood formula in the code and the formulas in the literature (for example, in Gelman's Bayesian Data Analysis and in Murphy's Machine Learning: A Probabilistic Perspective). The problem is that different authors use different parametrizations and name the parameters inconsistently, so in the end I was pretty confused.
But I haven't spent a lot of time validating that my fix is in turn correct, so please don't rely on it; use it as a suggestion :) The following paper is also an interesting read when it comes to actually using the Wishart or inverse Wishart as a prior for the precision or covariance matrix of MVNs: http://ba.stat.cmu.edu/journal/2013/vol08/issue02/huang.pdf
Recently I have tried to sample from a Wishart prior using:
import pymc3 as pm
import numpy as np

prec_prior = np.array([[ 25.3968254, -1.58730159],
                       [ -1.58730159,  6.34920635]])

with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))
    step = pm.Metropolis()
    trace = pm.sample(10000, step)
Instead I obtained a clearly diverging sampling process, with some entries of the sampled matrices moving towards negative infinity and not in any way resembling the prior precision matrix.
Then I decided to check whether my notion of the Wishart distribution was fundamentally flawed. Using the code from http://www.mit.edu/~mattjj/released-code/hsmm/stats_util.py, I obtained samples that look as one might expect given a quite narrow prior centered at the precision matrix prec_prior:
sample_wishart(prec_prior / 100.0, 100.0)
array([[ 26.37928925, -3.88792305],
[ -3.88792305, 7.6372514 ]])
The question is: is this behavior of the MCMC sampler expected, or is it a bug? And how does one properly sample from the Wishart?
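As a side note, the sanity check above can be reproduced with a few lines of NumPy, using the fact that a sum of nu outer products of iid N(0, V) vectors is Wishart(V, nu) distributed. This is only a rough sketch for checking expectations (sample_wishart_naive is a made-up name, not the stats_util.py or pymc3 implementation):

import numpy as np

def sample_wishart_naive(V, nu):
    # sum of outer products of nu iid N(0, V) draws ~ Wishart(V, nu)
    X = np.random.multivariate_normal(np.zeros(V.shape[0]), V, size=int(nu))
    return X.T.dot(X)

prec_prior = np.array([[ 25.3968254, -1.58730159],
                       [ -1.58730159,  6.34920635]])

# E[W] = nu * V = prec_prior here, so draws should hover around prec_prior
print(sample_wishart_naive(prec_prior / 100.0, 100.0))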
Yes, this is a bug and the same behavior I observed with the Wishart distribution (see issue #672). If you can find the source of the error in the current code, we would greatly appreciate it.
I recommend taking a look at the LKJ distribution, which is a prior distribution for the correlation matrix instead of the covariance/precision matrix. This has the advantage that you can specify separate priors for the standard deviations and the correlations.
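As a rough sketch of that suggestion, loosely following the LKJ_correlation.py example linked later in this thread (the LKJCorr argument names n and p reflect the pymc3 version discussed here and may differ in later releases): the standard deviations and the packed correlation triangle get separate priors and are combined into a covariance matrix.

import numpy as np
import pymc3 as pm
import theano.tensor as T

n_var = 2
n_elem = n_var * (n_var - 1) // 2

# index matrix that expands the packed triangle into a full (symmetric) matrix
tri_index = np.zeros((n_var, n_var), dtype=int)
tri_index[np.triu_indices(n_var, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(n_var, k=1)[::-1]] = np.arange(n_elem)

with pm.Model() as model:
    sd = pm.Uniform('sd', 0, 10, shape=n_var)               # prior on standard deviations
    packed_corr = pm.LKJCorr('packed_corr', n=1, p=n_var)   # prior on correlations
    corr = T.fill_diagonal(packed_corr[tri_index], 1.)      # full correlation matrix
    cov = T.diag(sd).dot(corr).dot(T.diag(sd))              # covariance matrix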
Will run some tests; the Wishart logp seems to look good.
Seems like we're not enforcing the tau > 0 constraint.
Indeed, this is the case. (See below)
I added the constraint and tried it with the given example:
import numpy as np
import pymc3 as pm
import scipy.optimize as opt

prec_prior = np.array([[ 25.3968254, -1.58730159],
                       [ -1.58730159,  6.34920635]])

with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))
    start = pm.find_MAP(fmin=opt.fmin_powell)
    step = pm.Metropolis()
    trace = pm.sample(10000, step, start)
Here is the resulting trace (without any thinning):
PS: I’m sorry, it is too late here. I just realized that this is not true. The matrices need to be positive definite.
Oh, right. I experimented a bit with all(gt(eig(X), 0)) as a condition but didn't get it to work yet.
It should actually read all(gt(eigh(X)[0], 0)), which at least does not raise an exception. However, I'm still seeing the divergent sampling... maybe @karlnapf can advise on sampling from the Wishart distribution.
The other thing we have to enforce is the symmetry:
return bound(
    ((n - p - 1) * log(IXI) - trace(matrix_inverse(V).dot(X)) -
     n * p * log(2) - n * log(IVI) - 2 * multigammaln(n / 2., p)) / 2,
    gt(n, (p - 1)),
    all(gt(eigh(X)[0], 0)),
    eq(X, X.T)
)
Unfortunately, it now becomes quite obvious that trying to sample this matrix naively with these constraints is impossible, so we should probably sample only the upper triangle and use the Cholesky decomposition.
Sampling from the Wishart and inverse Wishart seems to have been the subject of some rather sophisticated analysis to date. Make sure you take at least a quick look at these:
http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf
http://www.gwu.edu/~forcpgm/YuChengKu-030510final-WishartYu-ChengKu.pdf
best,
Kai
Thanks @kadeng. I'm starting to wonder if we shouldn't just ditch the Wishart. @betanalpha argues here: https://www.youtube.com/watch?v=xWQpEAyI5s8 that it's a terrible choice of prior to begin with and suggests using the LKJ, which we already have working.
Referencing a YouTube video -- is that a first? There are two problems going on with the Wishart.
Firstly, because of the positive-definite and symmetric constraints, the probability of generating a valid sample by varying every element independently is zero (technically nonzero in floating point, but small enough to make it infeasible). Really the only way to sample these objects is to transform the target distribution to the N * (N + 1) / 2-dimensional unconstrained space, sample there, and then map the samples back to the constrained space. For details see the "Transformations of Constrained Variables" chapter in the Stan manual.
Secondly, the Wishart distribution is very heavy-tailed, which causes all kinds of problems for naive samplers. Formally, heavy tails are obstructions to geometric ergodicity, which is essentially the minimal condition needed to trust MCMC expectations in practice. More intuitively, when the tails of the target distribution are heavy the sampler drifts around in the tails and can't find its way back into the bulk of the distribution. NUTS will technically work okay, but only if you let the depth of the tree grow very, very large and incur the resulting computational cost. LKJ has much nicer tails and consequently pairs much better with most MCMC samplers. We'd also argue that the shape of LKJ priors is more in line with modelers' intuitions than the Wishart (in particular, there is a really nice interpretation in terms of the Cholesky factor), hence why we recommend them as defaults.
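To make the first point concrete, here is a minimal NumPy sketch of the kind of map described in the Stan manual (the function name is made up for illustration): the N(N+1)/2 unconstrained parameters are the log-diagonal and the strict lower triangle of a Cholesky factor, so every unconstrained point maps to a symmetric positive-definite matrix. A real implementation would also add the log-Jacobian of this map to the log-density:

import numpy as np

def unconstrained_to_spd(theta, p):
    # first p entries: log of the Cholesky diagonal; remainder: strict lower triangle
    L = np.zeros((p, p))
    L[np.diag_indices(p)] = np.exp(theta[:p])
    L[np.tril_indices(p, k=-1)] = theta[p:]
    return L.dot(L.T)  # symmetric positive definite by construction

theta = np.random.randn(2 * (2 + 1) // 2)  # 3 free parameters for a 2x2 matrix
print(unconstrained_to_spd(theta, p=2))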
Many thanks for chiming in @betanalpha, I actually meant to connect with you to get some feedback on pymc3, if you don't mind. One thing that stan does very well is the automatic transformations. I think it wouldn't be too hard to implement something similar here.
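For a single positive-constrained variable, the ingredients of such a transform are small. A hedged sketch of the idea (plain Python, not pymc3 API): the sampler works on y = log(x) over the whole real line, and the log-density is corrected by the log-Jacobian of the inverse map:

import numpy as np

def transformed_logp(logp, y):
    # y is unconstrained; x = exp(y) is the positive-constrained value
    x = np.exp(y)
    # log|dx/dy| = y is the Jacobian correction for the change of variables
    return logp(x) + y

# example: logp of an Exponential(1) prior on x, evaluated through the transform
print(transformed_logp(lambda x: -x, 0.3))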
I now added a warning not to use the Wishart unless we fix it using the transform method.
Here's an example of how one can use the LKJ prior in pymc3: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/LKJ_correlation.py
Thanks everyone, especially @betanalpha and @twiecki, for the insightful comments! It seems that LKJ is the way to go.
I tried the transform approach as suggested by @betanalpha and tried to follow the Stan docs. This is the model I came up with. @betanalpha, if you have a sec it'd be great to get your input on whether this is correctly set up; the results make sense, but this is the first time I've worked with multivariate transformations.
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5],
                        [.5, 1]])
prec = np.linalg.inv(covariance)
mean = [.5, 1]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)
plt.scatter(data[:, 0], data[:, 1])

S = np.matrix([[1, .5],
               [.5, 1]])
L = scipy.linalg.cholesky(S)
nu = 5

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=2)
    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 4), shape=2))
    z = pm.Normal('z', 0, 1)
    A = T.stacklists([[c[0], 0],
                      [z, c[1]]])
    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    wishart = pm.Deterministic('wishart', T.dot(T.dot(T.dot(L, A), A.T), L.T))
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(wishart))
    lp = pm.MvNormal('likelihood', mu=mu, tau=wishart, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:]);

print np.mean(trace['wishart'], axis=0)
print prec
print np.mean(trace['cov'], axis=0)
print covariance
Output:
[[ 0.57725294 -0.29173294]
[-0.29173294 1.16447192]]
[[ 0.57142857 -0.28571429]
[-0.28571429 1.14285714]]
[[ 1.9955087 0.50381087]
[ 0.50381087 0.9889754 ]]
[[ 2. 0.5]
[ 0.5 1. ]]
I've now refactored this into a function with the same syntax as a distribution. It's a bit odd in that it adds RVs to the model itself, but I think it's pretty convenient. @jsalvatier @fonnesbeck thoughts on adding this?
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[  2,  .5, -.5],
                        [ .5,  1.,  0.],
                        [-.5,  0.,  0.5]])
prec = np.linalg.inv(covariance)
mean = [.5, 1, .2]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)
plt.scatter(data[:, 0], data[:, 1])

S = np.eye(3)
L = scipy.linalg.cholesky(S)
nu = 5

def WishartBartlett(name, S, nu, is_cholesky=False, return_cholesky=False):
    """
    Bartlett decomposition of the Wishart distribution. As the Wishart
    distribution requires the matrix to be symmetric positive semi-definite,
    it is impossible for MCMC to ever propose acceptable matrices.

    Instead, we can use the Bartlett decomposition, which samples a lower
    triangular matrix. Specifically:

    If A ~ [[sqrt(c_1), 0, ...],
            [z_21, sqrt(c_2), 0, ...],
            [z_31, z_32, sqrt(c_3), ...]]
    with c_i ~ Chi²(nu - i + 1) and z_ij ~ N(0, 1), then
    L * A * A.T * L.T ~ Wishart(L * L.T, nu).

    See http://en.wikipedia.org/wiki/Wishart_distribution#Bartlett_decomposition
    for more information.

    :Parameters:
      S : ndarray
        p x p positive definite matrix
      nu : int
        Degrees of freedom, > dim(S).
      is_cholesky : bool (default=False)
        Input matrix S is already Cholesky decomposed as S.T * S
      return_cholesky : bool (default=False)
        Only return the Cholesky decomposed matrix.

    :Note:
      This is not a standard Distribution class but follows a similar
      interface. Besides the Wishart distribution, it will add RVs
      c and z to your model which make up the matrix.
    """
    L = S if is_cholesky else scipy.linalg.cholesky(S)
    diag_idx = np.diag_indices_from(S)
    tril_idx = np.tril_indices_from(S, k=-1)
    n_diag = len(diag_idx[0])
    n_tril = len(tril_idx[0])

    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 2 + n_diag), shape=n_diag))
    z = pm.Normal('z', 0, 1, shape=n_tril)

    # Construct A matrix
    A = T.zeros(S.shape, dtype=np.float32)
    A = T.set_subtensor(A[diag_idx], c)
    A = T.set_subtensor(A[tril_idx], z)

    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    if return_cholesky:
        return pm.Deterministic(name, T.dot(L, A))
    else:
        return pm.Deterministic(name, T.dot(T.dot(T.dot(L, A), A.T), L.T))

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=3)
    prec = WishartBartlett('prec', S, nu)
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(prec))
    lp = pm.MvNormal('likelihood', mu=mu, tau=prec, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:]);
Sorry for the late reply -- seems reasonable although I can't comment on the integration of the transformation into your sampler.
PS - I have a pull request that removed the redundant second parameter.