Often that's due to bad initialization. Can you try to init with ADVI (e.g. http://pymc-devs.github.io/pymc3/notebooks/stochastic_volatility.html#Fit-Model)?
Oh, and hey @alxempirical, good to see you here and glad you're giving pymc3 a test-drive :).
For this model, Metropolis might also be good enough and probably way faster.
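For reference, a minimal sketch of the ADVI initialization suggested above (hedged; assumes the pymc3 3.0-era API, and `my_model` is a placeholder for the user's model):

```python
import pymc3 as pm

with my_model:
    # Mean-field ADVI; returns an ADVIFit namedtuple (means, stds, elbo_vals)
    v_params = pm.variational.advi(n=100000)
    # Start NUTS at the ADVI means
    trace = pm.sample(2000, step=pm.NUTS(), start=v_params.means)
```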
> good to see you here

You too!
I don't think the problem is bad initialization. Here are the `ADVIFit.means` values vs. the `find_MAP` values:
| Parameter | ADVI | MAP |
|---|---|---|
| mu_constraints | -1.4 | -1.4 |
| mu_targets | -6.8 | -9.4 |
| C_1_DOF | 0.57 | 0.84 |
| tau_constraints | 0.44 | 0.81 |
| constraints | 0.079 | 0.079 |
| C_2_DOF | 0.38 | 0.12 |
| tau_targets | 0.053 | 0.27 |
| targets | 2.5e-05 | 2.5e-05 |
| tau_const | 14 | 14 |
These values fit well within the sample distributions:
> Metropolis might also be good enough and probably way faster.
Thanks, I will give that a go.
Best regards, Alex
The cov matrix found by ADVI might still be better suited but I agree, find_MAP seems to find a reasonable starting point. How many data points do you pass in?
I'm training on 2,000 data points.
How do I access the cov(ariance?) matrix from ADVI? It's not mentioned as a return value in `variational.advi`, only `means`, `stds`, and `elbo_vals`.
Also, any tips/docs on profiling? I would guess Theano confuses standard python profiling tools, but could be wrong.
`step = pm.NUTS(scaling=model.dict_to_array(sds), is_cov=True)` uses the standard deviations as the diagonal. Writing that, I realize that it should obviously use the variances, so I'll fix that.
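For concreteness, the corrected scaling would presumably look something like this (a sketch, assuming `v_params` is the `ADVIFit` returned by `variational.advi`):

```python
# Square the ADVI standard deviations to get variances for the diagonal
# scaling; is_cov=True tells NUTS the scaling is a covariance.
step = pm.NUTS(scaling=model.dict_to_array(v_params.stds)**2, is_cov=True)
```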
There are some profiling tools in pymc3 that @jsalvatier wrote a while back, but I haven't tried them. Would need to take a closer look.
Is it faster with Metropolis or Slice?
See the `model.profile` method for built-in profiling.
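A hedged example of that built-in profiler (assuming the pymc3 3.0-era API, where `model.logpt` is the model's joint log-probability):

```python
with model:
    # Compile the log-probability, evaluate it repeatedly at the test point,
    # and print the Theano profiling summary.
    model.profile(model.logpt).summary()
```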
Thanks for the suggestions, both of you. I will try them both out when I productionize this code. I've got a decent run of results now, so I'm going to move on to the next step, for a while.
This still appears to be a problem. Even with ADVI initialization, this is what happens for a relatively simple model:
Ironically, this model used to work a few weeks ago. The model structure is as follows:
```python
with Model() as model:

    cases, n = inpatient_data[['cases', 'n']].values.T

    # Cases weighted by monitoring effort and market share
    weights = monitoring_rate[INPATIENT] * np.array(inpatient_market_share)[inpatient_data.year]

    # Uniform prior on weighted cases
    weighted_cases = Uniform('weighted_cases', cases, n, shape=inpatient_data.shape[0],
                             testval=(cases/weights).astype(int))

    # Likelihood of observed cases
    weighting = Binomial('weighting', n=weighted_cases.astype('int32'), p=weights, observed=cases)

    if pool_years:

        assert not len(cases) % 3
        pooled_shape = int(len(cases)/3)

        pooled_cases = Deterministic('pooled_cases', tt.reshape(weighted_cases, (pooled_shape, 3)).sum(-1))

        θ = Normal('θ', 0, sd=1000, shape=pooled_shape)
        λ = Deterministic('λ', tt.exp(θ))

        AGE_obs = Potential('AGE_obs', Poisson.dist(λ*(np.reshape(n, (pooled_shape, 3))[:, 0])).logp(pooled_cases))

    else:

        θ = Normal('θ', 0, sd=1000, shape=inpatient_data.shape[0])

        # Poisson rate parameter
        λ = Deterministic('λ', tt.exp(θ))

        # Poisson likelihood
        AGE_obs = Potential('AGE_obs', Poisson.dist(λ*n).logp(weighted_cases))
```
@fonnesbeck maybe do a git bisect to see when this started happening?
Is this being run on 3.1 or master? There are some parts of NUTS that were refactored to use theano (notably, the leapfrog step).
I'm close to getting a full theano implementation of HMC and NUTS for 3.1 -- this will be a useful benchmark to be able to run (against whatever sha has reasonable performance).
@ColCarroll Would be curious to see the full theano implementation. Would hope for a significant speedup.
This is being run on master. It typically samples for a handful of iterations and stalls, even when ADVI initialization appears to work. I may just use ADVI for inference for expediency.
@fonnesbeck Well if it used to work it might indicate a regression which we better track down.
I ran the profiler on a hierarchical model. Once using:
```python
with gp_fit:
    step = pm.NUTS(profile=True)
    pm.sample(500, step)

step.leapfrog1_dE.profile.summary()
```
Output:
```
Function profiling
==================
Message: /home/wiecki/working/projects/pymc/pymc3/step_methods/nuts.py:230
Time in 346725 calls to Function.__call__: 5.120690e+01s
Time in Function.fn.__call__: 4.736680e+01s (92.501%)
Time in thunks: 3.240043e+01s (63.274%)
Total compile time: 2.547425e+01s
Number of Apply nodes: 216
Theano Optimizer time: 4.430967e+00s
Theano validate time: 1.481969e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.013688e+01s
Import time 1.104512e-01s
Time in all call to theano.grad() 7.848223e+00s
Time since theano import 271.369s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
63.8% 63.8% 20.673s 4.81e-07s C 42993900 124 theano.tensor.elemwise.Elemwise
8.9% 72.7% 2.881s 2.77e-07s C 10401750 30 theano.tensor.basic.Reshape
8.1% 80.8% 2.628s 3.79e-06s C 693450 2 theano.tensor.basic.Join
5.3% 86.1% 1.705s 4.10e-07s C 4160700 12 theano.tensor.elemwise.DimShuffle
4.2% 90.3% 1.377s 2.21e-07s C 6241050 18 theano.compile.ops.Rebroadcast
4.0% 94.3% 1.289s 4.65e-07s C 2773800 8 theano.tensor.subtensor.Subtensor
1.8% 96.1% 0.576s 2.08e-07s C 2773800 8 theano.tensor.elemwise.Sum
1.4% 97.5% 0.462s 1.66e-07s C 2773800 8 theano.compile.ops.ViewOp
1.1% 98.6% 0.373s 3.58e-07s C 1040175 3 theano.tensor.basic.ScalarFromTensor
0.8% 99.4% 0.246s 7.09e-07s C 346725 1 theano.tensor.blas_c.CGemv
0.3% 99.7% 0.105s 3.02e-07s C 346725 1 theano.tensor.basic.AllocEmpty
0.3% 100.0% 0.087s 2.52e-07s C 346725 1 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
8.1% 8.1% 2.628s 3.79e-06s C 693450 2 Join
5.2% 13.4% 1.698s 2.72e-07s C 6241050 18 Reshape{1}
4.2% 17.6% 1.362s 3.93e-07s C 3467250 10 InplaceDimShuffle{x}
3.7% 21.3% 1.212s 3.50e-06s C 346725 1 Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * (Switch(EQ(i2, i3), i3, Switch(Composite{(GE(i0, i1) * LE(i0, i2))}(i2, i3, i4), (i5 / i2), i3)) + Switch(EQ((i6 - i2), i3), i3, Switch(Composite{(GE(i0, i1) * LE(i0, i2))}(i2, i3, i4), (-(i7 / (i6 - i2))), i3)))) + (scalar_sigmoid(i1) * (Switch(i8, i3, Switch(i9, (exp(i0) * i10), i3)) + Switch(i11, i3, Switch(i9, i12, i3)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * Com
3.7% 24.9% 1.183s 2.84e-07s C 4160700 12 Reshape{0}
3.5% 28.5% 1.141s 3.29e-06s C 346725 1 Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * (Switch(i2, i3, Switch(i4, (i5 / i6), i3)) + Switch(i7, i3, Switch(i4, (-(i8 / i9)), i3)))) + (scalar_sigmoid(i1) * (Switch(i10, i3, Switch(i11, (i12 * i13), i3)) + Switch(i14, i3, Switch(i11, i15, i3)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * Composite{((sgn((scalar_sigmoid(i0) * scalar_sigmoid(i1))) * i2 * i2) / i3)}(i0, i1, i16, i17)) + (scalar_sigmoid(i0) * scalar
3.3% 31.7% 1.055s 5.07e-07s C 2080350 6 sigmoid
2.7% 34.5% 0.884s 2.55e-06s C 346725 1 Elemwise{Composite{Switch(i0, (Switch(i1, i2, (i3 * log(i4))) + i5 + Switch(i6, i2, (i7 * log(i8)))), i2)}}[(0, 4)]
2.6% 37.1% 0.844s 4.06e-07s C 2080350 6 Subtensor{int64}
2.6% 39.7% 0.843s 2.03e-07s C 4160700 12 Rebroadcast{1}
2.3% 42.0% 0.745s 2.15e-06s C 346725 1 Elemwise{Composite{((-((i0 + i1 + Switch(i2, (Switch(i3, i4, (i5 * log(i6))) + i7 + Switch(EQ(i8, i9), i4, i10)), i4) + Switch((GE(i11, i12) * LE(i11, i13)), i12, i4) + log(i14) + Switch(i15, Switch(i16, (i17 - Switch(i18, i4, (i19 * scalar_softplus(i20)))), i4), i4) + log(i21) + Switch(i22, ((Switch(i23, i4, (i24 * i25 * i26)) + Switch(i27, i4, (i28 * i29 * i30))) - i31), i4) + log(i32)) - (i33 * i34))) - i35)}}[(0, 0)]
2.2% 44.1% 0.703s 2.90e-07s C 2427075 7 Elemwise{mul,no_inplace}
2.0% 46.2% 0.655s 1.89e-06s C 346725 1 Elemwise{Composite{(Switch(i0, ((Switch(i1, i2, (i3 * i4 * i5)) + Switch(i6, i2, (i7 * i8 * i9))) - i10), i2) + log((i11 / sqr(i12))))}}[(0, 11)]
2.0% 48.2% 0.653s 2.69e-07s C 2427075 7 Elemwise{neg,no_inplace}
2.0% 50.2% 0.643s 2.32e-07s C 2773800 8 Elemwise{eq,no_inplace}
2.0% 52.1% 0.633s 2.28e-07s C 2773800 8 Elemwise{add,no_inplace}
1.8% 53.9% 0.580s 4.19e-07s C 1386900 4 Elemwise{exp,no_inplace}
1.8% 55.7% 0.576s 2.08e-07s C 2773800 8 Sum{acc_dtype=float64}
1.8% 57.4% 0.573s 8.26e-07s C 693450 2 Elemwise{ScalarSoftplus}[(0, 0)]
1.7% 59.2% 0.556s 8.01e-07s C 693450 2 softplus
... (remaining 47 Ops account for 40.84%(13.23s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
4.5% 4.5% 1.449s 4.18e-06s 346725 94 Join(TensorConstant{0}, Rebroadcast{1}.0, Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * (Switch(EQ(i2, i3), i3, Switch(Composite{(GE(i0, i1) * LE(i0, i2))}(i2, i3, i4), (i5 / i2), i3)) + Switch(EQ((i6 - i2), i3), i3, Switch(Composite{(GE(i0, i1) * LE(i0, i2))}(i2, i3, i4), (-(i7 / (i6 - i2))), i3)))) + (scalar_sigmoid(i1) * (Switch(i8, i3, Switch(i9, (exp(i0) * i10), i3)) + Switch(i11, i3, Switch(i9, i12, i3)))) + (-((scalar_sigmoid
3.7% 8.2% 1.212s 3.50e-06s 346725 83 Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * (Switch(EQ(i2, i3), i3, Switch(Composite{(GE(i0, i1) * LE(i0, i2))}(i2, i3, i4), (i5 / i2), i3)) + Switch(EQ((i6 - i2), i3), i3, Switch(Composite{(GE(i0, i1) * LE(i0, i2))}(i2, i3, i4), (-(i7 / (i6 - i2))), i3)))) + (scalar_sigmoid(i1) * (Switch(i8, i3, Switch(i9, (exp(i0) * i10), i3)) + Switch(i11, i3, Switch(i9, i12, i3)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * Composite{((sg
3.6% 11.9% 1.179s 3.40e-06s 346725 209 Join(TensorConstant{0}, Rebroadcast{1}.0, Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * (Switch(i2, i3, Switch(i4, (i5 / i6), i3)) + Switch(i7, i3, Switch(i4, (-(i8 / i9)), i3)))) + (scalar_sigmoid(i1) * (Switch(i10, i3, Switch(i11, (i12 * i13), i3)) + Switch(i14, i3, Switch(i11, i15, i3)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * Composite{((sgn((scalar_sigmoid(i0) * scalar_sigmoid(i1))) * i2 * i2) / i3)}(i0, i1, i16, i17))
3.5% 15.4% 1.141s 3.29e-06s 346725 173 Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * (Switch(i2, i3, Switch(i4, (i5 / i6), i3)) + Switch(i7, i3, Switch(i4, (-(i8 / i9)), i3)))) + (scalar_sigmoid(i1) * (Switch(i10, i3, Switch(i11, (i12 * i13), i3)) + Switch(i14, i3, Switch(i11, i15, i3)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * Composite{((sgn((scalar_sigmoid(i0) * scalar_sigmoid(i1))) * i2 * i2) / i3)}(i0, i1, i16, i17)) + (scalar_sigmoid(i0) * scalar_sigmoid(i0
2.7% 18.1% 0.884s 2.55e-06s 346725 177 Elemwise{Composite{Switch(i0, (Switch(i1, i2, (i3 * log(i4))) + i5 + Switch(i6, i2, (i7 * log(i8)))), i2)}}[(0, 4)](Elemwise{Composite{(GE(i0, i1) * LE(i0, i2))}}.0, Elemwise{eq,no_inplace}.0, TensorConstant{(1,) of -inf}, TensorConstant{[ 27. 28... 37. 38.]}, Elemwise{sub,no_inplace}.0, TensorConstant{[ 28.17094...63057366]}, Elemwise{eq,no_inplace}.0, TensorConstant{[ 18. 17... 8. 7.]}, thetas)
2.3% 20.4% 0.745s 2.15e-06s 346725 215 Elemwise{Composite{((-((i0 + i1 + Switch(i2, (Switch(i3, i4, (i5 * log(i6))) + i7 + Switch(EQ(i8, i9), i4, i10)), i4) + Switch((GE(i11, i12) * LE(i11, i13)), i12, i4) + log(i14) + Switch(i15, Switch(i16, (i17 - Switch(i18, i4, (i19 * scalar_softplus(i20)))), i4), i4) + log(i21) + Switch(i22, ((Switch(i23, i4, (i24 * i25 * i26)) + Switch(i27, i4, (i28 * i29 * i30))) - i31), i4) + log(i32)) - (i33 * i34))) - i35)}}[(0, 0)](Sum{acc_dtype=float64}.0, Su
2.0% 22.4% 0.655s 1.89e-06s 346725 182 Elemwise{Composite{(Switch(i0, ((Switch(i1, i2, (i3 * i4 * i5)) + Switch(i6, i2, (i7 * i8 * i9))) - i10), i2) + log((i11 / sqr(i12))))}}[(0, 11)](Elemwise{Composite{(GE(i0, i1) * LE(i0, i2) * i3 * i4)}}.0, Elemwise{eq,no_inplace}.0, TensorConstant{(1,) of -inf}, TensorConstant{(1,) of -1.0}, InplaceDimShuffle{x}.0, Elemwise{ScalarSoftplus}[(0, 0)].0, Elemwise{Composite{EQ((i0 - i1), i2)}}.0, TensorConstant{(1,) of -1.0}, InplaceDimShuffle{x}.0, soft
1.5% 24.0% 0.495s 1.43e-06s 346725 68 Elemwise{Composite{Switch(i0, i1, Switch(i2, (-scalar_softplus(i3)), i1))}}(Elemwise{Composite{EQ((i0 - i1), i2)}}.0, TensorConstant{(1,) of 0}, Elemwise{Composite{(GE(i0, i1) * LE(i0, i2) * i3 * i4)}}.0, Subtensor{int64:int64:}.0)
1.5% 25.5% 0.494s 1.42e-06s 346725 197 Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * ((i2 / i3) + Switch(i4, i5, Switch(i6, (i7 / i8), i5)))) + (scalar_sigmoid(i1) * (Switch(i9, i5, Switch(i10, (exp(i0) * i11), i5)) + Switch(i12, i5, Switch(i10, (-i13), i5)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * i14) + (scalar_sigmoid(i0) * scalar_sigmoid(i0) * scalar_sigmoid(i1) * i15 * i14) + (scalar_sigmoid(i0) * scalar_sigmoid(i0) * scalar_sigmoid(i1) * i15 * i14))))}}[(0
1.5% 27.0% 0.486s 1.40e-06s 346725 84 Elemwise{Composite{Switch(i0, i1, Switch(i2, (-scalar_softplus(i3)), i1))}}[(0, 3)](Elemwise{eq,no_inplace}.0, TensorConstant{(1,) of 0}, Elemwise{Composite{(GE(i0, i1) * LE(i0, i2) * i3 * i4)}}.0, Elemwise{neg,no_inplace}.0)
1.4% 28.4% 0.463s 1.33e-06s 346725 176 Elemwise{ScalarSoftplus}[(0, 0)](Elemwise{neg,no_inplace}.0)
1.4% 29.8% 0.449s 1.29e-06s 346725 77 Elemwise{Composite{((scalar_sigmoid(i0) * scalar_sigmoid(i1) * ((i2 / i3) + Switch(EQ((i4 - i3), i5), i5, Switch((GE(i3, i5) * LE(i3, i6)), (i7 / (i4 - i3)), i5)))) + (scalar_sigmoid(i1) * (Switch(i8, i5, Switch(i9, (exp(i0) * i10), i5)) + Switch(i11, i5, Switch(i9, (-i12), i5)))) + (-((scalar_sigmoid(i0) * scalar_sigmoid(i1) * i13) + (scalar_sigmoid(i0) * scalar_sigmoid(i0) * scalar_sigmoid(i1) * i14 * i13) + (scalar_sigmoid(i0) * scalar_sigmoid(i0
1.3% 31.0% 0.407s 1.18e-06s 346725 107 softplus(Subtensor{int64:int64:}.0)
1.1% 32.2% 0.369s 1.06e-06s 346725 5 sigmoid(Subtensor{int64:int64:}.0)
1.0% 33.2% 0.328s 9.45e-07s 346725 89 Elemwise{Composite{((((i0 * i1) + (i2 * i3) + (i4 * i3) + (i5 * i1)) * i6) + Switch(EQ(i7, i8), i8, Switch(GE(i7, i9), Switch((GE(i7, i10) * LE(i7, i11)), (scalar_sigmoid(i12) * i13), i8), i8)) + (i14 * i6))}}[(0, 0)](Elemwise{Composite{(Switch(i0, i1, Switch(i2, (-scalar_softplus(i3)), i1)) + Switch(i2, (-i4), i1) + i5)}}[(0, 4)].0, Elemwise{sub,no_inplace}.0, Elemwise{Composite{(Switch(i0, i1, Switch(i2, (-scalar_softplus(i3)), i1)) + Switch(i2, (
0.9% 34.1% 0.300s 8.65e-07s 346725 105 sigmoid(Subtensor{int64:int64:}.0)
0.9% 35.0% 0.293s 8.45e-07s 346725 204 Elemwise{Composite{((((i0 * i1) + (i2 * i3) + (i4 * i3) + (i5 * i1)) * i6) + Switch(i7, i8, Switch(i9, Switch(i10, (scalar_sigmoid(i11) * i12), i8), i8)) + (i13 * i6))}}[(0, 0)](Elemwise{Composite{(Switch(i0, i1, Switch(i2, (-i3), i1)) + Switch(i2, (-i4), i1) + i5)}}[(0, 4)].0, Elemwise{sub,no_inplace}.0, Elemwise{Composite{(Switch(i0, i1, Switch(i2, (-i3), i1)) + Switch(i2, (-i4), i1) + i5)}}[(0, 4)].0, phi, Elemwise{Composite{(i0 + (i1 * i2) + i3)
0.9% 35.9% 0.282s 8.13e-07s 346725 118 Elemwise{exp,no_inplace}(Elemwise{neg,no_inplace}.0)
0.9% 36.8% 0.281s 8.12e-07s 346725 3 InplaceDimShuffle{x}(e)
0.8% 37.6% 0.263s 7.58e-07s 346725 81 Elemwise{Composite{(Switch(i0, i1, Switch(i2, (-scalar_softplus(i3)), i1)) + Switch(i2, (-i4), i1) + i5)}}[(0, 4)](Elemwise{Composite{EQ((i0 - i1), i2)}}.0, TensorConstant{0}, Elemwise{Composite{(GE(i0, i1) * LE(i0, i2) * i3 * i4)}}[(0, 3)].0, Subtensor{int64}.0, Elemwise{scalar_psi}.0, Elemwise{Switch}[(0, 1)].0)
... (remaining 196 Apply instances account for 62.42%(20.23s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
- Try the Theano flag floatX=float32
We don't know if amdlibm will accelerate this scalar op. scalar_psi
We don't know if amdlibm will accelerate this scalar op. Psi
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. Psi
We don't know if amdlibm will accelerate this scalar op. Psi
We don't know if amdlibm will accelerate this scalar op. scalar_psi
We don't know if amdlibm will accelerate this scalar op. Psi
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.
```
Once using:
```python
%load_ext line_profiler

with gp_fit:
    step = pm.NUTS(profile=True)
    %lprun -f step.astep pm.sample(500, step)
```
Output:
```
Line # Hits Time Per Hit % Time Line Contents
==============================================================
121 def astep(self, q0):
122 500 1151 2.3 0.0 leapfrog = self.leapfrog1_dE
123 500 839 1.7 0.0 Emax = self.Emax
124 500 760 1.5 0.0 e = self.step_size
125
126 500 8408 16.8 0.3 p0 = self.potential.random()
127 500 822595 1645.2 28.0 E0 = self.compute_energy(q0, p0)
128
129 500 2124 4.2 0.1 u = uniform()
130 500 752 1.5 0.0 q = qn = qp = q0
131 500 701 1.4 0.0 p = pn = pp = p0
132
133 500 706 1.4 0.0 n, s, j = 1, 1, 0
134
135 1001 1865 1.9 0.1 while s == 1:
136 501 2323 4.6 0.1 v = bern(.5) * 2 - 1
137
138 501 757 1.5 0.0 if v == -1:
139 247 336 1.4 0.0 qn, pn, _, _, q1, n1, s1, a, na = buildtree(
140 247 981258 3972.7 33.5 leapfrog, qn, pn, u, v, j, e, Emax, E0)
141 else:
142 254 353 1.4 0.0 _, _, qp, pp, q1, n1, s1, a, na = buildtree(
143 254 1083173 4264.5 36.9 leapfrog, qp, pp, u, v, j, e, Emax, E0)
144
145 501 1226 2.4 0.0 if s1 == 1 and bern(min(1, n1 * 1. / n)):
146 3 6 2.0 0.0 q = q1
147
148 501 773 1.5 0.0 n = n + n1
149
150 501 2031 4.1 0.1 span = qp - qn
151 501 9255 18.5 0.3 s = s1 * (span.dot(pn) >= 0) * (span.dot(pp) >= 0)
152 501 816 1.6 0.0 j = j + 1
153
154 500 1407 2.8 0.0 p = -p
155
156 500 1528 3.1 0.1 w = 1. / (self.m + self.t0)
157 500 1110 2.2 0.0 self.Hbar = (1 - w) * self.Hbar + w * \
158 500 1461 2.9 0.0 (self.target_accept - a * 1. / na)
159
160 500 3432 6.9 0.1 self.step_size = exp(self.u - (self.m**self.k / self.gamma) * self.Hbar)
161 500 975 1.9 0.0 self.m += 1
162
163 500 640 1.3 0.0 return q
```
Notably, everything seems to happen in C. I also can't find any calls to `tensor.Split` as suggested in https://github.com/Theano/Theano/issues/3039.
Interesting! Could have been a different model I was running? Not sure.
Looks like Join was taking 8%. Maybe we could squeeze a little out of that if we thought about it? But that's not very big.
What does the call profiler+line profiler look like for buildtree? From the line profiler, I can't tell if all that time is just recursive time spent in compute_energy or if there's substantial unpacking cost.
I wonder if there's a way to potentially save on compute_energy. It's taking up ~30%, but won't the energy have been computed before (from the end of the last iteration)? I guess it should be cached already? Not sure if it's actually getting cached. Actually, compute_energy seems to be taking both q and p, which is maybe a mistake, especially if it makes caching harder. We should maybe move the p part of the computation out.
I'm actually surprised that compute_energy is taking up 30% the first time it's called, since the energy is computed in leapfrog, which is computed many times. Something unintuitive is going on there. You should post the profile for compute_energy.
Also, it's possible we could save substantial compute by only ever computing differences in energies.
One way we could do that is by making the current energy function compute the energy relative to a dummy state (maybe the start state passed to NUTS at init). That seems kind of inelegant, but I think it would work. Not sure how much it would save.
How does this compare to a non-hierarchical model?
One other thing I found is that the auto-transform of models caused many of the stalling NUTS problems. For example, the stoch vol model worked quite well until https://github.com/pymc-devs/pymc3/commit/c3120bce05bf8f1389272e1c38ddf83cb46c8d84.
If you switch to `transform=None`, sampling in NUTS is much, much faster. This is the model I used to test between revisions:
```python
import numpy as np
import pymc3 as pm
from pymc3.math import exp
from pymc3.distributions.timeseries import *

n = 100
returns = np.genfromtxt('../../../pymc3/examples/data/SP500.csv')[-n:]  # get_data_file('pymc3.examples', "data/SP500.csv"))[-n:]

with pm.Model() as model:
    sigma = pm.Exponential('sigma', 1. / .02, testval=.1)  # , transform=None)
    nu = pm.Exponential('nu', 1. / 10)  # , transform=None)
    s = GaussianRandomWalk('s', sigma ** -2, shape=n)
    try:
        r = pm.StudentT('r', nu, lam=exp(-2 * s), observed=returns)
    except AttributeError:
        r = pm.T('r', nu, lam=exp(-2 * s), observed=returns)
```
Trace with transforms:
Trace without transforms:
Well, that's depressing. Isn't that exactly the opposite of what we would expect?
It could be specific to the model. Maybe you can test that in the models that stopped working for you? It is suspicious that the values in the posterior are so different though.
Whoa, I didn't realize this. That seems bad
It also seems quite confusing.
@jsalvatier Can you take a closer look at this? Is it possible that due to the small numerical values of returns, the transformation warps the space too strongly?
I think the changed curvature could indeed be behind this. The `sigma` test_value is set to 0.0138 (or -4.27 in log-space), which is quite small and far away from the true posterior. Changing it to 1 creates good convergence for the transformed and non-transformed model. If my intuition is correct, one would need to take huge step-sizes in log-space to change `sigma` in a meaningful way.
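For concreteness, the change described above amounts to something like this (a sketch against the model posted earlier):

```python
# Start sigma near 1 instead of the small test value used before.
sigma = pm.Exponential('sigma', 1. / .02, testval=1.)
```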
@bob-carpenter Have you experienced issues like this before where a transform would lead to poorer convergence due to warping?
We haven't tried a lot of variations in Stan. You have to consider the Jacobian adjustment along with the gradient on the constrained scale to see the space you're really working in. I'd be very interested to see the effects of different transforms.
The biggest problem is getting close to zero---it's like a lognormal in that you can get close, but there's not actually support at zero.
If you have to cover a long range, it can be difficult to adapt or it can take many steps (i.e., it'll be slow).
The bigger problem is usually varying curvature---when there's not a single stepsize that can be optimized.
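For reference, this is the standard change-of-variables adjustment being discussed, written out for the log transform applied to a positive parameter (a sketch of the textbook formula, not pymc3's code):

```latex
% With y = log(sigma), the unconstrained density picks up a Jacobian factor:
\[
  y = \log \sigma, \qquad
  p_Y(y) = p_\sigma\!\left(e^{y}\right) \left|\frac{d\sigma}{dy}\right|
         = p_\sigma\!\left(e^{y}\right) e^{y},
  \qquad
  \log p_Y(y) = \log p_\sigma\!\left(e^{y}\right) + y .
\]
```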
Thanks @bob-carpenter, that all makes sense.
I don't think the transforms really introduced any other problem than this. It shouldn't be a blocker.
Shouldn't MAP usually put us in a good location on the transformed space? You would think so, yet models tend to do poorly when initialized to the MAP estimate.
I'd love to see a study of how well this works compared to random inits. The mode's usually not even in the typical set of a complicated distribution (meaning you won't get close to the mode through sampling from the posterior, MCMC or otherwise). Even a multivariate normal once you get to more than a handful of dimensions doesn't include the mode in the typical set. And the mode's equal to the mean for a multivariate normal, so even the posterior mean isn't in the typical set. And initializing at the mode has the disadvantage of a relatively high density and a zero gradient---it's more this effect I want to understand.
The typical set is the set we want MCMC to sample from; because it covers almost all of the posterior probability mass, it's sufficient for calculating expectations: https://en.wikipedia.org/wiki/Typical_set
The whole goal of warmup is to get to the first effective draw---that is, to find the typical set.
I'm having the same problem using the code below. My dataset has 134 rows, 6 predictors, all predictors and response standardized.
```python
with model_1:
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace_1 = pm.sample(2000, step, njobs=4)
```
Curiously, it only gets slow for hierarchical models. Using `glm` for a regular regression, pymc's sampling is done in a minute, whereas a hierarchical model shows ~4 hours.
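For comparison, a hedged sketch of the non-hierarchical glm regression mentioned above (older pymc3 formula interface; `'y ~ x1 + x2'` and `df` are placeholders for the actual formula and data):

```python
with pm.Model() as flat_model:
    # Build an ordinary pooled regression from a Patsy-style formula.
    pm.glm.glm('y ~ x1 + x2', df)
    flat_trace = pm.sample(2000, pm.NUTS())
```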
I've had to revert to using Metropolis. This runs fine (it takes ~1 minute for the same hierarchical model):
```python
with model_3:
    step = pm.Metropolis()
    map_estimate = pm.find_MAP()
    trace_3 = pm.sample(4000, step, start=map_estimate, njobs=4, random_seed=42)
```
@nigeljyng I just wrote a blog post on this: http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/
tl;dr: just use `pm.sample(4000, tune=1000)[1000:]`, do not define a step-method yourself.
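For readers landing here later, a hedged sketch of the non-centered ("offset") parameterization the blog post advocates (names such as `mu_a`, `sigma_a`, and `n_groups` are illustrative, not from this thread):

```python
with pm.Model() as hierarchical:
    mu_a = pm.Normal('mu_a', mu=0., sd=10.)
    sigma_a = pm.HalfCauchy('sigma_a', 5.)
    # Centered form (prone to the funnel):
    #   a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_groups)
    # Non-centered form: sample a standardized offset and rescale it.
    a_offset = pm.Normal('a_offset', mu=0., sd=1., shape=n_groups)
    a = pm.Deterministic('a', mu_a + a_offset * sigma_a)
```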
Great blog post, @twiecki, thanks for writing this! I have a sneaking feeling this may explain at least part of the stalling/slowing issues I described here as well. We see this kind of funnel ("of hell") in many of our hierarchical models, which are already pretty slow in the best of cases due to the size of the model and/or dataset. @jake-westfall and I plan to explore using the offset parameterization as the default way of handling random effects in Bambi (we just opened an issue for this).
One suggestion: since the conjecture is that sampling may get (much) slower when the sampler happens to be in a region of space characterized by very small random effect sigmas, it would be helpful to be able to test this directly. As far as I can tell, pymc3 doesn't currently log the time taken to accept each sample, but adding such functionality would make it much easier to explore how sampling time varies as a function of the current parameter values.
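One way to approximate this without new pymc3 functionality would be to time draws manually with the iterative sampler (a sketch, assuming the `pm.iter_sample` generator available at the time, which yields the growing trace one draw at a time; `model` stands in for the user's model):

```python
import time
import pymc3 as pm

draw_times = []
with model:
    step = pm.NUTS()
    last = time.time()
    # Record wall-clock time per draw so slow regions of parameter space
    # can be matched against the current parameter values in the trace.
    for trace in pm.iter_sample(2000, step):
        now = time.time()
        draw_times.append(now - last)
        last = now
```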
@tyarkoni Great, I was actually thinking about what you guys were doing in Bambi, glad it was useful.
https://github.com/pymc-devs/pymc3/pull/1687 actually adds functionality to check how deep the tree created by NUTS is. Looking at these stats more carefully should provide some insights.
@twiecki I used the non-centered reparameterization for my varying effects and used `pm.sample(4000, tune=1000)[1000:]`, but NUTS still gets stuck when it hits 1% of sampling (at 40 iterations, time remaining goes up to 10 hours).
I had to resort to using `trace_3 = pm.sample(10000, tune=1000, step=pm.Metropolis(), njobs=4)[1000:]`. But the model never converges (bad Gelman-Rubin Rhat values). Here's the joint posterior. It seems like it has a hard time exploring the space.
Any ideas? I know it might be difficult to diagnose without looking at the data. I can try to recreate a toy dataset and see if the problem persists.
@nigeljyng Something must be wrong with your model, I think.
@twiecki I ran the same model with PyStan and it worked fine though, good-looking chains and Rhats. I'd much rather be using PyMC3, though. Will see if this happens again in future models and data.
@nigeljyng Hm, can you post the model? Preferably a self-contained NB.
EDIT: I'll post an NB in a bit so it's easier for you to work with.
@twiecki Found a silly typo in the model that made it reference a different (unstandardized) dataset. Your offset fix works fine now! Thanks for your time
@nigeljyng Great, glad it's working. So you're all set then?
@twiecki yep! Thanks for writing the blog post, that really helped the sampling.
Can we close this?
I think so. Closing...
I'm running the script below with 2,000 samples. It's surprisingly slow, a bit under 9 s/sample under `Python 2.7.11 :: Anaconda custom (x86_64)` on a Mid-2015 MacBook Pro with 2.5 GHz CPUs (four of them, but the sampler seems to be single-threaded). I'd be grateful for any suggestions about how to speed it up.