My take is that with such a large number of observations and such a high-dimensional parameter vector, the surface of the logp is just a sharp peak around the MAP/MLE, which makes gradient estimation very difficult and causes it to fail consistently (the gradient is near zero everywhere but extremely large around the MAP/MLE). I'm not sure what a good solution is here; I would probably just settle for a point estimate.
@junpenglao Thanks for your message.
One reason I am seeking posterior samples is that we have a number of diagnostics where we'd like posterior uncertainty metrics. From a sample of ~1000 different parameter vectors, I could simulate outcome data and calculate a posterior distribution over whatever outcome statistics I need. The simplest would just be posterior order statistics over the parameter values themselves, but I have many more complicated use cases, like posterior uncertainty in the fraction of positive outcomes conditional on certain characteristics in the feature vector, etc.
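As a concrete (hypothetical) example of the kind of summary I mean, assuming a trace of beta draws and the design matrix, and working on a subsample to keep memory manageable:

# Hypothetical posterior summary: uncertainty in the fraction of positive outcomes
# (assumes `trace['beta']` of shape (n_draws, 500) and a `covariates` design matrix)
import numpy as np

X_sub = covariates[:10000]                              # subsample to keep memory manageable
p = 1.0 / (1.0 + np.exp(-trace['beta'].dot(X_sub.T)))   # (n_draws, n_obs) predicted probabilities
frac_positive = p.mean(axis=1)                          # per-draw fraction of positive outcomes
print(np.percentile(frac_positive, [2.5, 50.0, 97.5]))  # posterior interval for that statistic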
The current production use case does rely only on a point estimate with standard MLE optimization. But my specific experimentation with pymc is to address posterior uncertainty questions that cannot be addressed adequately if we only have access to a point estimate of the parameters.
I also wonder, based on your comment, whether there would be a way to use an extremely tiny step size or energy threshold for plain Metropolis, so that it is unaffected by the gradient issue.
The trouble is that even if I use plain Metropolis, the sampling time is unacceptably slow (it does give accurate results on a subsample of the data, just too slowly).
Metropolis doesn't work in high dimensions (e.g., see the blog post from @ColCarroll). I think it would be great to get NUTS working, but if that is really difficult because the observed data is so large, I would use minibatch ADVI instead.
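Something along these lines (a rough sketch; the batch size and iteration count are just placeholders):

# Rough sketch of minibatch ADVI (batch size and iteration count are placeholders):
X_batch = pm.Minibatch(covariates, batch_size=500)
y_batch = pm.Minibatch(observed_labels, batch_size=500)

with pm.Model() as minibatch_model:
    beta = pm.Normal('beta', 0.0, sd=1.0, shape=500)
    pm.Bernoulli('likelihood',
                 logit_p=tt.dot(X_batch, beta),
                 observed=y_batch,
                 total_size=len(observed_labels))   # rescale the minibatch likelihood to the full data
    approx = pm.fit(n=20000, method='advi')
    trace = approx.sample(1000)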
@junpenglao I agree Metropolis would require an insane number of samples to converge.
However, even when I tried mini-batch NUTS and ADVI, it does not work. I get the "bad energy" at the initial state constantly unless I use find_MAP as the start, and no matter what ADVI settings I have experimented with, when I try using ADVI through pm.fit instead of directly sampling, the parameters snap to NaN immediately. I think ADVI simply can't cope with hundreds of parameters like this; there are always underflow/overflow issues.
But I am very open to more experiments to try to figure it out.
I would scale the sd to 1 in the prior:
beta = pm.Normal('beta', 0.0, sd=1.0, shape=500)
as it is likely not very meaningful for the coefficients to take very large or very small values - recall that the sigmoid squashes all values, and its gradient is very small in the tails, which gives you lots of problems in this case.
That is a good observation, and it's probably a good idea for me to change that. However, even after changing this, when I use NUTS (with the start set to the zero vector), the ADVI initialization takes ~5 minutes and the subsequent NUTS sampling rate is around 1 sample every 4-5 seconds, which looks like it would take over an hour for only 1000 samples (assuming convergence is good enough with that many samples).
NUTS also goes through stretches where the time per sample jumps as high as 30-40 seconds. It's this slow sampling that I am trying to diagnose, to understand whether I can make NUTS amenable to this problem.
What about using NUTS with the default initialization, i.e. directly calling trace = pm.sample(1000, tune=1000)? Presumably, if the tuning is doing its job, you should see a speed-up after the initial ~500 samples.
I am trying some experiments with different NUTS initializers. But the problem seems to be that even the initialization takes an incredible amount of time.
I am not seeing any speed-up once the tuning samples are exceeded; in fact I only see NUTS slowing down (sometimes significantly) as it goes, from 4-5 seconds per draw up to 40+ seconds per draw in the middle of sampling.
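For reference, the NUTS sampler statistics that could shed light on the slowdown (assuming the trace returned by pm.sample) would be something like:

# NUTS sampler statistics that may explain the slowdown (assumes a trace from pm.sample):
print(trace.get_sampler_stats('depth').max())          # are trees hitting the maximum depth?
print(trace.get_sampler_stats('tree_size').mean())     # average number of leapfrog steps per draw
print(trace.get_sampler_stats('diverging').sum())      # number of divergent transitions
print(trace.get_sampler_stats('step_size_bar')[-5:])   # adapted step size near the end of tuning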
I encourage you to try it too with the example data set code I gave in the original post here. This is reproducible just by generating a ~1.5 million × 500 design matrix and the corresponding logistic outcome labels.
Also, if I try using ADVI in any capacity whatsoever, it always results in NaN for the parameter vector components. I've tried everything to figure that out, but nothing seems to work. Even if I only look at ~1000 observations but still 500 features, ADVI constantly produces NaN. It seems it just is not able to handle large parameter vectors.
The expression 1. / (1. + exp(-x)) is numerically bad. Just use the built-in logit_p argument of pm.Bernoulli:
with pm.Model() as logistic_model:
    beta = pm.Normal('beta', 0.0, sd=3.0, shape=500)
    logit_p = tt.dot(covariates, beta)
    pm.Bernoulli('likelihood', logit_p=logit_p, observed=observed_labels)

with logistic_model:
    trace = pm.sample(1000, njobs=2, nchains=2)
With this code, it seems to sample:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [beta]
Sampling 2 chains: 6%|▌ | 185/3000 [07:48<3:45:45, 4.81s/draws]
This is on quite a beefy machine (20 cores), using numpy from anaconda, so that we use MKL for the dot product (that is running in parallel). It is possible that the posterior is highly correlated, in that case dense mass matrix adaptation would be nice (I might add that to pymc3 at some point, but there isn't much call for it).
When you have that much IID data for so few parameters, you could probably also invoke the Bernstein-von Mises theorem and compute the MAP and the Hessian. I guess you could also use the Hessian as the mass matrix. That way you'd still get the actual posterior, but the Hessian should deal with the correlations of the posterior, making things easier for the sampler even without dense mass matrix adaptation.
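A rough sketch of that MAP + Hessian (Laplace) approach, reusing the logistic_model from above (variable names here are just assumptions):

# Rough sketch of a Laplace approximation (MAP + Hessian), reusing logistic_model from above:
import numpy as np

with logistic_model:
    map_point = pm.find_MAP()
    hess = pm.find_hessian(map_point, vars=[beta])   # Hessian at the MAP
cov = np.linalg.inv(hess)                            # approximate posterior covariance
draws = np.random.multivariate_normal(map_point['beta'], cov, size=1000)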
@aseyboldt Thank you for the reply. I think your suggestion about logit_p is great and definitely better than a naive inverse-logit expression.
However, the sampling speed in the example you pasted (around 4-5 seconds per draw) is typical of what I was already seeing with NUTS (although I have always had to provide either the zero vector or find_MAP() as the start -- if I let it choose a random start parameter vector, I get a "bad energy" error every time).
So I don't think the change to logit_p is improving sampling or speed here. It's still too slow to be viable for the use case. The other thing to note is that the sampling time degrades badly as the iterations go on: after a few hundred iterations, the time per sample goes from 4-5 seconds per draw up to 40+ seconds per draw, and sometimes these times oscillate wildly.
By comparison, many packages converge to a good MLE point estimate in under 10 minutes. Maybe I can stretch that time frame to 20 or 30 minutes, but if it takes more than 30 minutes to generate a large set of good posterior samples, then the sampling approach is not tractable for this use case (which is a bummer, since posterior inference would be much better for this problem than settling for a point estimate).
The goal was not merely to get NUTS working, but rather to get a converged sample of ~1000 draws in around 30 minutes at this data dimensionality. I confess I am surprised that is not straightforward for this model.
I don't think it will work in only 30 minutes on a CPU. Sampling is just slower than optimization (it is solving a more difficult problem after all). But you could try out closed form approximations to the posterior (like the MAP/hessian thing), and validate that with draws from the posterior on some datasets. You could also try to use GPUs for this (you can just store the covariates on the GPU, so you don't need to copy all that much).
Oh, or ADVI with a dense covariance matrix.
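E.g. (just a sketch, reusing the model from above):

# Sketch: ADVI with a full (dense) covariance matrix
with logistic_model:
    approx = pm.fit(n=50000, method='fullrank_advi')
    approx_trace = approx.sample(1000)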
I'm not sure that, in a total complexity sense, the task of optimizing for the MLE and then inverting the Hessian for the covariance matrix / standard errors isn't comparable to sampling, for a simple posterior like this.
If I scale down the number of observations (which has a linear effect on the optimizers' runtime), it is suitably faster, even given the large number of feature components.
But NUTS seems almost insensitive to the number of observations, and instead exhibits a far more severe slowdown as the number of features (the dimensionality of the parameter space) is increased. I would have expected that, especially near the MAP estimate, the gradient calculations are trivial and the data size (1.5 MM by 500) shouldn't matter. But even dropping the 1.5 MM down to ~100,000, I still see that NUTS is extremely slow to sample (5+ seconds per sample, with oscillations up to 40+ seconds per sample at times).
What makes matters worse is that I can't even get a pilot sample to assess convergence or accuracy against the true coefficients, since even sampling on a smaller sub-sample is too slow. I could end up spending ~ 3 hours to generate 1000 samples from NUTS on the full data, only to find convergence was poor or something else was wrong.
At the moment, GPU is not an option for the use case given other constraints. I have tried ADVI but have not been able to get it to work, e.g. see here.
For another comparison of problem scale, you can think of a simple logistic model like this as a 1-layer neural network with Gaussian priors on the weights and a binary outcome. 1.5 MM observations and 500 features would be a small instance for a problem like that, even on most modern consumer laptops. Using Keras, you would not at all expect such a single-layer network to take hours to converge with SGD.
I understand NUTS is more computationally costly. I am just surprised for this model it is this much more costly. It suggests some type of pathological complexity growth in the dimensionality of the parameter vector (which maybe is well known to NUTS experts already?).
As I feared, even after letting the job run overnight to generate ~3000 NUTS samples (it took a little over 5 hours, averaging around 6 seconds per sample), I am finding very poor convergence with NUTS, and the resulting posterior mean and median are nowhere close to the true parameter vector for the toy data (while optimizer solutions quickly find an accurate point estimate).
In this case, I get better quality inference just using Metropolis on a subsample of the data. It samples so much faster that I can generate many more (>50,000) samples in the same amount of time, and with a reasonable burn-in choice and thinning choice, the posterior estimates are reasonable. The disconcerting thing is that Metropolis too suffers a big slowdown when using the full data set, but I can't see a reason why it should. Evaluating the likelihood function at a new proposal of the parameter vector, even for 1.5 million observations, should be extremely fast, and the proposal itself will be generated from a multivariate Normal with assumed homoscedastic covariance matrix. I can't see what would create substantial slowdown in the actual mechanics of the Metropolis algorithm. (I agree there can be overall slowdown due to the requirement for a much greater number of samples, but just puzzled by the per-sample slowdown in the Metropolis case).
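For concreteness, the subsample Metropolis run looks roughly like this (the subsample size, burn-in, and thinning choices are somewhat arbitrary):

# Rough shape of the Metropolis-on-a-subsample run
# (subsample size, burn-in, and thinning are somewhat arbitrary choices):
idx = np.random.choice(len(observed_labels), size=50000, replace=False)

with pm.Model() as sub_model:
    beta = pm.Normal('beta', 0.0, sd=1.0, shape=500)
    pm.Bernoulli('likelihood',
                 logit_p=tt.dot(covariates[idx], beta),
                 observed=observed_labels[idx])
    sub_trace = pm.sample(50000, step=pm.Metropolis())

beta_draws = sub_trace['beta'][10000::10]   # discard burn-in and thin by 10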
I did a bit more digging, and it seems the posterior correlations are a problem for NUTS:
with pm.Model() as model:
    intercept = pm.Normal('intercept', sd=3)
    beta = pm.Normal('beta', sd=2, shape=n_features)
    pm.Bernoulli('likelihood',
                 logit_p=intercept + tt.dot(covariates, beta),
                 observed=observed_labels)
    trace = pm.sample()

# correlation matrix of the posterior draws of beta, ordered by the true coefficients
cov = np.cov((trace['beta'][:, np.argsort(true_coefficients)]).T)
corr = cov.copy()
stds = np.sqrt(np.diag(cov))
corr[...] /= stds[None, :]
corr[...] /= stds[:, None]
cluster = sns.heatmap(corr, center=0, vmax=1, vmin=-1)
This looks like something that should be solvable with a reparametrization, but so far I don't know how.
The correlations go away if we rescale by the std of beta:
cov = np.cov((trace['beta'] / trace['beta'].std(1)[:, None])[:, np.argsort(true_coefficients)].T)
corr = cov.copy()
stds = np.sqrt(np.diag(cov))
corr[...] /= stds[None, :]
corr[...] /= stds[:, None]
cluster = sns.heatmap(corr, center=0, vmax=1, vmin=-1)
@aseyboldt Thank you so much for these follow-up comments. They gave me an idea for reparameterizing the model. I don't have a good explanation for why this helps, but so far it seems to be working.
with pm.Model() as logistic_model:
    mu = pm.Normal('mu', 0.0, sd=1)
    sd = pm.HalfCauchy('sd', 1.0)
    beta = pm.Normal('beta', mu, sd=sd, shape=num_coefficients)
    logit = tt.dot(data[:, 1:], beta)
    likelihood = pm.Bernoulli('likelihood', logit_p=logit, observed=data[:, 0])
I reparameterized to introduce hyperpriors on the mean and standard deviation of the distribution that the beta coefficients are drawn from. I was inspired by several blog posts about "non-centered" parameterizations for pesky convergence issues in large hierarchical models, where you introduce a hyperprior for the offset from a group mean and standard deviation.
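For reference, the fully non-centered variant of the same idea would look roughly like this (just a sketch; I have not verified that it helps here):

# Sketch of a non-centered variant of the model above (not verified here):
with pm.Model() as logistic_model_nc:
    mu = pm.Normal('mu', 0.0, sd=1)
    sd = pm.HalfCauchy('sd', 1.0)
    beta_offset = pm.Normal('beta_offset', 0.0, sd=1.0, shape=num_coefficients)
    beta = pm.Deterministic('beta', mu + sd * beta_offset)
    logit = tt.dot(data[:, 1:], beta)
    pm.Bernoulli('likelihood', logit_p=logit, observed=data[:, 0])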
On a sub-sample of my data, I can now just use the default NUTS sampler with no adjustments apart from giving a start value to avoid initial "bad energy" errors. The sampling is about 3 samples per second during the tuning stages, and then improves to about 6 samples per second after tuning. I sampled ~3000 draws in about 14 minutes on a small subset of data, which is waaay more in line with what I expected.
The posterior inference is high-quality as well, with the posterior mean vector being a good estimate of the true parameter vector and the posterior standard deviations reflecting realistic uncertainty values.
Now I am running some experiments on the full set of data. If you have any advice on why this reparameterization would help, or if I am making a silly methodological mistake by approaching it this way, please let me know. Or if you can think of even more useful reparameterizations that could help with speed in the larger data sample.
I'm going to close this as it's not a pymc3 issue, feel free to keep the discussion going however. @espears4sq for future questions please create a discourse account.
Apologies if this is not the right forum for this type of question. I do not have a discourse account, so I was unable to post there.
I am considering pymc for a high-dimensional logistic regression where posterior inference would be helpful. Currently I have an sklearn-based implementation using the built-in LogisticRegression, and it converges quickly (a matter of a few minutes) with an accurate solution on a toy data problem that has the same dimensionality as my real problem (but, of course, it only gives me the point estimate of the coefficient vector).
With pymc, I have tried everything I can think of to experiment with tuning the standard NUTS sampler, and nothing seems to work. The inference is extremely slow, making it untenable to generate enough samples for convergence. Even with a plain Metropolis sampler, it is still quite slow, and because many more samples are needed for convergence, it too isn't working.
I've tried using the auto-tuning default ADVI initialization, starting from the zero vector, starting from the MAP estimate, changing from 'leapfrog' to 'two-stage', using Minibatch, and lots of other things.
The dimensionality of my problem is ~1.5 million observations, each with a 500-component feature vector. The feature vector has been studied for a long time and all of the features are included for specific inference goals. Many are categorical or binary. And I want to stress that I already have a working version of this model with sklearn, used by several teams, with known-sane results for the estimated coefficient vector, but I have a need for certain model diagnostics that would be greatly helped by posterior inference.
Here is the code for generating a toy data sample:
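A minimal sketch of a generator with the dimensions described above (the coefficient scale and random seed are assumptions):

# Minimal sketch of the toy data generator (coefficient scale and seed are assumptions):
import numpy as np

n_obs, n_features = 1500000, 500
rng = np.random.RandomState(42)
covariates = rng.normal(size=(n_obs, n_features))
true_coefficients = rng.normal(scale=1.0, size=n_features)
probs = 1.0 / (1.0 + np.exp(-covariates.dot(true_coefficients)))
observed_labels = rng.binomial(1, probs)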
When I use sklearn to fit a simple LogisticRegression to the above data, it fits the model in under 10 minutes and recovers a good estimate of true_coefficients. In pymc, however, every method I have tried has bad convergence issues or errors.
Here is the code for making the simple pymc logistic regression:
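A sketch consistent with the model discussed in the comments above (the sd=3.0 prior and the explicit sigmoid mirror the earlier snippets):

# Sketch of the simple pymc3 logistic regression (mirrors the snippets in the comments above):
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as logistic_model:
    beta = pm.Normal('beta', 0.0, sd=3.0, shape=n_features)
    p = 1.0 / (1.0 + tt.exp(-tt.dot(covariates, beta)))   # naive inverse-logit
    pm.Bernoulli('likelihood', p=p, observed=observed_labels)
    trace = pm.sample(1000)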
The sklearn code for comparison:
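Roughly (the solver and regularization strength are assumptions):

# Rough sklearn comparison (solver and regularization strength are assumptions):
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, solver='lbfgs', fit_intercept=False)
clf.fit(covariates, observed_labels)
print(np.corrcoef(clf.coef_.ravel(), true_coefficients)[0, 1])   # agreement with the true coefficients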
With NUTS, I almost always get a "bad initial energy" error unless I provide my own start value, either with pm.find_MAP() or by setting it to the zero vector.
With pm.Metropolis(), everything works fine except that for large data it is extremely slow; even ~20,000 samples, which take well over an hour, produce warnings about lack of convergence, and the posterior mean for the coefficients is nowhere close to the true values.
This is a very, very simple model with an extremely simple log-likelihood function. As a result, I would expect that NUTS should sample from the posterior extremely efficiently regardless of the sample size or the number of covariates (which, even at 500, is not very large).
Even if I downscale the toy data set to 1500 observations instead of 1500000, I still see extremely poor sampling speed with default NUTS, around 3-4 draws/s. That process does produce reasonable (but not great) posterior means for the coefficients, which makes me think NUTS should work on this problem, except that it faces a strange slowdown as the sample size increases.
What can I do to diagnose how to best use NUTS or Metropolis to sample from a model fitted to a data set of this dimensionality?
Versions and main components