Thanks for reporting this. Actually, I get nan lower bounds with 6 features as well. I can push it a little larger using T.dot instead of summing, but it eventually produces nans as well. I suspect we are running into numerical issues. Increasing the learning rate makes it work. Here is a run with 10 features and learning_rate=0.1:
import numpy as np
from numpy import random
import theano.tensor as T
from pymc3 import Model, Normal, Bernoulli, variational

def invlogit(x):
    # NumPy inverse-logit, used to generate the fake outcomes
    return 1. / (1. + np.exp(-x))

def tinvlogit(x):
    # Theano inverse-logit, used inside the model
    return T.exp(x) / (1. + T.exp(x))

# Set up basic parameters
num_features = 10
n = 100000

# Choose random values for the actual alpha and betas
alpha_a = random.normal(size=1)
betas_a = random.normal(size=num_features)

# Create fake predictor data
predictors = random.normal(size=(n, num_features))

# Calculate the outcomes
outcomes = random.binomial(1, invlogit(alpha_a + np.sum(betas_a[None, :] * predictors, 1)))

model = Model()
with model:
    alpha = Normal('alpha', mu=0, tau=2. ** -2)
    betas = Normal('betas', mu=0, tau=2. ** -2, shape=num_features)
    p = tinvlogit(alpha + T.dot(predictors, betas))
    o = Bernoulli('o', p, observed=outcomes)

with model:
    mu, sds, elbo = variational.advi(n=10000, learning_rate=1e-1)
Iteration 0 [0%]: ELBO = -146062.98
Iteration 1000 [10%]: ELBO = -32866.29
Iteration 2000 [20%]: ELBO = -32868.08
Iteration 3000 [30%]: ELBO = -32854.3
Iteration 4000 [40%]: ELBO = -32870.66
Iteration 5000 [50%]: ELBO = -32857.08
Iteration 6000 [60%]: ELBO = -32855.74
Iteration 7000 [70%]: ELBO = -32853.19
Iteration 8000 [80%]: ELBO = -32857.2
Iteration 9000 [90%]: ELBO = -32849.4
Finished [100%]: ELBO = -32843.18
print(mu)
{'alpha': array(-0.446285270790872), 'betas': array([-0.83721635, 0.88664962, 1.89889138, -0.8768909 , 0.98864731,
-0.84737285, 0.39511949, 1.53239494, -0.81602863, -0.92677491])}
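As a quick sanity check (not part of the original run), these means could be compared against the generating values from the snippet above:

print(betas_a)                              # true coefficients used to simulate the data
print(np.abs(mu['betas'] - betas_a).max())  # largest absolute recovery error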
CC @taku-y who might also have insight into this.
Interesting. I tried running both ADVI and minibatch ADVI on my real data yesterday (~80 features, 60,000 samples), and I couldn't get anything useful out no matter how much I changed the parameters. (Of course, there could well be a bug in my code.)
We probably need ways of tuning the ADVI hyperparameters. Also, you could try running the PyMC3 model using Edward as the computational backend and see if you get anywhere with it. That should at least isolate whether it is an issue with our code or with your model.
I will check this over the weekend.
I modified tinvlogit(). It works for me (https://gist.github.com/taku-y/058ac94f2ad4e8685c0681480b6f4747):

import theano.tensor as tt

def tinvlogit(x):
    # Clip the output away from exactly 0 and 1 so log(p) and log(1 - p) stay finite
    lower = 1e-6
    upper = 1 - 1e-6
    return lower + (upper - lower) * tt.exp(x) / (1 + tt.exp(x))
In the Bernoulli class, log(tinvlogit(alpha + T.dot(predictors, betas))) is calculated. If the value inside the parentheses is too large or too small, p becomes exactly 1 or 0 and the log results in nan.
For samplers that do not use gradients (e.g., Metropolis-Hastings), samples with nan values may simply be rejected, but I think the same problem can occur with gradient-based samplers like HMC or NUTS.
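To make the failure mode concrete, here is a small NumPy illustration (not from the original thread) of how the unclipped inverse-logit saturates in float64:

import numpy as np

x = 40.0                      # a large linear predictor
p = 1.0 / (1.0 + np.exp(-x))  # rounds to exactly 1.0 in float64
print(p)                      # 1.0
print(np.log(1.0 - p))        # -inf: the log-likelihood of an observed 0 is no longer finite

Once any log term is non-finite, the ELBO and its gradient follow.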
Awesome, thanks @taku-y! I wonder if we should add this function to pymc3.
We should definitely add logit and inverse-logit functions. That one is a little inefficient actually, as it calculates exp(x) twice. Using 1/(1+T.exp(-x)) is better.
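For illustration, the two suggestions could be combined like this (a sketch; the eps bound is carried over from the gist above):

import theano.tensor as tt

def tinvlogit(x, eps=1e-6):
    # One exp evaluation, output clipped to [eps, 1 - eps]
    return eps + (1 - 2 * eps) / (1 + tt.exp(-x))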
Spoke too soon. Still broken.
Hello all, I am a newbie, so I am not sure which fixes are in place and which are not. I am working on the Dorothea dataset hosted at UCI. The data is 1150 observations with 100,000 sparse (binary) features. I too get NaNs for the ELBO, even after feature selection down to 17 features (from 100,000). Is there any trick I can employ? @taku-y @twiecki @fonnesbeck
FYI: the Dorothea dataset and its variations featured in two different competitions, KDD 2001 and NIPS 2003. Winners of one of the competitions were Radford Neal & Zhang (University of Toronto), who seem to have used Bayesian neural networks with Dirichlet diffusion trees. They discussed the technique as applied to a number of datasets (Dorothea being one of four, I think), and their write-up is attached here. Neal & Zhang used more than 500 features after employing ARD (I think). I am studying machine learning, and I shall be eternally grateful for any guidance you may provide! Thank you all. Neal-Zhang.pdf
Hi,
I've been testing out the new ADVI feature the last couple of days. If I only have a small number of features, it seems to produce good values, but if I increase the number of features, I keep getting NaNs.
Here's a basic logistic regression based on one of your examples:
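A minimal sketch of the kind of model described, modeled on the reproduction earlier in this thread (the feature count, priors, and data sizes here are assumptions, since the original code block was not preserved):

import numpy as np
from numpy import random
import theano.tensor as T
from pymc3 import Model, Normal, Bernoulli, variational

num_features = 7  # assumption: the smallest feature count that reportedly fails
n = 100000

# Simulate logistic-regression data
alpha_a = random.normal(size=1)
betas_a = random.normal(size=num_features)
predictors = random.normal(size=(n, num_features))
outcomes = random.binomial(1, 1. / (1. + np.exp(-(alpha_a + predictors.dot(betas_a)))))

with Model() as model:
    alpha = Normal('alpha', mu=0, tau=2. ** -2)
    betas = Normal('betas', mu=0, tau=2. ** -2, shape=num_features)
    # Unclipped inverse-logit, as in the runs above
    p = T.exp(alpha + T.dot(predictors, betas)) / (1 + T.exp(alpha + T.dot(predictors, betas)))
    o = Bernoulli('o', p, observed=outcomes)
    mu, sds, elbo = variational.advi(n=10000)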
This is the output I get: as long as the number of features is fewer than 7, it seems to work OK, but as soon as I go up to 7 or more features, I only get NaNs back. Any idea what could be happening? It seems very strange that just going from 6 to 7 features could produce this behavior.