pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

ADVI returns NaN with many features #1147

Closed: parsing-science closed this issue 8 years ago

parsing-science commented 8 years ago

Hi,

I've been testing out the new ADVI feature the last couple of days. If I only have a small number of features, it seems to produce good values, but if I increase the number of features, I keep getting NaNs.

Here's a basic logistic regression based on one of your examples:

import numpy as np
from numpy import random
import theano.tensor as T
from pymc3 import Model, Normal, Bernoulli, variational

# Inverse logit on NumPy arrays (used to generate the fake data)
def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))

# Inverse logit as a Theano expression (used inside the model)
def tinvlogit(x):
    return T.exp(x) / (1 + T.exp(x))

# Set up basic parameters
num_features = 7
n = 100000

# Choose random values for the actual alpha and betas
alpha_a = random.normal(size=1)

betas_a = random.normal(size=num_features)

# Create fake predictor data
predictors = random.normal(size=(n, num_features))

# Calculate the outcomes
outcomes = random.binomial(1, invlogit(alpha_a + np.sum(betas_a[None, :] * predictors, 1)))

model = Model()

with model:
    alpha = Normal('alpha', mu=0, tau=2. ** -2, shape=1)
    betas = Normal('betas', mu=0, tau=2. ** -2, shape=(1, num_features))

    p = tinvlogit(alpha + T.sum(betas * predictors, 1))

    o = Bernoulli('o', p, observed=outcomes)

with model:
    mu, sds, elbo = variational.advi(n=10000)

This is the output I get:

Iteration 0 [0%]: ELBO = -183969.24
Iteration 1000 [10%]: ELBO = nan
Iteration 2000 [20%]: ELBO = nan
Iteration 3000 [30%]: ELBO = nan
Iteration 4000 [40%]: ELBO = nan
Iteration 5000 [50%]: ELBO = nan
Iteration 6000 [60%]: ELBO = nan
Iteration 7000 [70%]: ELBO = nan
Iteration 8000 [80%]: ELBO = nan
Iteration 9000 [90%]: ELBO = nan
Finished [100%]: ELBO = nan

As long as the number of features is fewer than 7, it seems to work ok, but as soon as I go up to 7 or more features, I only get NaNs back.

Any idea what could be happening? It seems very strange that just going from 6 to 7 features could produce this behavior.

fonnesbeck commented 8 years ago

Thanks for reporting this. Actually, I get nan lower bounds with 6 features as well.

fonnesbeck commented 8 years ago

I can push it a little larger using T.dot instead of summing, but it eventually produces nans as well. I suspect we are running into numerical issues.
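
That is, replacing the elementwise product-and-sum with a dot product for the linear predictor, as in the full example below:

p = tinvlogit(alpha + T.dot(predictors, betas))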

fonnesbeck commented 8 years ago

Increasing the learning rate makes it work. Here is a run with 10 features and learning_rate=0.1:

# Set up basic parameters
num_features = 10
n = 100000

# Choose random values for the actual alpha and betas
alpha_a = random.normal(size=1)

betas_a = random.normal(size=num_features)

# Create fake predictor data
predictors = random.normal(size=(n, num_features))

# Calculate the outcomes
outcomes = random.binomial(1, invlogit(alpha_a + np.sum(betas_a[None, :] * predictors, 1)))

model = Model()

with model:
    alpha = Normal('alpha', mu=0, tau=2.**-2)
    betas = Normal('betas', mu=0, tau=2. ** -2, shape=num_features)

    p = tinvlogit(alpha + T.dot(predictors, betas))

    o = Bernoulli('o', p, observed=outcomes)

with model:
    mu, sds, elbo = variational.advi(n=10000, learning_rate=1e-1)

Iteration 0 [0%]: ELBO = -146062.98
Iteration 1000 [10%]: ELBO = -32866.29
Iteration 2000 [20%]: ELBO = -32868.08
Iteration 3000 [30%]: ELBO = -32854.3
Iteration 4000 [40%]: ELBO = -32870.66
Iteration 5000 [50%]: ELBO = -32857.08
Iteration 6000 [60%]: ELBO = -32855.74
Iteration 7000 [70%]: ELBO = -32853.19
Iteration 8000 [80%]: ELBO = -32857.2
Iteration 9000 [90%]: ELBO = -32849.4
Finished [100%]: ELBO = -32843.18

print(mu)

{'alpha': array(-0.446285270790872), 'betas': array([-0.83721635,  0.88664962,  1.89889138, -0.8768909 ,  0.98864731,
       -0.84737285,  0.39511949,  1.53239494, -0.81602863, -0.92677491])}

twiecki commented 8 years ago

CC @taku-y who might also have insight into this.

parsing-science commented 8 years ago

Interesting. I tried running both ADVI and minibatch ADVI on my real data yesterday (~80 features, 60,000 samples), and I couldn't get anything useful out no matter how much I changed the parameters. (Of course, there could well be a bug in my code.)

fonnesbeck commented 8 years ago

We probably need ways of tuning the ADVI hyperparameters. Also, you could try running the PyMC3 model with Edward as the computational backend and see if you get anywhere with it. That should at least isolate whether the issue is in our code or in your model.

taku-y commented 8 years ago

I will check this over the weekend.

taku-y commented 8 years ago

I modified tinvlogit(). It works for me (https://gist.github.com/taku-y/058ac94f2ad4e8685c0681480b6f4747):

def tinvlogit(x):
    # clip the output to [1e-6, 1 - 1e-6] so log(p) and log(1 - p) stay finite
    lower = 1e-6
    upper = 1 - 1e-6
    return lower + (upper - lower) * T.exp(x) / (1 + T.exp(x))

In the Bernoulli class, log(p) and log(1 - p) are computed, where p = tinvlogit(alpha + T.dot(predictors, betas)). If the argument is very large or very small, p saturates to exactly 1 or 0, the log becomes -inf, and the result is nan.
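
A quick NumPy check (not part of the gist, just an illustration of how the naive formula saturates in float64):

import numpy as np

p = np.exp(40.0) / (1 + np.exp(40.0))
print(p)                                     # exactly 1.0 in float64
print(np.log(1 - p))                         # -inf, which then propagates as nan
print(np.exp(800.0) / (1 + np.exp(800.0)))   # exp overflows to inf, so inf / inf gives nan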

With samplers that do not use gradients (e.g., MH), samples with nan values may simply be rejected. But I think the same problem can occur with gradient-based samplers like HMC or NUTS.

twiecki commented 8 years ago

Awesome, thanks @taku-y! I wonder if we should add this function to pymc3.

fonnesbeck commented 8 years ago

We should definitely add logit and inverse-logit functions. That one is a little inefficient actually, as it calculates exp(x) twice. Using 1/(1+T.exp(-x)) is better.
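
Combining the two suggestions (a single exp call plus clipping away from exactly 0 and 1), such a helper might look roughly like the sketch below; this is just an illustration, not the function that ended up in PyMC3:

def tinvlogit(x, eps=1e-6):
    # numerically safer sigmoid: one exp call, output clipped to (eps, 1 - eps)
    return (1 - 2 * eps) / (1 + T.exp(-x)) + eps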

fonnesbeck commented 8 years ago

Spoke too soon. Still broken.

shirishr commented 7 years ago

Hello all, I am a newbie, so I am not sure which fixes are in place and which are not. I am working on the Dorothea dataset hosted at UCI: 1150 observations with 100,000 sparse (binary) features. I too get NaNs for the ELBO, even after feature selection down to 17 features (from 100,000). Is there any trick I can employ? @taku-y @twiecki @fonnesbeck

shirishr commented 7 years ago

FYI: the Dorothea dataset and its variations featured in two different competitions, KDD 2001 and NIPS 2003. The winners of one of the competitions were Radford Neal & Zhang (Univ. of Toronto), who seem to have used Bayesian neural networks with Dirichlet diffusion trees. They discuss the technique as applied to a number of datasets (Dorothea being one of four, I think); their writeup is attached here. Neal & Zhang used more than 500 features after employing ARD (I think). I am studying machine learning and would be eternally grateful for any guidance you can provide. Thank you all. Neal-Zhang.pdf