piEsposito / blitz-bayesian-deep-learning

A simple and extensible library to create Bayesian Neural Network layers on PyTorch.
GNU General Public License v3.0

-inf in log_prior and nan in loss breaks training #43

Closed rafaljanwojcik closed 4 years ago

rafaljanwojcik commented 4 years ago

Hello, first of all amazing work, and thank you for this project! I'm trying to train a simple 3-layer NN and I've encountered some problems I wanted to ask about. Here is my model:

BayesianRegressor(
  (blinear1): BayesianLinear(
    (weight_sampler): GaussianVariational()
    (bias_sampler): GaussianVariational()
    (weight_prior_dist): ScaleMixturePrior()
    (bias_prior_dist): ScaleMixturePrior()
  )
  (relu): ReLU()
  (blinear2): BayesianLinear(
    (weight_sampler): GaussianVariational()
    (bias_sampler): GaussianVariational()
    (weight_prior_dist): ScaleMixturePrior()
    (bias_prior_dist): ScaleMixturePrior()
  )
  (relu2): ReLU()
  (blinear3): BayesianLinear(
    (weight_sampler): GaussianVariational()
    (bias_sampler): GaussianVariational()
    (weight_prior_dist): ScaleMixturePrior()
    (bias_prior_dist): ScaleMixturePrior()
  )
)

I'm training it on a dataset of flat/house prices I recently scraped, and I've encountered a problem I cannot seem to fully understand: after a few epochs, the loss returned by the model.sample_elbo method is sometimes equal to nan, which, when backpropagated, breaks the whole training, as some of the weights get 'optimized' to nans:

model_copy.sample_elbo(inputs=datapoints.to(device),
                       labels=labels.to(device),
                       criterion=criterion,
                       sample_nbr=3,
                       complexity_cost_weight=1/X_train.shape[0])

I managed to track down where the incorrect value first appears, before these nans are backpropagated, and it turned out that the log_prior of the first Bayesian layer is sometimes equal to -inf:

first_layer = list(model_copy.modules())[0].blinear1
first_layer.log_prior  # returns -inf

Going further, I checked that the problem is in weight_prior_dist, which roughly one time in five returns -inf:

w = first_layer.weight_sampler.sample()  # sampled weights
prior_dist = first_layer.weight_prior_dist
print(prior_dist.log_prior(w))  # sometimes returns -inf

Going deeper, I realised that the problem is in the prior_pdf of the first prior distribution in the first layer's weight_prior_dist. Some of the log-probabilities of the sampled weight values (prior_dist.dist1.log_prob(w)) are very small, around -100, and when passed through torch.exp such values are rounded down to 0. When these zero probabilities then go through torch.log in prior_dist.log_prior(w) they become -inf, so the mean also becomes -inf, which corrupts the further loss calculations:

prob_n1 = torch.exp(prior_dist.dist1.log_prob(w))  # minimal value of this tensor is 0
if prior_dist.dist2 is not None:
    prob_n2 = torch.exp(prior_dist.dist2.log_prob(w))

prior_pdf = prior_dist.pi * prob_n1 + (1 - prior_dist.pi) * prob_n2  # minimal value of this tensor is 0
(torch.log(prior_pdf)).mean()  # formula for the log_prior of weight_prior_dist; returns -inf
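
For completeness, the same mixture log-density can also be computed entirely in log-space, which avoids the underflow. This is just a sketch of an equivalent formulation (assuming a PyTorch version that has torch.logaddexp, that prior_dist.pi is a plain float with 0 < pi < 1, and that both mixture components exist), not how ScaleMixturePrior computes it internally:

import math
import torch

# Same quantity as torch.log(prior_pdf).mean(), but computed in log-space, so that
# log-probabilities around -100 are never exponentiated down to exactly 0.
lp1 = prior_dist.dist1.log_prob(w)
lp2 = prior_dist.dist2.log_prob(w)
stable_log_prior = torch.logaddexp(
    math.log(prior_dist.pi) + lp1,
    math.log(1 - prior_dist.pi) + lp2,
).mean()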

If I understand correctly, this means that the prior probabilities of such sampled weights are very, very small, approaching zero. Could you suggest a way of tackling this so that they remain very small but not exactly zero? Or is the problem something else entirely? I'm still learning the details of Bayesian DL, so I hope there aren't too many silly mistakes. Thank you for any kind of help! Best regards, Rafał

piEsposito commented 4 years ago

Hello Rafal, and thank you for using BLiTZ, for the feedback, and for reporting this issue.

Can you provide more details of your network (the parameters passed to the constructor of each Bayesian layer, and maybe the whole network class), so I can try it here and see how I can help you?

At first sight this might be related to the prior distribution parameters that were set, but I would like to be sure of that before drawing a conclusion.

Thank you. -Pi.

rafaljanwojcik commented 4 years ago

Thanks a lot for responding! Here is the code for my model (it is based on your Boston dataset example). Normally I train it on my own dataset, but I get the same error on Boston, so I used it here:

import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from tqdm.notebook import tqdm

from blitz.modules import BayesianLinear
from blitz.utils import variational_estimator

from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(np.expand_dims(y, -1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.25,
                                                    random_state=42)

X_train, y_train = torch.tensor(X_train).float(), torch.tensor(y_train).float()
X_test, y_test = torch.tensor(X_test).float(), torch.tensor(y_test).float()

@variational_estimator
class BayesianRegressor(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.blinear1 = BayesianLinear(input_dim, 512)
        self.relu = nn.ReLU()
        self.blinear2 = BayesianLinear(512, 256)
        self.relu2 = nn.ReLU()
        self.blinear3 = BayesianLinear(256, output_dim)

    def forward(self, x):
        x = self.blinear1(x)
        x = self.relu(x)
        x = self.blinear2(x)
        x = self.relu2(x)
        return self.blinear3(x)

def evaluate_regression(regressor,
                        X,
                        y,
                        samples = 100,
                        std_multiplier = 2):
    preds = [regressor(X) for i in range(samples)]
    preds = torch.stack(preds)
    means = preds.mean(axis=0)
    stds = preds.std(axis=0)
    ci_upper = means + (std_multiplier * stds)
    ci_lower = means - (std_multiplier * stds)
    ic_acc = (ci_lower <= y) * (ci_upper >= y)
    ic_acc = ic_acc.float().mean()
    return ic_acc, (ci_upper >= y).float().mean(), (ci_lower <= y).float().mean()

And here is the code for training:

device = torch.device('cpu')
gradient_clipping_norm = 1.25

regressor = BayesianRegressor(13, 1)
model_copy = copy.deepcopy(regressor)
regressor = regressor.to(device)
model_copy = model_copy.to(device)

optimizer = optim.Adam(regressor.parameters(), lr=0.01)
optimizer_copy = optim.Adam(regressor.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

ds_train = torch.utils.data.TensorDataset(X_train, y_train)
dataloader_train = torch.utils.data.DataLoader(ds_train, batch_size=16, shuffle=True)

ds_test = torch.utils.data.TensorDataset(X_test, y_test)
dataloader_test = torch.utils.data.DataLoader(ds_test, batch_size=16, shuffle=False)

for epoch in tqdm(range(100)):
    epoch_losses = []
    for i, (datapoints, labels) in enumerate(dataloader_train):
        optimizer_copy.load_state_dict(optimizer.state_dict())

        optimizer.zero_grad()

        loss = regressor.sample_elbo(inputs=datapoints.to(device),
                           labels=labels.to(device),
                           criterion=criterion,
                           sample_nbr=3,
                           complexity_cost_weight=1/X_train.shape[0])

        if torch.isnan(loss):
            print(loss.item())
            raise ValueError('loss in training loop went to nan - check parameters')

        # snapshot the current (pre-update) parameters into model_copy
        mp = list(regressor.parameters())
        mcp = list(model_copy.parameters())
        n = len(mp)
        for i in range(0, n):
            mcp[i].data[:] = mp[i].data[:]

        loss.backward()
        optimizer.step()

        epoch_losses.append(loss.cpu().detach().item())

    mean_loss = np.mean(epoch_losses)

    with torch.no_grad():
        ic_acc, under_ci_upper, over_ci_lower = evaluate_regression(regressor,
                                                                    X_test.to(device),
                                                                    y_test.to(device),
                                                                    samples=25,
                                                                    std_multiplier=1)
        print("Epoch: ", epoch)
        print("CI acc: {:.2f}, CI upper acc: {:.2f}, CI lower acc: {:.2f}".format(ic_acc, under_ci_upper, over_ci_lower))
        print("Loss: {:.4f}".format(mean_loss))

The model_copy and optimizer_copy code is there so I can investigate the state of the model from the moment just before the nans appeared. Training usually breaks because of nans around epochs 40-80. If you run the code below, it sometimes returns inf; all the code for further investigation is in the previous post :)

new_loss = model_copy.sample_elbo(inputs=datapoints.to(device),
                       labels=labels.to(device),
                       criterion=criterion,
                       sample_nbr=3,
                       complexity_cost_weight=1/X_train.shape[0])
new_loss

rafaljanwojcik commented 4 years ago

PS: Sometimes it's a layer other than blinear1 that has -inf in its log_prior attribute, so in the code:

first_layer = list(model_copy.modules())[0].blinear1
first_layer.log_prior # returns -inf

it can also be blinear2 or blinear3
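
For anyone debugging the same thing, a small diagnostic sketch along these lines prints which layer's log_prior went non-finite (it assumes the model has gone through at least one forward / sample_elbo call, so that log_prior is populated):

import torch
from blitz.modules import BayesianLinear

# Flag every Bayesian layer whose current log_prior is nan or +/-inf.
for name, module in model_copy.named_modules():
    if isinstance(module, BayesianLinear):
        lp = torch.as_tensor(module.log_prior)
        print(f"{name}: log_prior={lp.item():.4f} finite={torch.isfinite(lp).item()}")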

rafaljanwojcik commented 4 years ago

Hello, a little update: I've found that the weights sampled from the posterior distributions (with the GaussianVariational weight sampler in a Bayesian layer) are sometimes so far away from the prior distribution (with the default settings of mean 0 and variance 0.1, a sampled weight value of e.g. 1.4537) that their log-probability under the prior is around -104, and when such a small value is passed through torch.exp() it returns 0.0. This is the reason for the -inf values later on in the layers' log_prior and for the training breaking. I managed to solve the issue by increasing the sigma of the prior distribution, but would you suggest any other way of doing it? Thanks again for responding so quickly! 🙂
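
For reference, the prior can be widened through the BayesianLinear constructor; the keyword names below (prior_sigma_1, prior_sigma_2, prior_pi) are my reading of the constructor signature, so please double-check them against the installed BLiTZ version, and the values are only illustrative:

from blitz.modules import BayesianLinear

# Same first layer as in my model, but with a wider scale-mixture prior so that
# sampled weights around ~1.5 still get non-negligible prior density.
blinear1 = BayesianLinear(13, 512,
                          prior_sigma_1=1.0,
                          prior_sigma_2=0.5,
                          prior_pi=0.5)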

piEsposito commented 4 years ago

Hello and sorry for the late reply. In your NaN case, you should check whether the NaN is coming from the log likelihood of the weights relative to the prior distribution. If that's the case, then you should tune the parameters of that prior distribution. Note that you can use a Gaussian mixture model too.

In the case of torch.exp returning 0, you can add some very small number, such as 10e-6, to avoid zeros in the log prob. You can also clip the complexity cost for the first few iterations, or multiply it by some discount factor, so it does not explode.
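
In terms of the snippet from the first post, that epsilon workaround would look roughly like this (just a sketch; prior_dist and w are the variables from that snippet, and the exact epsilon is up to you):

import torch

# Add a small constant before the log so the mixture pdf stays strictly positive.
eps = 10e-6
prob_n1 = torch.exp(prior_dist.dist1.log_prob(w))
prob_n2 = torch.exp(prior_dist.dist2.log_prob(w))
prior_pdf = prior_dist.pi * prob_n1 + (1 - prior_dist.pi) * prob_n2
log_prior = torch.log(prior_pdf + eps).mean()  # stays finite even when prior_pdf underflows to 0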

If you are getting NaNs on the fitting cost, then the problem is either in the lib (in that case I will try to fix it) or in the data, loss function, etc.

Hope this is useful.

rafaljanwojcik commented 4 years ago

Yeah, I managed to solve this issue by increasing the variance of the prior mixture distribution - thank you for your answer!