Closed rafaljanwojcik closed 4 years ago
Hello Rafal, and thank you for using BLiTZ, for the feedback, and for reporting this issue.
Can you provide more details about your network (the parameters passed to the constructor of each Bayesian layer, and maybe the whole network class), so I can try it here and see how I can help?
At first sight, this might be related to the prior distribution parameters, but I would like to be sure before drawing a conclusion.
Thank you. -Pi.
Thanks a lot for responding! Here is the code for my model, based on your example with the Boston dataset. Normally I train it on my own dataset, but I get the same error on Boston, so I used it here:
import copy  # needed for copy.deepcopy in the training code

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from tqdm.notebook import tqdm

from blitz.modules import BayesianLinear
from blitz.utils import variational_estimator

from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Standardize features and target, then split into train and test sets.
X, y = load_boston(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(np.expand_dims(y, -1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.25,
                                                    random_state=42)

X_train, y_train = torch.tensor(X_train).float(), torch.tensor(y_train).float()
X_test, y_test = torch.tensor(X_test).float(), torch.tensor(y_test).float()


@variational_estimator
class BayesianRegressor(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.blinear1 = BayesianLinear(input_dim, 512)
        self.relu = nn.ReLU()
        self.blinear2 = BayesianLinear(512, 256)
        self.relu2 = nn.ReLU()
        self.blinear3 = BayesianLinear(256, output_dim)

    def forward(self, x):
        x = self.blinear1(x)
        x = self.relu(x)
        x = self.blinear2(x)
        x = self.relu2(x)
        return self.blinear3(x)


def evaluate_regression(regressor,
                        X,
                        y,
                        samples = 100,
                        std_multiplier = 2):
    # Each forward pass samples new weights, so repeated calls give a predictive distribution.
    preds = [regressor(X) for _ in range(samples)]
    preds = torch.stack(preds)
    means = preds.mean(axis=0)
    stds = preds.std(axis=0)
    ci_upper = means + (std_multiplier * stds)
    ci_lower = means - (std_multiplier * stds)
    # Fraction of targets that fall inside the interval.
    ic_acc = (ci_lower <= y) * (ci_upper >= y)
    ic_acc = ic_acc.float().mean()
    return ic_acc, (ci_upper >= y).float().mean(), (ci_lower <= y).float().mean()
And here is the code for training:
device = torch.device('cpu')
gradient_clipping_norm = 1.25

regressor = BayesianRegressor(13, 1)
model_copy = copy.deepcopy(regressor)
regressor = regressor.to(device)
model_copy = model_copy.to(device)

optimizer = optim.Adam(regressor.parameters(), lr=0.01)
# The snapshot optimizer is bound to model_copy's parameters, not regressor's.
optimizer_copy = optim.Adam(model_copy.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

ds_train = torch.utils.data.TensorDataset(X_train, y_train)
dataloader_train = torch.utils.data.DataLoader(ds_train, batch_size=16, shuffle=True)

ds_test = torch.utils.data.TensorDataset(X_test, y_test)
dataloader_test = torch.utils.data.DataLoader(ds_test, batch_size=16, shuffle=False)

for epoch in tqdm(range(100)):
    epoch_losses = []
    for i, (datapoints, labels) in enumerate(dataloader_train):
        # Snapshot the optimizer state from before this update.
        optimizer_copy.load_state_dict(optimizer.state_dict())
        optimizer.zero_grad()

        loss = regressor.sample_elbo(inputs=datapoints.to(device),
                                     labels=labels.to(device),
                                     criterion=criterion,
                                     sample_nbr=3,
                                     complexity_cost_weight=1/X_train.shape[0])

        if torch.isnan(loss):
            print(loss.item())
            raise ValueError('loss in training loop went to nan - check parameters')

        # Snapshot the current weights into model_copy before they are updated.
        with torch.no_grad():
            for p_copy, p in zip(model_copy.parameters(), regressor.parameters()):
                p_copy.copy_(p)

        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())

    mean_loss = np.mean(epoch_losses)
    with torch.no_grad():
        ic_acc, under_ci_upper, over_ci_lower = evaluate_regression(regressor,
                                                                    X_test.to(device),
                                                                    y_test.to(device),
                                                                    samples=25,
                                                                    std_multiplier=1)

    print("Epoch: ", epoch)
    print("CI acc: {:.2f}, CI upper acc: {:.2f}, CI lower acc: {:.2f}".format(ic_acc, under_ci_upper, over_ci_lower))
    print("Loss: {:.4f}".format(mean_loss))
The code for model_copy and optimizer_copy is there to inspect the state of the model from the moment just before the NaNs appeared. Training usually breaks because of NaNs around epochs 40-80. If you run the code below, it sometimes returns inf, and all the code for further investigation is in the previous post :)
new_loss = model_copy.sample_elbo(inputs=datapoints.to(device),
                                  labels=labels.to(device),
                                  criterion=criterion,
                                  sample_nbr=3,
                                  complexity_cost_weight=1/X_train.shape[0])
new_loss
PS: The layer with -inf in its log_prior attribute is not always blinear1, so in this code:
first_layer = list(model_copy.modules())[0].blinear1
first_layer.log_prior # returns -inf
it can also be blinear2 or blinear3
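To see which layer is affected without checking each one by hand, a small helper can walk the model. This is only a sketch: it relies on the log_prior attribute shown above and on nn.Module.named_modules(); the helper name nonfinite_log_priors and the stand-in FakeLayer class (used so the example runs without BLiTZ) are made up for illustration.

```python
import torch
import torch.nn as nn


def nonfinite_log_priors(model):
    """Return {module_name: log_prior} for submodules whose log_prior is inf or nan."""
    bad = {}
    for name, module in model.named_modules():
        value = getattr(module, "log_prior", None)
        if value is None:
            continue  # not a layer that tracks a log-prior
        value = torch.as_tensor(value, dtype=torch.float32)
        if not torch.isfinite(value).all():
            bad[name] = value
    return bad


class FakeLayer(nn.Module):
    """Hypothetical stand-in for a Bayesian layer: it only carries a log_prior attribute."""

    def __init__(self, log_prior):
        super().__init__()
        self.log_prior = torch.tensor(log_prior)


model = nn.Sequential(FakeLayer(-12.3), FakeLayer(float("-inf")))
print(nonfinite_log_priors(model))  # only the second layer (named "1") is reported
```

With the model from this thread, the call would presumably be nonfinite_log_priors(model_copy) right after model_copy.sample_elbo(...) has run.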
Hello, a little update:
I've found that weights sampled from the posterior distributions (with the GaussianVariational weight sampler in the Bayesian layer) are sometimes so far from their prior distribution that the log-probability of their being sampled from the prior is about -104. This happens with the default settings (mean 0, standard deviation 0.1) and a sampled weight value of e.g. 1.4537. When such a small value is passed through torch.exp() it returns 0.0, and this is the reason for the inf values that later appear in the layers' log_prior and break the training. I managed to solve this issue by increasing the sigma of the prior distribution, but would you suggest any other way of doing it?
Thanks again for responding so quickly! 🙂
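The numbers in this update can be checked by hand. The sketch below is plain Python with no BLiTZ or PyTorch calls. It assumes the prior is a single zero-mean Gaussian with standard deviation 0.1 and uses the sampled weight 1.4537 quoted above; the helper names gaussian_log_pdf and to_float32 and the wider sigma of 0.4 are chosen for illustration only.

```python
import math
import struct


def gaussian_log_pdf(w, mu=0.0, sigma=0.1):
    """Log-density of N(mu, sigma^2) evaluated at w."""
    z = (w - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


w = 1.4537                       # sampled weight quoted in the post
log_p = gaussian_log_pdf(w)      # roughly -104.3
p64 = math.exp(log_p)            # tiny but still non-zero in float64
p32 = to_float32(p64)            # below the smallest float32 subnormal, so 0.0

print(f"log p(w)      = {log_p:.2f}")
print(f"p(w), float64 = {p64:.3e}")
print(f"p(w), float32 = {p32}")

# A wider prior (sigma = 0.4, an assumed value) keeps the same weight representable.
log_p_wide = gaussian_log_pdf(w, sigma=0.4)
print(f"log p(w), sigma=0.4 = {log_p_wide:.2f}")
```

Since PyTorch tensors default to float32, this is consistent with torch.exp() returning 0.0 for such a log-probability, and with a larger sigma making the problem go away.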
Hello, and sorry for the late reply. For your NaN case, you should check whether the NaN comes from the log-likelihood of the weights under the prior distribution. If that's the case, then you should tune the parameters of that prior distribution. Note that you can use a Gaussian mixture model too.
For the case of torch.exp returning 0, you can add a very small number, such as 10e-6, to avoid zeros in the log prob. You can also clip the complexity cost on the first few iterations, or multiply it by some discount factor, so it does not explode.
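Here is one way to read the small-constant suggestion, next to a log-sum-exp alternative that never leaves log space. This is a sketch under assumptions, not BLiTZ's actual implementation: the function names, the mixture parameters (pi, sigma1, sigma2) and the epsilon value are chosen for illustration. Only the torch calls themselves (torch.distributions.Normal, torch.exp, torch.log, torch.logsumexp) are real APIs.

```python
import math

import torch
from torch.distributions import Normal


def naive_log_prior(w, pi=0.5, sigma1=0.1, sigma2=0.4, eps=0.0):
    """Mixture log-prior computed via exp then log; underflows to -inf when eps == 0."""
    p1 = torch.exp(Normal(0.0, sigma1).log_prob(w))
    p2 = torch.exp(Normal(0.0, sigma2).log_prob(w))
    return torch.log(pi * p1 + (1.0 - pi) * p2 + eps)


def stable_log_prior(w, pi=0.5, sigma1=0.1, sigma2=0.4):
    """Same mixture evaluated entirely in log space (requires 0 < pi < 1)."""
    log_p1 = math.log(pi) + Normal(0.0, sigma1).log_prob(w)
    log_p2 = math.log(1.0 - pi) + Normal(0.0, sigma2).log_prob(w)
    return torch.logsumexp(torch.stack([log_p1, log_p2]), dim=0)


w = torch.tensor([0.05, 1.5, 8.0])   # 8.0 lies far in the tail of both components
print(naive_log_prior(w))              # last entry is -inf
print(naive_log_prior(w, eps=1e-6))    # finite, but floored near log(1e-6)
print(stable_log_prior(w))             # finite for all three weights
```

The epsilon version is the smaller change, but once the density falls far below epsilon the log-prior is essentially flat, so it no longer pulls a runaway weight back toward the prior. The log-space version keeps that gradient.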
If you are getting NaNs in the fitting cost, then the problem is either in the lib (in which case I will try to fix it) or in the data, loss function, etc.
Hope this is useful.
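The discount-factor idea from this reply could be sketched as a warm-up schedule for the value passed as complexity_cost_weight to sample_elbo. The helper name, the linear ramp and the warm-up length are assumptions for illustration and are not part of BLiTZ.

```python
def complexity_weight(step, n_train, warmup_steps=500):
    """Linearly ramp the complexity-cost weight from 0 up to 1 / n_train."""
    if warmup_steps <= 0:
        return 1.0 / n_train
    discount = min(1.0, step / warmup_steps)
    return discount / n_train


# Example schedule for a hypothetical training set of 379 rows.
for step in (0, 250, 500, 1000):
    print(step, complexity_weight(step, n_train=379))
```

In the training loop from this thread, the result would replace the constant 1/X_train.shape[0], with step counting optimizer updates. Note that this only tames a large but finite complexity cost: a weight of 0 multiplied by an infinite cost is still nan, so the -inf log-prior itself has to be fixed at the source.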
Yeah, I managed to solve this issue by increasing the variance of the prior mixture distribution. Thank you for your answer!
Hello, first of all amazing work, and thank you for this project! I'm trying to train a simple 3-layer NN and I've encountered some problems I wanted to ask about. Here is my model:
I'm training it on a dataset of flat/house prices that I recently scraped, and I've run into a problem I can't fully understand: after a few epochs, the loss returned by the model.sample_elbo method is sometimes nan, which breaks the whole training when backpropagated, as some of the weights are 'optimized' to nans:
I managed to track down where the incorrect values first appear, before these nans are backpropagated, and it turned out that the value of log_prior in the first Bayesian layer is sometimes -inf.
Going further, I found that the problem is in weight_prior_dist, which sometimes (roughly one time in five) returns -inf:
Going deeper, I realised that the problem is in the prior_pdf of the first prior distribution in the weight_prior_dist of the first layer. Some of the log-probabilities of the sampled weight values (prior_dist.dist1.log_prob(w)) are very small, around -100, and when passed through torch.exp such small values are rounded to 0. When these zero probabilities go through torch.log in prior_dist.log_prior(w) they become -inf, so the whole mean becomes -inf, which corrupts the subsequent loss calculations:
If I understand correctly, this means that the probabilities of such sampled weights under the prior distribution are very, very small, approaching zero. Could you suggest a way of tackling this problem so that they remain very small but not zero? Or maybe the problem is something else? I'm still learning the details of Bayesian DL, so I hope there aren't too many silly mistakes, and thank you for any kind of help!
Best regards,
Rafał