ValueError: The value argument must be within the support

victor-yon commented 3 years ago

I found a weird bug that I was able to reproduce with the bayesian_LeNet_mnist example by reducing the number of parameters.

I have no idea why, but based on my tests it occurs randomly during training (though with the same seed it always triggers at the same iteration). I was only able to reproduce it with small networks; if I add more neurons or more layers, the problem never happens.

Error traceback

Traceback (most recent call last):
  File "networks/test.py", line 66, in <module>
    main()
  File "networks/test.py", line 40, in main
    loss = classifier.sample_elbo(inputs=datapoints.to(device),
  File "venv/lib/python3.8/site-packages/blitz/utils/variational_estimator.py", line 65, in sample_elbo
    outputs = self(inputs)
  File "venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "networks/test.py", line 28, in forward
    out = self.fc3(out)
  File "venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "venv/lib/python3.8/site-packages/blitz/modules/linear_bayesian_layer.py", line 93, in forward
    self.log_prior = self.weight_prior_dist.log_prior(w) + b_log_prior
  File "venv/lib/python3.8/site-packages/blitz/modules/weight_sampler.py", line 84, in log_prior
    prob_n1 = torch.exp(self.dist1.log_prob(w))
  File "venv/lib/python3.8/site-packages/torch/distributions/normal.py", line 73, in log_prob
    self._validate_sample(value)
  File "venv/lib/python3.8/site-packages/torch/distributions/distribution.py", line 277, in _validate_sample
    raise ValueError('The value argument must be within the support')
ValueError: The value argument must be within the support

Code

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.datasets as dsets
import torchvision.transforms as transforms
from blitz.modules import BayesianConv2d, BayesianLinear
from blitz.utils import variational_estimator

def main():
    train_dataset = dsets.MNIST(root="./cache", train=True, transform=transforms.ToTensor(), download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)

    test_dataset = dsets.MNIST(root="./cache", train=False, transform=transforms.ToTensor(), download=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=64, shuffle=True)

    @variational_estimator
    class BayesianCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc2 = BayesianLinear(784, 10)
            self.fc3 = BayesianLinear(10, 10)

        def forward(self, x):
            out = x.view(x.size(0), -1)
            out = F.relu(self.fc2(out))
            out = self.fc3(out)
            return out

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    classifier = BayesianCNN().to(device)
    optimizer = optim.Adam(classifier.parameters(), lr=0.001)
    criterion = torch.nn.CrossEntropyLoss()

    iteration = 0
    for epoch in range(100):
        for i, (datapoints, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            loss = classifier.sample_elbo(inputs=datapoints.to(device),
                                          labels=labels.to(device),
                                          criterion=criterion,
                                          sample_nbr=3,
                                          complexity_cost_weight=1 / 50000)
            # print(loss)
            loss.backward()
            optimizer.step()

            iteration += 1
            if iteration % 250 == 0:
                print(loss)
                correct = 0
                total = 0
                with torch.no_grad():
                    for data in test_loader:
                        images, labels = data
                        outputs = classifier(images.to(device))
                        _, predicted = torch.max(outputs.data, 1)
                        total += labels.size(0)
                        correct += (predicted == labels.to(device)).sum().item()
                print('Iteration: {} | Accuracy of the network on the 10000 test images: {} %'
                      .format(str(iteration), str(100 * correct / total)))

if __name__ == '__main__':
    main()

donhauser commented 3 years ago

I've had the same error, but with the BayesianGRU instead.

The problem is caused by computing log(0) in log_prior(w) of PriorWeightDistribution in weight_sampler.py. Since log(0) = -inf, some of the sampled weights end up as nan values, which then triggers the ValueError when they are passed to PyTorch's distributions library.

I fixed the problem on my side by changing PriorWeightDistribution.log_prior():

# OLD CODE
# prior_pdf can be zero due to numerics --> -inf possible
return (torch.log(prior_pdf) - 0.5).sum()

# NEW CODE
# adding a tiny number (1e-6 in my case) resolves the problem
return (torch.log(prior_pdf+1e-6) - 0.5).sum()
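
To make the failure mode concrete, here is a minimal sketch that imitates the structure of the mixture prior rather than reproducing the blitz source. The function names and the values of pi, sigma1, sigma2 and eps are assumptions chosen for illustration. It shows the naive version returning -inf and producing a nan gradient for a weight far in the tail, while the epsilon version stays finite. The third function uses torch.logsumexp, which is a different technique from the fix above and is included only as a possible alternative.

```python
import math

import torch
from torch.distributions import Normal

# Hypothetical mixture parameters, chosen for illustration only.
PI, SIGMA1, SIGMA2 = 0.5, 0.1, 0.002


def naive_log_prior(w, eps=0.0):
    """Scale-mixture log prior computed via exp() and log(), as in the thread."""
    prob_n1 = torch.exp(Normal(0.0, SIGMA1).log_prob(w))
    prob_n2 = torch.exp(Normal(0.0, SIGMA2).log_prob(w))
    # Both exp() calls underflow to 0 in float32 for weights far in the tail.
    prior_pdf = PI * prob_n1 + (1 - PI) * prob_n2
    return (torch.log(prior_pdf + eps) - 0.5).sum()


def logsumexp_log_prior(w):
    """Alternative (not from the thread): stay in log space the whole time."""
    log_n1 = Normal(0.0, SIGMA1).log_prob(w) + math.log(PI)
    log_n2 = Normal(0.0, SIGMA2).log_prob(w) + math.log(1 - PI)
    return (torch.logsumexp(torch.stack([log_n1, log_n2]), dim=0) - 0.5).sum()


def value_and_grad(fn, **kwargs):
    """Evaluate fn on a small weight vector with one extreme entry."""
    w = torch.tensor([0.05, 2.0], requires_grad=True)
    value = fn(w, **kwargs)
    value.backward()
    return value.detach(), w.grad


naive_value, naive_grad = value_and_grad(naive_log_prior)
eps_value, eps_grad = value_and_grad(naive_log_prior, eps=1e-6)
lse_value, lse_grad = value_and_grad(logsumexp_log_prior)

print("naive:    ", naive_value, naive_grad)
print("epsilon:  ", eps_value, eps_grad)
print("logsumexp:", lse_value, lse_grad)
```

The nan appears in the gradient, not in the loss itself: the backward pass multiplies 1/prior_pdf (inf) by the underflowed exp() result (0). One optimizer step then writes nan into the weights, and the next forward pass fails validation inside log_prob. Note that the epsilon variant stays finite but gives a zero gradient for the extreme weight, whereas the log-space variant keeps a usable gradient.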

piEsposito commented 3 years ago

@donhauser that's awesome. I always lose track of the points where we need to add those small values to keep numerical stability. If you want to PR I'll merge it.
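
For tracking down where non-finite values first show up, a small helper along these lines may be useful. This is only a sketch: nonfinite_parameter_names is a hypothetical name and not part of blitz or PyTorch. It relies on nn.Module.named_parameters and torch.isfinite.

```python
import torch
import torch.nn as nn


def nonfinite_parameter_names(model: nn.Module) -> list:
    """Return the names of parameters (or their gradients) containing nan/inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
        elif param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name + ".grad")
    return bad


# Usage: deliberately corrupt one weight to see it reported.
layer = nn.Linear(3, 2)
clean_report = nonfinite_parameter_names(layer)
with torch.no_grad():
    layer.weight[0, 0] = float("nan")
corrupted_report = nonfinite_parameter_names(layer)
print(clean_report, corrupted_report)
```

Calling it right after loss.backward() and again after optimizer.step() in the training loop would show whether the gradients or the weights went bad first, and at which iteration.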

piEsposito / blitz-bayesian-deep-learning

ValueError: The value argument must be within the support #82
