piEsposito / blitz-bayesian-deep-learning

A simple and extensible library to create Bayesian Neural Network layers on PyTorch.
GNU General Public License v3.0

Blitz Bayesian model wrapped in DataParallel not using multiple GPUs #84

Open jsanket123 opened 3 years ago

jsanket123 commented 3 years ago

I have been trying to use the Bayesian linear regression example provided by the Blitz authors and parallelize the model by wrapping it in torch.nn.DataParallel. However, the code only uses one GPU rather than multiple GPUs. Below is the same code from the bayesian_regression_boston.py example, with the model wrapped in DataParallel.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

from blitz.modules import BayesianLinear
from blitz.utils import variational_estimator

from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(np.expand_dims(y, -1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.25,
                                                    random_state=42)

X_train, y_train = torch.tensor(X_train).float(), torch.tensor(y_train).float()
X_test, y_test = torch.tensor(X_test).float(), torch.tensor(y_test).float()

@variational_estimator
class BayesianRegressor(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        #self.linear = nn.Linear(input_dim, output_dim)
        self.blinear1 = BayesianLinear(input_dim, 512)
        self.blinear2 = BayesianLinear(512, output_dim)

    def forward(self, x):
        print("\tIn Model: input size", x.size())
        x_ = self.blinear1(x)
        x_ = F.relu(x_)
        return self.blinear2(x_)

def evaluate_regression(regressor,
                        X,
                        y,
                        samples = 100,
                        std_multiplier = 2):
    preds = [regressor(X) for i in range(samples)]
    preds = torch.stack(preds)
    means = preds.mean(axis=0)
    stds = preds.std(axis=0)
    ci_upper = means + (std_multiplier * stds)
    ci_lower = means - (std_multiplier * stds)
    ic_acc = (ci_lower <= y) * (ci_upper >= y)
    ic_acc = ic_acc.float().mean()
    return ic_acc, (ci_upper >= y).float().mean(), (ci_lower <= y).float().mean()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
regressor = BayesianRegressor(13, 1).to(device)
optimizer = optim.Adam(regressor.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  regressor = nn.DataParallel(regressor)
regressor.to(device)

ds_train = torch.utils.data.TensorDataset(X_train, y_train)
dataloader_train = torch.utils.data.DataLoader(ds_train, batch_size=16, shuffle=True)

ds_test = torch.utils.data.TensorDataset(X_test, y_test)
dataloader_test = torch.utils.data.DataLoader(ds_test, batch_size=16, shuffle=True)

iteration = 0
for epoch in range(100):
    for i, (datapoints, labels) in enumerate(dataloader_train):
        optimizer.zero_grad()

        print("Outside: input size", datapoints.size())
        loss = regressor.module.sample_elbo(inputs=datapoints.to(device),
                           labels=labels.to(device),
                           criterion=criterion,
                           sample_nbr=1,
                           complexity_cost_weight=1/X_train.shape[0])
        loss.backward()
        optimizer.step()

        iteration += 1
        if iteration%100==0:
            ic_acc, under_ci_upper, over_ci_lower = evaluate_regression(regressor,
                                                                        X_test.to(device),
                                                                        y_test.to(device),
                                                                        samples=25,
                                                                        std_multiplier=3)

            print("CI acc: {:.2f}, CI upper acc: {:.2f}, CI lower acc: {:.2f}".format(ic_acc, under_ci_upper, over_ci_lower))
            print("Loss: {:.4f}".format(loss))

Below I provide a portion of the output. I print the input dimensions before calling regressor.module.sample_elbo, and those should be batch_size x num_variables (which is printed correctly). However, inside the model's forward pass it should print the dimensions of the smaller sub-batch taken by each GPU, so with 8 GPUs it should have printed 8 lines of 'In Model: input size torch.Size([2, 13])'. Apparently, the data is only being put on one GPU.

Could you please let me know what needs to be done to make this work on multiple GPUs? FYI: I used the methods suggested here to check whether multiple GPUs are actually being used to process the input batch.

Let's use 8 GPUs!
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
Outside: input size torch.Size([16, 13])
        In Model: input size torch.Size([16, 13])
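
[Editor's note] Part of what the output shows may come from the `regressor.module.sample_elbo(...)` call itself: `.module` returns the original, unwrapped model, so anything called on it bypasses DataParallel's scatter/replicate machinery entirely and runs on a single device. A minimal sketch of this behavior, using a plain `nn.Linear` stand-in rather than the Blitz model (it runs on CPU too, where DataParallel simply falls through to the wrapped module):

```python
import torch
import torch.nn as nn

# Stand-in for the Bayesian regressor; any nn.Module shows the same behavior.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(13, 1)

    def forward(self, x):
        return self.fc(x)

model = TinyNet()
dp_model = nn.DataParallel(model)

# `.module` is just the original model, with no parallelism attached:
assert dp_model.module is model

x = torch.randn(16, 13)
out = dp_model(x)                  # goes through DataParallel.forward (scatter/gather)
out_bypassed = dp_model.module(x)  # single-device forward, identical to model(x)
```

So only calls made through the wrapper itself (`dp_model(x)`) are parallelized; `sample_elbo` reached via `.module` never is.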
piEsposito commented 3 years ago

That is still not on our roadmap, and I think the probabilistic layers do not implement all the interfaces needed to comply with Torch DataParallel. If you want to work on that, it would be much appreciated.

jsanket123 commented 3 years ago

Hey, thanks for getting back to me. I was able to implement my model in a parallelized way. It definitely takes a lot more restructuring than what we normally do for a single GPU. This is my current work, so I will not be releasing it until I go through the review process for my paper.
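
[Editor's note] The restructuring was not shared, but one commonly used pattern (a sketch, not a Blitz-provided feature) is to move the loss computation into a `forward()` so that DataParallel can scatter the batch and each replica computes the loss on its own shard. A generic version, with a plain model and MSE standing in for the Bayesian regressor and `sample_elbo`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical wrapper: DataParallel only parallelizes work done inside
# forward(), so a module whose forward() returns the loss lets each replica
# compute the loss for its shard of the batch.
class LossWrapper(nn.Module):
    def __init__(self, model, loss_fn):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn  # called as loss_fn(model, inputs, labels)

    def forward(self, x, y):
        # Returns a scalar loss; with multiple GPUs, DataParallel gathers
        # one scalar per replica into a vector.
        return self.loss_fn(self.model, x, y)

# With Blitz, loss_fn could instead call m.sample_elbo(inputs=x, labels=y, ...)
# if the layers supported replication; MSE is used here for illustration.
model = nn.Linear(13, 1)
wrapped = nn.DataParallel(
    LossWrapper(model, lambda m, x, y: F.mse_loss(m(x), y)))

x, y = torch.randn(16, 13), torch.randn(16, 1)
loss = wrapped(x, y).mean()  # .mean() collapses the per-replica losses
loss.backward()
```

Whether this works with Blitz in particular depends on the maintainer's point above: the Bayesian layers would still need to survive DataParallel's replicate step.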