Functorch Usecase Jacobian computation

nilsleh commented 2 years ago

Hi, Thank you for this exciting work. I will try to explain a use case that I hope will be possible with functorch because the naive way is just extremely slow.

Setting: The use case follows this paper about Bayesian Deep Learning via a Laplace approximation to the weights of a subnetwork. While the paper and in fact most papers about LaPlace approximation only test on simple regression or classification tasks, I hope to make this work reasonably for a convolutional layer block and hence higher dimensions. Thus consider the following:

Assume an feature encoding from a convolutional layer, that is supposed to go through another convolutional block to produce a final model output.

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(ConvBlock, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, 1),
            nn.Conv2d(in_ch, in_ch, 3, 1),
            nn.Conv2d(in_ch, out_ch, 3, 1)
            )
    def forward(self, x):
        return self.conv(x)

We can furthermore define,

batch_size = 2
in_channels = 5
out_channels = 20
feature_shape = 8
feature = torch.rand(batch_size, in_channels, feature_shape, feature_shape)

With the Laplace approximation I am interested in the Jacobian of the model outputs w.r.t. to the model weights for each input sample, and in this case of only a much smaller subset of weights in accordance with the subnet approximation.

Standard PyTorch only allows the .backward() call on a single element tensor, and thus a naive approach is to loop over the outputs elements and iteratively call .backward(retain_graph=True) and clearing accumulated gradients after each sample, to obtain a Jacobian for all network parameters and lastly index this weight vector with the relevant indices to obtain a much smaller subnet weight vector that is actually required to do the bayesian approximation. This is, however, inefficient and time consuming, but importantly does not yield memory errors.

I have tried to use functorch to do this computation more efficiently and attempted the following based on the Per-Sample-Gradient tutorial:

model = ConvBlock(in_channels, out_channels)
fmodel, params, buffers = make_functional_with_buffers(model)

def compute_output_stateless_model(params, buffers, feature):
    batch = feature.unsqueeze(0)
    output = fmodel(params, buffers, batch)
    output = output.view(batch.shape[0], -1, 8)
    return output

ft_compute_grad = jacrev(compute_output_stateless_model)
ft_compute_sample_grad = vmap(ft_compute_grad, in_dims=(None, None, 0))
ft_per_sample_grads = ft_compute_sample_grad(params, buffers, feature)

This works fine for the small example, but fails for more realistic convolutional network sizes and feature dimensions, because the Jacobian is computed with respect to all network weights, when I am only really interested in a much small subset that I need to do the LaPlace approximation. So somehow something like

ft_per_sample_grads = ft_compute_sample_grad(params[index only relavant params], buffers, feature)

in order to compute the gradient only for weights I am interested and hence does not fail due to memory.

I was hoping that someone could comment with a suggestion on this use case as I am by no means and expert in functorch and might miss a possible approach that is feasible. Thanks in advance.

zou3519 commented 2 years ago

Thanks for the issue, @nilsleh. If you are only interested in computing quantities for specific parameter tensors, functorch.jacrev computes the jacobian w.r.t. to the first argument it is passed. So if I am understanding your use case correctly, we should only pass the parameters you want to compute the jacobian for as the first argument.

Here's an example, assuming that the relevant param indices are relevant_param_indices = (0, 2, 3):

import torch
import torch.nn as nn
from functorch import vmap, jacrev, make_functional_with_buffers

batch_size = 2
in_channels = 5
out_channels = 20
feature_shape = 8
feature = torch.rand(batch_size, in_channels, feature_shape, feature_shape)

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(ConvBlock, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, 1),
            nn.Conv2d(in_ch, in_ch, 3, 1),
            nn.Conv2d(in_ch, out_ch, 3, 1)
            )
    def forward(self, x):
        return self.conv(x)

model = ConvBlock(in_channels, out_channels)
fmodel, params, buffers = make_functional_with_buffers(model)

# NB: the following code assumes that the indices are unique and sorted
relevant_param_indices = (0, 2, 3)

def split(params, relevant_param_indices):
    relevant_params = []
    other_params = []
    for i, param in enumerate(params):
        if i in relevant_param_indices:
            relevant_params.append(param)
        else:
            other_params.append(param)
    return tuple(relevant_params), tuple(other_params)

def combine(relevant_params, other_params, relevant_param_indices):
    relevant_params_iter = iter(relevant_params)
    other_params_iter = iter(other_params)
    num_total_params = len(relevant_params) + len(other_params)
    params = []
    for i in range(num_total_params):
        if i in relevant_param_indices:
            params.append(next(relevant_params_iter))
        else:
            params.append(next(other_params_iter))
    return tuple(params)

def compute_output_stateless_model(relevant_params, other_params, buffers, feature):
    params = combine(relevant_params, other_params, relevant_param_indices)
    batch = feature.unsqueeze(0)
    output = fmodel(params, buffers, batch)
    output = output.view(batch.shape[0], -1, 8)
    return output

relevant_params, other_params = split(params, relevant_param_indices)

ft_compute_grad = jacrev(compute_output_stateless_model)
ft_compute_sample_grad = vmap(ft_compute_grad, in_dims=(None, None, None, 0))
ft_per_sample_grads = ft_compute_sample_grad(relevant_params, other_params, buffers, feature)

nilsleh commented 2 years ago

Thank you very much for your reply @zou3519. I'm sorry I was not specific enough in my description. For the Jacobian computation, I am actually interested at an even more specific level, so single parameter weights across the conv layer. Still, your approach with splitting and combining the input is also possible for this and I have made a gist where relevant_param_indices holds indices to index a flattened parameter vector of all parameters within ConvBlock. Thank you for that suggested approach.

I have another question/layer of complexity that was neglected in the described problem, which is whether this would also work for a list of features that are being passed into a model:

So ConvBlock becomes:

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(ConvBlock, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, 1),
            nn.Conv2d(in_ch, in_ch, 3, 1),
            nn.Conv2d(in_ch, out_ch, 3, 1),
        )

    def forward(self, x_list):
        batch_size = x_list[0].shape[0]
        final_out = [self.conv(x).view(batch_size, -1, 8) for x in x_list]
        return torch.cat(final_out, dim=1)

and feature is now a feature list where each element is still a batch tensor:

features = [
    torch.rand(batch_size, in_channels, feature_shape, feature_shape),
    torch.rand(batch_size, in_channels, feature_shape-10, feature_shape-10),
]

After the model outputs, there is another operation to reduce the number of final outputs and only on this quantity do I need to compute the Jacobian of that output with respect to the relevant params.

However, then I am running into issues, because output = fmodel(params, buffers, batch) has to accept a list as its last argument. And I don't know how to change compute_output_stateless_model

def compute_output_stateless_model(relevant_params, other_params, buffers, feature_list):
    params = combine(relevant_params, other_params, relevant_param_indices)
    batch = [feature.unsqueeze(0) for feature in feature_list] 
    output = fmodel(params, buffers, batch) # breaks here
    smaller_output = some_operation(output)
    return smaller_output

Thank you in advance.

Edit: After some tinkering I suppose one can compute gradients for each batch seperately, by calling ft_grads = ft_compute_grad(relevant_params, other_params, buffers, _singlefeature) so skipping the vmap, because then compute_output_stateless_model can be called with a list of feature tensors and still only compute jacobian for a subset of parameters. But I still wonder if there is a vmap approach?

zou3519 commented 2 years ago

Do I understand correctly that you're performing a for-loop over the following, and wish to do a vmap instead?

features = [
    torch.rand(batch_size, in_channels, feature_shape, feature_shape),
    torch.rand(batch_size, in_channels, feature_shape-10, feature_shape-10),
]

vmap requires that the data being vmapped over fits into a single Tensor, but the above two tensors have different sizes/shapes, so this isn't currently possible. But would be an interesting application of vmap over something like a MaskedTensor (cc @george-qi @cpuhrsch)

nilsleh commented 2 years ago

Yes, so an example would be a neural net that computes features at different resolutions or scales that subsequently get processed together for a final output.

cpuhrsch commented 2 years ago

We now have support for NestedTensors in core. That means you could construct a NestedTensor via

torch.nested_tensor([
    torch.rand(batch_size, in_channels, feature_shape, feature_shape),
    torch.rand(batch_size, in_channels, feature_shape-10, feature_shape-10),
])

Having said that, operator coverage is still pretty minimal (for example, convolutions aren't support just yet). However, is this something you had in mind?

nilsleh commented 2 years ago

That looks like a potential solution at some point, once convolutions are supported. I will keep an eye out.

cpuhrsch commented 2 years ago

cc @jbschlosser

zou3519 commented 2 years ago

@cpuhrsch does one need to use vmap over nestedtensor in this case? With a nested tensor like the following:

torch.nested_tensor([
    torch.rand(batch_size, in_channels, feature_shape, feature_shape),
    torch.rand(batch_size, in_channels, feature_shape-10, feature_shape-10),
])

is it 4-dimensional or 5-dimensional? If it is 5-dimensional, does F.conv2d work on it? (If not, then we may need to use vmap...)

cpuhrsch commented 2 years ago

@zou3519 - that would be 5 dimensional. conv2d requires 4-dim inputs, so it wouldn't be able to accept it. I think in general vmap support for NestedTensor is a great idea (we recently landed reshape and transpose, so it should be much closer to supported now). It's also straightforward to provide a bad conv2d kernel (via loops) just to gain coverage, but an efficient conv2d kernel isn't trivial.

jbschlosser commented 2 years ago

@cpuhrsch this would be 4-dimensional though:

torch.nested_tensor([
    torch.rand(in_channels, feature_shape, feature_shape),
    ...
    torch.rand(in_channels, feature_shape-10, feature_shape-10),
    ...
])

i.e. use the NT's implicit batch size instead of representing an explicit batch size in the constituents. But I think F.conv2d still won't work on this at the moment.

pytorch / functorch

Functorch Usecase Jacobian computation #943