pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

Backward pass with Opacus and ViT model. Batching rule not implemented for aten::_chunk_grad_outputs_efficient_attention. #595

Closed sebasrb09 closed 1 month ago

sebasrb09 commented 1 year ago

🐛 Bug

Hi everyone! I'm trying to train a private Vision Transformer (ViT) model. The idea is to load a pretrained model and then fine-tune it privately on CIFAR10, using Opacus to apply differential privacy during the fine-tuning. The problem is that during the backward pass I get the following error:

 File "/.local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 69, in __call__
    return self.hook(module, *args, **kwargs)
  File "/opacus/opacus/grad_sample/grad_sample_module.py", line 321, in capture_backprops_hook
    grad_samples = grad_sampler_fn(module, activations, backprops)
  File "/opacus/opacus/grad_sample/functorch.py", line 55, in ft_compute_per_sample_gradient
    per_sample_grads = layer.ft_compute_sample_grad(parameters, activations, backprops)
  File "/.local/lib/python3.10/site-packages/torch/_functorch/vmap.py", line 434, in wrapped
    return _flat_vmap(
  File "/.local/lib/python3.10/site-packages/torch/_functorch/vmap.py", line 39, in fn
    return f(*args, **kwargs)
  File "/.local/lib/python3.10/site-packages/torch/_functorch/vmap.py", line 619, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
  File "/.local/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py", line 1380, in wrapper
    results = grad_and_value(func, argnums, has_aux=has_aux)(*args, **kwargs)
  File "/.local/lib/python3.10/site-packages/torch/_functorch/vmap.py", line 39, in fn
    return f(*args, **kwargs)
  File "/.local/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py", line 1267, in wrapper
    flat_grad_input = _autograd_grad(flat_outputs, flat_diff_args, create_graph=True)
  File "/.local/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py", line 113, in _autograd_grad
    grad_inputs = torch.autograd.grad(diff_outputs, inputs, grad_outputs,
  File "/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Batching rule not implemented for aten::_chunk_grad_outputs_efficient_attention. We could not generate a fallback.
srun: error: r01g01: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=17652576.0

This only happens when training on a GPU. When I train locally or in Colab on CPU, it works fine. I ran it both on a cluster with a GPU and in Colab with a GPU, and both failed with the same error message.

Any idea why this might be happening? What puzzles me the most is that it works well on CPU but not on GPU. Maybe it has to do with the ViT architecture and some layers need to be frozen? The same setup also works fine with a ResNet model.
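
For reference, the training setup looks roughly like the sketch below. This is a minimal reconstruction, not the exact Colab code: the model name (a timm ViT), batch size, and privacy parameters are placeholders I picked to make it self-contained.

```python
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from opacus import PrivacyEngine

device = torch.device("cuda")  # with "cpu" the same code runs fine

# Pretrained ViT with a fresh 10-class head for CIFAR10 (placeholder model choice)
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10).to(device)

transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_loader = DataLoader(
    datasets.CIFAR10("./data", train=True, download=True, transform=transform),
    batch_size=64,
)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Wrap model, optimizer and data loader with Opacus (placeholder privacy parameters)
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()  # RuntimeError is raised here on GPU
    optimizer.step()
    break
```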

Here is the Colab to reproduce the problem: Colab.

To Reproduce

  1. Select GPU as the accelerator in runtime in Colab
  2. In the last cell, try it with GPU
  3. To see if it works on CPU, uncomment the line: device = 'cpu'

Local Environment (same as in Colab):

Additional context

I also don't know whether this report should go here in Opacus or directly in PyTorch. Thanks in advance for the guidance!
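
One more idea I have not tested, so treat it as an assumption rather than a diagnosis: the failing op aten::_chunk_grad_outputs_efficient_attention belongs to the backward of the memory-efficient scaled_dot_product_attention kernel, and PyTorch only selects that kernel on CUDA (the CPU path uses the plain math implementation), which would match the GPU-only failure. If that is the cause, forcing the math backend before training might sidestep the missing vmap batching rule:

```python
import torch

# Untested idea: disable the fused SDPA backends so that
# F.scaled_dot_product_attention falls back to the math implementation,
# which does not go through the efficient-attention backward op.
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```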

HuanyuZhang commented 2 months ago

Sorry for the late reply. Could you please let me know the latest status of this? If it is still blocked, I will take a look at it. Thx!