pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

Training a simple transformer model with Opacus produces a runtime error due to a dimension mismatch #666

Closed: nhianK closed this issue 1 month ago

nhianK commented 2 months ago

I am trying to train a model with Opacus. Link to the original model: https://github.com/jdxyw/deepKT/blob/master/deepkt/model/saint.py. I replaced MultiheadAttention with DPMultiheadAttention. This issue was brought up before in #505. I printed the dimensions of the per-sample norms (in the last cell), but I could not determine the exact issue. Here is a link to reproduce the error: https://colab.research.google.com/drive/11wf7tEUOOlWHoMcw2jP2Zlf6ooPGoGbx?usp=sharing.
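For context, the substitution in question looks roughly like the following (a minimal sketch; the hyperparameter values are placeholders, not taken from the linked notebook):

```python
import torch.nn as nn
from opacus.layers import DPMultiheadAttention

embed_dim, num_heads, dropout = 128, 8, 0.1  # placeholder values

# Original layer (Opacus cannot compute per-sample gradients for it):
# attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)

# Drop-in replacement with the same constructor arguments:
attn = DPMultiheadAttention(embed_dim, num_heads, dropout=dropout)
```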

traceback:

```
in ()
----> 1 run(learning_rate, batch_size, num_skill, embed_dim, dropout, num_heads, num_enc, epoch, num_worker, max_gradient_norm, delta, epsilon)

5 frames

in run(learning_rate, batch_size, num_skill, embed_dim, dropout, num_heads, num_enc, epoch, num_worker, max_gradient_norm, delta, epsilon)
     45
     46     for epoch in range(epoch):
---> 47         train_epoch(saint, train_dataloader, optimizer, loss_func,
     48                     device)
     49     if privacy_engine:

in train_epoch(model, train_iterator, optim, criterion, device)
     31         if per_sample_grad is not None:
     32             print(f"Per-sample gradient shape for {name}: {per_sample_grad.shape}")
---> 33     optim.step()
     34
     35

/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
     73             instance._step_count += 1
     74             wrapped = func.__get__(instance, cls)
---> 75             return wrapped(*args, **kwargs)
     76
     77     # Note that the returned function here is no longer a bound method,

/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in step(self, closure)
    551             with torch.enable_grad():
    552                 closure()
--> 553         if self.pre_step():
    554             return self.original_optimizer.step()
    555         else:

/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in pre_step(self, closure)
    536         if self.grad_samples is None or len(self.grad_samples) == 0:
    537             return True
--> 538         self.clip_and_accumulate()
    539         if self._check_skip_next_step():
    540             self._is_last_step_skipped = True

/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in clip_and_accumulate(self)
    442             g.reshape(len(g), -1).norm(2, dim=-1) for g in self.grad_samples
    443         ]
--> 444         per_sample_norms = torch.stack(per_param_norms, dim=1).norm(2, dim=1)
    445         per_sample_clip_factor = (
    446             self.max_grad_norm / (per_sample_norms + 1e-6)

RuntimeError: stack expects each tensor to be equal size, but got [70] at entry 0 and [1] at entry 2
```
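For reference, the failing line stacks one per-sample-norm tensor per parameter, and every such tensor is expected to have length equal to the batch size. A standalone illustration of the same failure (hypothetical shapes, not Opacus code):

```python
import torch

batch_size = 70
# Each parameter contributes a 1-D tensor of per-sample gradient norms.
# If one module mistook another dimension for the batch, its tensor has a
# different length and torch.stack cannot combine them.
norms_ok = torch.rand(batch_size)   # batch dimension identified correctly
norms_bad = torch.rand(1)           # batch dimension misidentified
try:
    torch.stack([norms_ok, norms_ok, norms_bad], dim=1)
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size, but got [70] at entry 0 and [1] at entry 2
```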
HuanyuZhang commented 2 months ago

Hi, for Opacus to work, the input to every module must be consistent about which dimension holds the batch: either the first or the second dimension, and by default Opacus assumes the first (link). In your code, however, the permute in the forward pass moves the batch to a different position for some modules, so Opacus can no longer tell which dimension is the batch size. That is why the per-sample gradient shapes come out mismatched.
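To make that concrete, here is a hedged sketch of an encoder block that keeps the batch in dim 0 for every sub-module, so no permute is needed before the attention call. The class and argument names are illustrative, and it assumes a DPMultiheadAttention that accepts batch_first; if your Opacus version does not, the same principle applies, but every module and the batch_first argument of make_private must then agree on where the batch dimension sits.

```python
import torch
import torch.nn as nn
from opacus.layers import DPMultiheadAttention


class EncoderBlockSketch(nn.Module):
    """Illustrative only: keeps inputs as (batch, seq, embed) end to end,
    so Opacus always sees the batch in dim 0 (its batch_first=True default)."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        # Assumes DPMultiheadAttention supports batch_first; on older Opacus
        # versions, keep the PyTorch default layout (seq, batch, embed)
        # consistently instead and pass batch_first=False to make_private.
        self.attn = DPMultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, embed) -- no permute, so the batch stays in dim 0
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)
```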

HuanyuZhang commented 1 month ago

Closing the issue due to no response. Feel free to re-open if the question is unresolved.