pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

Optimizer.step() fails with/due to register_backward_hook #246

Closed donarni closed 2 years ago

donarni commented 3 years ago

❓ Questions and Help

Running BiDAF after attaching the privacy_engine raises the following warning:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:974: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.

warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "

Both model.forward(**batch) and loss.backward() execute without warnings; however, optimizer.step() and optimizer.virtual_step() terminate with the following error:

RuntimeError: stack expects each tensor to be equal size, but got [4] at entry 0 and [618] at entry 1

I'm using the DPLSTM layers provided by Opacus, see the Colab. If needed, I can provide the tensors to run the code.

Can anyone help me? I've implemented ULMFiT, ESIM, and other LSTM-based NNs without this issue; however, this is my first attempt to combine CNN-based char embeddings and LSTM-based word embeddings with Opacus.
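For context, the failing step looks roughly like this (a simplified sketch of my setup; BiDAF, criterion, train_dataset, train_loader and the PrivacyEngine hyperparameters are placeholders here, the full code is in the Colab and uses the Opacus 0.x API):

import torch
from opacus import PrivacyEngine

model = BiDAF(...)                        # placeholder: the model from the Colab (contains DPLSTM)
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.5)

privacy_engine = PrivacyEngine(
    model,
    batch_size=32,                        # placeholder values
    sample_size=len(train_dataset),
    alphas=[1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64)),
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)
privacy_engine.attach(optimizer)

for batch in train_loader:                # placeholder DataLoader
    optimizer.zero_grad()
    p1, p2 = model(**batch)               # runs fine (apart from the hook warning)
    loss = criterion(p1, p2)              # placeholder loss; runs fine
    loss.backward()                       # runs fine
    optimizer.step()                      # RuntimeError: stack expects each tensor to be equal size ...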

romovpa commented 3 years ago

@donarni Thanks for your feedback! The error looks strange. Could you please provide the tensors so I can reproduce and debug this?

One hypothesis I have after skimming your Colab is that packed sequences may be handled incorrectly somewhere.

donarni commented 3 years ago

Thank you for responding and your first suggestion!

Find the embedding vectors and a sample batch on mediafire.com. The batch is an 8-tuple consisting of the context (chars), context (words), context_word_lengths, question (chars), question (words), question_word_length, answer_start_token, answer_end_token.

Note that I pad word vectors to the maximum sentence length in the training data (fixed length). Maybe someone can help me rewrite the BiDAF model without packed sequences <3
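What I have in mind is roughly this: feed the padded (fixed-length) tensors straight into DPLSTM and mask the padded positions afterwards, instead of using pack_padded_sequence. An untested sketch (the sizes and the encode_padded helper are just for illustration):

import torch
from opacus.layers import DPLSTM

lstm = DPLSTM(input_size=100, hidden_size=100, batch_first=True)

def encode_padded(x, lengths):
    # x: (batch, max_len, input_size), already padded to a fixed max_len
    out, _ = lstm(x)                                         # (batch, max_len, hidden)
    mask = (torch.arange(x.size(1), device=x.device)[None, :]
            < lengths[:, None]).float()                      # (batch, max_len)
    return out * mask.unsqueeze(-1)                          # zero out padded positions

x = torch.randn(2, 50, 100)                                  # toy padded batch
lengths = torch.tensor([50, 37])
enc = encode_padded(x, lengths)                              # (2, 50, 100)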

romovpa commented 3 years ago

Ok, that's quite an interesting case.

The problem is that when a layer is applied more than once, the grad sampler by default doesn't sum the partial gradients of all applications within the same example. As a result, for some layers we get not B per-sample grads but B*(number_of_calls), where B is the batch size.

Here are the sizes of grad_sample for each parameter after loss.backward(). Script + data for local debugging.
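For reference, something like this prints them (a sketch; it assumes the 0.x privacy engine is attached, so every trainable parameter carries a .grad_sample tensor after the backward pass):

def print_grad_sample_shapes(model):
    # The first dimension should equal the batch size B; here some layers
    # instead show B * (number of times the layer was applied).
    for name, p in model.named_parameters():
        gs = getattr(p, "grad_sample", None)
        if gs is not None:
            print(f"{name:50s} {list(gs.shape)}")

# after loss.backward():
# print_grad_sample_shapes(model)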

In DPLSTM we use RNNLinear, a clone of nn.Linear that uses a different grad sampler to accumulate the partial grads.

To fix highway_network and att_flow_layer we can replace nn.Linear -> opacus.layers.dp_rnn.RNNLinear (a sketch follows below). To fix char_emb we need a similar modification for nn.Embedding. And I don't understand yet why grad_sample for char_conv has length 618.
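For the highway layers the replacement could look like this (a sketch only; the 200 matches the highway layer sizes in this model, and RNNLinear is a drop-in clone of nn.Linear):

import torch
from torch import nn
from opacus.layers.dp_rnn import RNNLinear

class Highway(nn.Module):
    # RNNLinear's grad sampler accumulates the partial per-sample grads when
    # the layer is applied more than once per example, instead of stacking
    # them as extra "examples".
    def __init__(self, dim=200):
        super().__init__()
        self.linear = nn.Sequential(RNNLinear(dim, dim), nn.ReLU())
        self.gate = nn.Sequential(RNNLinear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        g = self.gate(x)
        return g * self.linear(x) + (1 - g) * x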

We probably need to update the grad sampler to handle this automatically. One tricky thing we discussed with @ffuuugor recently: multiple applications of the same layer can also occur in a siamese network, where the input consists of more than one example. That case should be handled differently.

donarni commented 3 years ago

Thank you for the explanation. As a beginner, I am probably better advised to switch to a different architecture.

Speaking of different architectures, I found the same error when applying the privacy_engine to ARC-I (CNN-based) and ESIM (LSTM-based) on the SNLI dataset. The code can be found on MatchZoo. If interested, I can provide some tensors :)

Unfortunately, this problem basically limits Opacus to Transformer-based models for NLP.

romovpa commented 3 years ago

@donarni I will be able to help create a workaround for your BiDAF code next week. Does that work for you?

You've unearthed an important limitation. Opacus is especially useful for preserving privacy when training language models that generate text, so it is not good that it can't handle the mentioned architectures out of the box.

That said, this doesn't seem like a fundamental limitation to me. I believe we can handle the case where the layers share parameters without significant refactoring.

@alexandresablayrolles @ffuuugor @karthikprasad (Don't know who is best to ask) Could someone briefly explain why we don't use create_or_accumulate_grad_sample for nn.Linear (like we do for RNNLinear)? What will break if we change nn.Linear's grad sampler to RNNLinear's?

romovpa commented 3 years ago

@donarni

Speaking of different architectures, I found the same error when applying the privacy_engine to ARC-I (CNN-based) and ESIM (LSTM-based) on the SNLI dataset. The code can be found on MatchZoo. If interested, I can provide some tensors :)

I would appreciate it if you could share the code with the tensors. It would be useful for our discussion of how we should update the grad sampler.

Also, I believe one of these architectures is worth an official example. Are there public datasets to train them on?

donarni commented 3 years ago

@romovpa your help would be much appreciated! As you may have guessed, we are currently experimenting with DP-SGD for textual analysis beyond topic classification and sentiment analysis tasks, and beyond fastText or BERT.

I extracted the code from my pipeline. Please ignore some unrelated code and the fact that I'm using a legacy version of torchtext. By the way, .as_iterable() from Data returns a DataLoader with a UniformWithReplacementSampler.

Textual Entailment: ESIM for SNLI, with the SNLI dataset converted to .csv (mediafire.com). Question Answering: BiDAF for SQuAD, with the SQuAD dataset converted to .csv (mediafire.com).

karthikprasad commented 3 years ago

@alexandresablayrolles @ffuuugor @karthikprasad (Don't know who is best to ask) Could someone briefly explain why we don't use create_or_accumulate_grad_sample for nn.Linear (like we do for RNNLinear)? What will break if we change nn.Linear's grad sampler to RNNLinear's?

IIRC, the intention was to use create_or_extend_grad_sample everywhere because we don't want to overwrite the grad_samples across multiple calls, specifically with virtual steps. The exception is RNNLinear, where the same layer is applied repeatedly by virtue of being recurrent, hence the need for create_or_accumulate_grad_sample.
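To illustrate the difference in behavior (a toy sketch, not the actual Opacus code): extend treats each call as new examples and concatenates along the batch dimension, while accumulate treats each call as another pass over the same examples and sums.

import torch

def create_or_extend(param, grad_sample):
    # ordinary layers / virtual steps: each call brings new examples,
    # so concatenate along the batch dimension
    if getattr(param, "grad_sample", None) is None:
        param.grad_sample = grad_sample
    else:
        param.grad_sample = torch.cat([param.grad_sample, grad_sample], dim=0)

def create_or_accumulate(param, grad_sample):
    # RNNLinear: the same examples are seen on every timestep,
    # so sum the partial per-sample grads
    if getattr(param, "grad_sample", None) is None:
        param.grad_sample = grad_sample
    else:
        param.grad_sample = param.grad_sample + grad_sample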

ffuuugor commented 3 years ago

If I'm not mistaken, RNNLinear is no longer relevant with the new implementations.

It was introduced because we needed a way to distinguish between layers where another iteration means a new batch and layers where another iteration is just another pass over the same data. The distinction is important because it affects how we accumulate gradients: in recurrent nets we accumulate gradients over multiple iterations, while in simple one-pass nets we use the new data to compute new per-sample gradients, keeping the old ones intact.

The reason we don't need that anymore is GradSampleModule. We can now detect the difference between recurrent calls and next-batch calls (see grad_sample_module:270-275) and behave accordingly for all layers.
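A quick way to see this (a sketch against the 1.0-style API; the toy Reuse module is made up): a layer reused twice within one forward pass still produces a batch-sized grad_sample, because the repeated calls are detected and their partial grads are accumulated per example.

import torch
from torch import nn
from opacus.grad_sample import GradSampleModule

class Reuse(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)            # applied twice in forward

    def forward(self, x):
        return self.lin(torch.relu(self.lin(x)))

model = GradSampleModule(Reuse())
x = torch.randn(8, 4)                         # batch size 8
model(x).sum().backward()
for name, p in model.named_parameters():
    print(name, tuple(p.grad_sample.shape))   # first dim is 8, not 16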

romovpa commented 3 years ago

Thanks @ffuuugor! I tried experimental_v1.0, and indeed there is no problem with layers called more than once.

The problem with the layer char_conv remains. Here is how it works.

self.char_conv = nn.Sequential(nn.Conv2d(1, 100, (8, 5)), nn.ReLU())

def char_emb_layer(x):
    """
    :param x: (batch, seq_len, word_len)
    :return: (batch, seq_len, char_channel_size)
    """
    batch_size = x.size(0)
    # (batch, seq_len, word_len, char_dim)
    x = self.dropout(self.char_emb(x))
    # (batch, seq_len, char_dim, word_len)
    x = x.transpose(2, 3)
    # (batch * seq_len, 1, char_dim, word_len)
    x = x.view(-1, 8, x.size(3)).unsqueeze(1)
    # (batch * seq_len, char_channel_size, 1, conv_len) -> (batch * seq_len, char_channel_size, conv_len)
    x = self.char_conv(x).squeeze()
    # (batch * seq_len, char_channel_size, 1) -> (batch * seq_len, char_channel_size)
    x = F.max_pool1d(x, x.size(2)).squeeze()
    # (batch, seq_len, char_channel_size)
    x = x.view(batch_size, -1, 100)
    return x

This transformation is probably the answer: the view merges the batch and seq_len dimensions into a single leading dimension before the conv, so the grad sampler treats every word as a separate example.

After the backward step with batch_size=2 we get the following grad_sample sizes:

_module.char_conv.0.weight                         [580, 100, 1, 8, 5]
_module.char_conv.0.bias                           [580, 100]
_module.char_emb.weight                            [2, 140, 8]
_module.highway_linear0.0.weight                   [2, 200, 200]
_module.highway_linear0.0.bias                     [2, 200]
_module.highway_gate0.0.weight                     [2, 200, 200]
_module.highway_gate0.0.bias                       [2, 200]
...
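A tiny illustration of where the 580 comes from (580 = 2 * 290, so seq_len is presumably 290 in this batch; word_len here is just a guess):

import torch
from torch import nn

batch, seq_len, char_dim, word_len = 2, 290, 8, 15
conv = nn.Conv2d(1, 100, (8, 5))

x = torch.randn(batch, seq_len, char_dim, word_len)
x = x.view(-1, char_dim, word_len).unsqueeze(1)   # (batch * seq_len, 1, 8, word_len)
out = conv(x)                                     # (580, 100, 1, word_len - 4)
print(x.size(0))                                  # 580 -- the "batch" size the grad sampler sees for char_conv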

Modified code: https://gist.github.com/romovpa/cdbfae8f6b601b7e41472c2bf4f8a777
@donarni fyi

karthikprasad commented 2 years ago

I am closing this issue as it seems to have been resolved. Please feel free to reopen if this isn't the case.