@donarni Thanks for your feedback! The error looks strange. Could you please provide the tensors so I can reproduce and debug this?
One hypothesis I have after skimming your colab is that packed sequences are handled incorrectly somewhere.
Thank you for responding and your first suggestion!
Find the embedding vectors and a sample batch on mediafire.com. The batch is an 8-tuple consisting of the context (chars), context (words), context_word_lengths, question (chars), question (words), question_word_length, answer_start_token, answer_end_token.
Note that I pad word vectors to the maximum sentence length in the training data (fixed length). Maybe someone can help me rewrite the BiDAF model without packed sequences <3
Ok, that's quite an interesting case.
The problem is that when a layer is applied more than once, the grad sampler by default doesn't sum the partial gradients of all applications within the same example. As a result, for some layers we get not B per-sample grads but B*(number_of_calls), where B is the batch size.
Here are the sizes of grad_sample for each parameter after loss.backward().
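To illustrate the failure mode, here is a minimal sketch, assuming a recent Opacus where GradSampleModule is importable (it is not the code from the issue): the same nn.Linear applied twice in one forward pass makes the grad-sampler hooks fire twice, so, depending on the Opacus version, grad_sample can end up with B * num_calls rows instead of B.

import torch
import torch.nn as nn
from opacus.grad_sample import GradSampleModule

class SharedLinear(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)  # the same layer is called twice below

    def forward(self, x):
        return self.fc(self.fc(x))

model = GradSampleModule(SharedLinear())
x = torch.randn(2, 4)  # batch size B = 2
model(x).sum().backward()

for name, p in model.named_parameters():
    gs = p.grad_sample
    # Either [B, ...] or [B * num_calls, ...] in the first dimension,
    # depending on how the grad sampler handles repeated calls.
    print(name, gs.shape if torch.is_tensor(gs) else [g.shape for g in gs])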
Script + data for local debug
In DPLSTM we use RNNLinear, which is a clone of nn.Linear that uses a different grad sampler to accumulate the partial grads. To fix highway_network and att_flow_layer we can replace nn.Linear with opacus.layers.dp_rnn.RNNLinear (see the sketch below). To fix char_emb we need a similar modification for nn.Embedding. And I don't understand yet why grad_sample for char_conv has length 618.
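Here is a hedged sketch of the swap; the layer structure is illustrative, not the exact BiDAF code. RNNLinear has the same constructor signature as nn.Linear, so it is a drop-in replacement:

import torch.nn as nn
from opacus.layers.dp_rnn import RNNLinear

class Highway(nn.Module):
    """Simplified highway layer (illustrative names, not the exact BiDAF code)."""

    def __init__(self, hidden_size: int = 200):
        super().__init__()
        # was: nn.Linear(hidden_size, hidden_size)
        self.linear = nn.Sequential(RNNLinear(hidden_size, hidden_size), nn.ReLU())
        self.gate = nn.Sequential(RNNLinear(hidden_size, hidden_size), nn.Sigmoid())

    def forward(self, x):
        g = self.gate(x)
        # Gated mix of the transformed input and the identity path.
        return g * self.linear(x) + (1 - g) * x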
We probably need to update the grad sampler to handle this automatically. One tricky thing we discussed with @ffuuugor recently: multiple applications of the same layer can also occur in a siamese network, where the input consists of more than one example. That case should be handled differently.
Thank you for the explanation. As a beginner, I am probably better advised to switch to a different architecture.
Speaking of different architectures, I found the same error when applying the privacy_engine to ARC-I (CNN-based) and ESIM (LSTM-based) for SNLI dataset. Code can be found on MatchZoo. If interested, I can provide some Tensors :)
Unfortunately, this problem basically limits Opacus to Transformer-based models for NLP.
@donarni I will be able to help create a workaround for your BiDAF code next week. Does that work for you?
You unearthed an important limitation. Opacus is especially useful for preserving privacy when training language models that generate text. It is not good that it can't handle the mentioned architectures out of the box.
That said, this doesn't seem to be a fundamental limitation. I believe we can handle the case where layers share parameters without significant refactoring.
@alexandresablayrolles @ffuuugor @karthikprasad (I don't know who is best to ask) Could someone briefly explain why we don't use create_or_accumulate_grad_sample for nn.Linear (like we do for RNNLinear)? What will break if we change nn.Linear's grad sampler to RNNLinear's?
@donarni
Speaking of different architectures, I found the same error when applying the privacy_engine to ARC-I (CNN-based) and ESIM (LSTM-based) for SNLI dataset. Code can be found on MatchZoo. If interested, I can provide some Tensors :)
I would appreciate it if you could share the code with tensors. It could be useful in our discussion of how we should update the grad sampler.
Also, I believe one of these architectures is worth having as an official example. Are there public datasets to train them on?
@romovpa your help would be much appreciated! As you may have guessed, we are currently experimenting with DP-SGD for textual analysis beyond topic classification and sentiment analysis tasks, and beyond fastText or BERT.
I extracted the code from my pipeline. Please ignore some unrelated code and the fact that I'm using a legacy version of torchtext. By the way, .as_iterable() from Data returns a DataLoader with UniformWithReplacementSampler.
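For reference, a minimal sketch of what such a loader looks like; the dataset and batch size below are placeholders, and the real pipeline builds this inside Data.as_iterable():

import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus.utils.uniform_sampler import UniformWithReplacementSampler

# Dummy stand-in dataset; the real one comes from my preprocessing pipeline.
train_dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

expected_batch_size = 32  # placeholder value
batch_sampler = UniformWithReplacementSampler(
    num_samples=len(train_dataset),                        # dataset size
    sample_rate=expected_batch_size / len(train_dataset),  # per-example inclusion probability
)
train_loader = DataLoader(train_dataset, batch_sampler=batch_sampler)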
Textual Entailment: ESIM for SNLI, with the SNLI dataset converted to .csv (mediafire.com).
Question Answering: BiDAF for SQUAD, with the SQUAD dataset converted to .csv (mediafire.com).
@alexandresablayrolles @ffuuugor @karthikprasad (I don't know who is best to ask) Could someone briefly explain why we don't use create_or_accumulate_grad_sample for nn.Linear (like we do for RNNLinear)? What will break if we change nn.Linear's grad sampler to RNNLinear's?
IIRC, the intention was to use create_or_extend_grad_sample everywhere because we don't want to overwrite the grad_samples across multiple calls, specifically with virtual steps. The exception is RNNLinear, where the same layer needs to be updated repeatedly by virtue of being recurrent, hence the need for create_or_accumulate_grad_sample.
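Roughly, the two strategies look like this (a conceptual sketch, not the exact Opacus source):

import torch

def create_or_extend_grad_sample(param, grad_sample):
    # New data (e.g. another virtual step): concatenate along the batch
    # dimension, keeping previously computed per-sample grads intact.
    if hasattr(param, "grad_sample"):
        param.grad_sample = torch.cat([param.grad_sample, grad_sample], dim=0)
    else:
        param.grad_sample = grad_sample

def create_or_accumulate_grad_sample(param, grad_sample):
    # Repeated call on the same batch (recurrent layer): sum the partial
    # contributions belonging to the same examples.
    if hasattr(param, "grad_sample"):
        param.grad_sample = param.grad_sample + grad_sample
    else:
        param.grad_sample = grad_sample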
If I'm not mistaken, RNNLinear is no longer relevant with the new implementations.
It was introduced because we needed a way to distinguish between layers where another iteration meant a new batch and layers where another iteration meant just another pass over the same data. The distinction is important because it affects how we accumulate gradients: in recurrent nets we accumulate gradients over multiple iterations, while in simple one-pass nets we'd use new data to compute new per-sample gradients, keeping the old ones intact.
The reason we don't need that anymore is GradSampleModule. We can now detect the difference between recurrent calls and next batch calls (see grad_sample_module:270-275), and behave accordingly for all layers
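In pseudo-code, the idea is roughly the following (illustrative only, not the actual lines referenced above):

def on_forward(param):
    # Count how many times this parameter's module is used in the current batch.
    param._forward_counter = getattr(param, "_forward_counter", 0) + 1

def on_backward(param, partial_grad_sample):
    # Accumulate partial per-sample grads from repeated calls on the same batch.
    current = getattr(param, "_current_grad_sample", None)
    param._current_grad_sample = (
        partial_grad_sample if current is None else current + partial_grad_sample
    )
    param._forward_counter -= 1
    if param._forward_counter == 0:
        # Last backward hook for this batch: promote to the final per-sample grad.
        param.grad_sample = param._current_grad_sample
        param._current_grad_sample = None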
Thanks @ffuuugor! I tried experimental_v1.0, and indeed there is no problem with layers called more than once.
The problem with the layer char_conv remains. Here is how it works.
self.char_conv = nn.Sequential(nn.Conv2d(1, 100, (8, 5)), nn.ReLU())
def char_emb_layer(x):
    """
    Nested helper inside the model's forward pass; self is captured from the
    enclosing scope (self.dropout, self.char_emb, self.char_conv).

    :param x: (batch, seq_len, word_len)
    :return: (batch, seq_len, char_channel_size)
    """
    batch_size = x.size(0)
    # (batch, seq_len, word_len, char_dim)
    x = self.dropout(self.char_emb(x))
    # (batch, seq_len, char_dim, word_len)
    x = x.transpose(2, 3)
    # (batch * seq_len, 1, char_dim, word_len)
    x = x.view(-1, 8, x.size(3)).unsqueeze(1)
    # (batch * seq_len, char_channel_size, 1, conv_len) -> (batch * seq_len, char_channel_size, conv_len)
    x = self.char_conv(x).squeeze()
    # (batch * seq_len, char_channel_size, 1) -> (batch * seq_len, char_channel_size)
    x = F.max_pool1d(x, x.size(2)).squeeze()
    # (batch, seq_len, char_channel_size)
    x = x.view(batch_size, -1, 100)
    return x
This reshaping is probably the answer: x.view(-1, 8, x.size(3)) merges the batch and sequence dimensions, so char_conv sees batch * seq_len inputs and the grad sampler produces that many per-sample gradients instead of batch_size of them.
After the backward step with batch_size=2 we get the following grad_sample sizes:
_module.char_conv.0.weight [580, 100, 1, 8, 5]
_module.char_conv.0.bias [580, 100]
_module.char_emb.weight [2, 140, 8]
_module.highway_linear0.0.weight [2, 200, 200]
_module.highway_linear0.0.bias [2, 200]
_module.highway_gate0.0.weight [2, 200, 200]
_module.highway_gate0.0.bias [2, 200]
...
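A small helper along these lines can print such a listing, assuming the model is already wrapped by Opacus so that parameters carry grad_sample after loss.backward():

import torch

def print_grad_sample_shapes(model):
    # A first dimension that differs from the batch size points at the
    # problematic layer (here: char_conv).
    for name, p in model.named_parameters():
        gs = getattr(p, "grad_sample", None)
        if gs is not None:
            print(name, list(gs.shape) if torch.is_tensor(gs) else [list(g.shape) for g in gs])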
Modified code: https://gist.github.com/romovpa/cdbfae8f6b601b7e41472c2bf4f8a777
@donarni fyi
I am closing this issue as it seems to have been resolved. Please feel free to reopen if this isn't the case.
❓ Questions and Help
Running BiDAF after hooking with privacy_engine raises the following warning:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:974: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
Both model.forward(**batch) and loss.backward() execute without warnings, however optimizer.step() and optimizer.virtual_step() terminate with the following error:

RuntimeError: stack expects each tensor to be equal size, but got [4] at entry 0 and [618] at entry 1
I'm using DPLSTM layers provided by Opacus, see Colab. If needed, I can provide the Tensors to run the code.
Can anyone help me? I've implemented ULMFit, ESIM and other LSTM-based NNs without this issue, however this is my first attempt to incorporate CNN-based char embeddings and LSTM-based word embeddings with Opacus.