Open beneyal opened 5 years ago
Also hitting this. It's related to the magic around flatten_parameters, which apparently has changed in PyTorch 1.0.
I have not yet had time to look into this in detail, but I will probably try to dig deeper.
[EDIT - corrected non-train-time behavior (oops!)]
Here's my hacky (very) solution, which I think is working ok (and should work with both PyTorch 1.0 and earlier versions). It does a little more tensor copying, but in practice they tend not to be huge tensors and it's a once-per-minibatch thing, so not much overall impact (at least in my usage):
import torch
from torch.nn import Parameter


class BackHook(torch.nn.Module):
    """Identity module whose backward hook triggers a user-supplied callback."""

    def __init__(self, hook):
        super(BackHook, self).__init__()
        self._hook = hook
        self.register_backward_hook(self._backward)

    def forward(self, *inp):
        # Pass the inputs through unchanged; the module exists only for its hook.
        return inp

    @staticmethod
    def _backward(self, grad_in, grad_out):
        # Invoked as hook(module, grad_input, grad_output), so `self` is the module.
        self._hook()
        return None


class WeightDrop(torch.nn.Module):
    """
    Implements drop-connect, as per Merity et al https://arxiv.org/abs/1708.02182
    """

    def __init__(self, module, weights, dropout=0, variational=False):
        super(WeightDrop, self).__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout
        self.variational = variational
        self._setup()
        self.hooker = BackHook(lambda: self._backward())

    def _setup(self):
        # Keep a raw (unmasked) copy of each named weight as a parameter of the wrapper.
        for name_w in self.weights:
            print('Applying weight drop of {} to {}'.format(self.dropout, name_w))
            w = getattr(self.module, name_w)
            self.register_parameter(name_w + '_raw', Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            if self.training:
                # One dropout value per row, broadcast across that row of the weight.
                mask = raw_w.new_ones((raw_w.size(0), 1))
                mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
                w = mask.expand_as(raw_w) * raw_w
                setattr(self, name_w + "_mask", mask)
            else:
                w = raw_w
            # Overwrite the wrapped module's weight in place with the (masked) raw values.
            rnn_w = getattr(self.module, name_w)
            rnn_w.data.copy_(w)

    def _backward(self):
        # transfer gradients from embeddedRNN to raw params
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            rnn_w = getattr(self.module, name_w)
            raw_w.grad = rnn_w.grad * getattr(self, name_w + "_mask")

    def forward(self, *args):
        self._setweights()
        return self.module(*self.hooker(*args))
@sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?
@sdraper-CS Thanks for your solution, BTW, it seems to work (I am still running it, waiting for the results to see if they match the paper). However, I got an error on this line, because pickle didn't like the lambda:
self.hooker = BackHook(lambda: self._backward())
Changing the line to
self.hooker = BackHook(self._backward)
solved the error.
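The pickling failure can be reproduced without PyTorch. The sketch below is illustrative only: `Wrapper` and `can_pickle` are hypothetical names, not part of the thread's code. It stores a callback on an object either as a lambda or as a bound method and checks whether the standard `pickle` module can serialize the object.

```python
import pickle


class Wrapper:
    """Hypothetical stand-in for WeightDrop; only the stored callback matters."""

    def __init__(self, use_lambda):
        if use_lambda:
            # Lambdas have no importable name, so pickle cannot reference them.
            self.callback = lambda: self._backward()
        else:
            # A bound method is pickled as (instance, method name).
            self.callback = self._backward

    def _backward(self):
        return "backward called"


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


lambda_ok = can_pickle(Wrapper(use_lambda=True))
method_ok = can_pickle(Wrapper(use_lambda=False))
print("lambda callback picklable:", lambda_ok)
print("bound-method callback picklable:", method_ok)
```

When run as a script, the lambda variant is the one that fails, which is consistent with the fix above of passing self._backward directly.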
Encountered the following issue with the solution from https://github.com/salesforce/awd-lstm-lm/issues/86#issuecomment-447910610:
AttributeError: 'WeightDrop' object has no attribute 'weight_hh_l0_mask'
However, this PR https://github.com/pytorch/pytorch/pull/15766 seems to be working perfectly. I haven't tested it completely though.
@sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?
The issue is that the changes in PyTorch 1.0 make it difficult to emplace a new tensor for the weights on each batch, so instead the idea is to mask the elements of the existing weights tensor in-situ. However, this means that the gradients also need to be masked on the back-pass (because we didn't actually forward through F.dropout), so the BackHook hooks the backward pass and does a similar masking on the gradient tensors. I'm still not 100% sure I got this completely right, so I'd be interested in your eventual results (seems to be behaving correctly and in the way you'd expect for a regularizer for me, but that's weak evidence of actual correctness!).
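As a small sanity check of the idea described here, the sketch below compares two ways of getting the gradient for the raw weights. The reference path lets autograd differentiate through `mask * raw`. The manual path mimics the workaround: the masked values are copied into a separate leaf tensor, and the leaf's gradient is multiplied by the same mask afterwards. The tensor shapes and the toy loss are assumptions chosen for illustration; this is not the thread's LSTM setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

raw = torch.randn(4, 3, requires_grad=True)   # stands in for a *_raw weight
x = torch.randn(3)                            # arbitrary input for a toy loss
# Row-wise dropout mask, built the same way as in _setweights.
mask = F.dropout(torch.ones(4, 1), p=0.5, training=True)

# Reference: autograd differentiates through the masking itself.
w = mask.expand_as(raw) * raw
loss = (w @ x).pow(2).sum()
loss.backward()
reference_grad = raw.grad.clone()

# Manual: masked values live in a separate leaf (like the RNN's own weight),
# and the gradient is masked after the backward pass.
leaf = (mask * raw).detach().requires_grad_(True)
manual_loss = (leaf @ x).pow(2).sum()
manual_loss.backward()
manual_grad = leaf.grad * mask

print(torch.allclose(reference_grad, manual_grad))
```

For this toy case the two gradients agree, as the chain rule for w = mask * raw predicts. It says nothing about whether the hook fires at the right moment inside a real RNN.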
I have run the word-PTB LSTM model, and reached 74.54 PPL at the point where the code changes the optimizer to ASGD (and then it broke with KeyError: 'ax' on prm.data = optimizer.state[prm]['ax'].clone()). That is close to what I got earlier (around 70-72), though it does not really agree with the ablation analysis in the paper, which reports 66 without ASGD (but maybe with fine-tuning, so who knows).
BTW QRNN stops at around 770 PPL, so that also needs to be properly updated to 1.0...
I guess I'll just go back to 0.4 for now to be on the safe side.
@DavidNemeskey I am now pretty confident that the approach is working correctly. I have retrained an NER model based on the Lample paper from 2017 with my modified version of this class, and am able to recover the same model performance as before
@sdraper-CS I ran both the original and your code under Pytorch 0.4, and found the following:
main.py to enumerate named_parameters() and exclude everything that has _raw in the name, because apparently those parameters are not part of optimizer.state. Do you know why? Shouldn't everything that is returned by parameters() (and has a grad) be optimized?
So I guess it works, it's just that the hyperparameters might need recalibration.
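One possible explanation, offered as an assumption rather than something confirmed in this thread: PyTorch optimizers create their per-parameter state lazily inside step() and skip any parameter whose .grad is None. If the _raw parameters never received a gradient in that run, ASGD would have no 'ax' entry for them. The sketch below shows that mechanism with two hypothetical parameters, `used` and `unused`.

```python
import torch

used = torch.nn.Parameter(torch.ones(2))
unused = torch.nn.Parameter(torch.ones(2))   # never takes part in the loss

optimizer = torch.optim.ASGD([used, unused], lr=0.1)

loss = (used ** 2).sum()
loss.backward()          # only `used` gets a gradient
optimizer.step()         # state is created here, for parameters with a grad

used_has_ax = 'ax' in optimizer.state.get(used, {})
unused_has_ax = 'ax' in optimizer.state.get(unused, {})
print("used has 'ax':", used_has_ax)
print("unused has 'ax':", unused_has_ax)
print("unused.grad is None:", unused.grad is None)
```

Checking whether the _raw parameters have a non-None .grad right before optimizer.step() would show whether this is what happened in the run above.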
@DavidNemeskey That's odd. I'm not sure why the _raw parameters would not be in the optimizer (as you say, anything that model.parameters() enumerates should be in the optimized set). However, they will always receive 0 gradient [directly anyway] since they are not part of the forward-prop'd graph (that's why we have to copy values from them on forward, and to their gradients on backward). Because we perform this copying to the gradient on the back hook, they SHOULD have gradients by the time the optimizer sees them (and it should have them in its optimization set - the presence of the 'real' parameters there also is redundant since any updates to THOSE weights are discarded by the value copy on the forward pass [removing those from the optimized set is an optimization I really should make sometime]).
I'll run my code through and take a look at the optimizer set I see in the debugger (at least for CPU, though at that level it shouldn't matter) to see if I can see the same issue as you.
It's POSSIBLE that this may manifest with some optimizers but not others - I have not experimented widely (my model just uses SGD with Nesterov momentum + an annealing schedule). I'll see what I can find and get back to you when I have more information
@sdraper-CS I did another experiment and replaced the line
self.register_parameter(name_w + '_raw', Parameter(w.data))
with just
setattr(self, name_w + '_raw', w.data)
i.e. the _raw things are now not parameters at all. Consequently, ASGD doesn't blow up (as the _raw tensors are not returned by parameters()), AND the code works (i.e. I get similar results to when I manually excluded _raw parameters in main.py). I am still trying to understand why...
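The difference between the two lines can be seen in isolation. In the sketch below, `Holder` is a hypothetical module (not from the thread) that stores a tensor either through register_parameter or through a plain setattr, and then lists what named_parameters() and state_dict() report.

```python
import torch
from torch.nn import Parameter


class Holder(torch.nn.Module):
    """Hypothetical module storing `weight_raw` as a Parameter or as a plain tensor."""

    def __init__(self, as_parameter):
        super().__init__()
        w = torch.zeros(2, 2)
        if as_parameter:
            self.register_parameter("weight_raw", Parameter(w))
        else:
            # A plain tensor attribute is not tracked by nn.Module.
            setattr(self, "weight_raw", w)


registered_names = [name for name, _ in Holder(True).named_parameters()]
plain_names = [name for name, _ in Holder(False).named_parameters()]
plain_in_state_dict = "weight_raw" in Holder(False).state_dict()

print(registered_names)
print(plain_names)
print(plain_in_state_dict)
```

With the setattr variant the tensor is invisible to parameters(), so an optimizer built from model.parameters() never touches it, and it is also absent from state_dict(). This shows why ASGD no longer trips over it, but it does not by itself explain why training still works.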
@DavidNemeskey That really doesn't make sense to me! Stepping through in the debugger I AM getting the _raw variants in the optimizer params (for both SGD and Adam), and it SHOULD be necessary to register the raw variants as Parameters (so I cannot explain your observations). To provide some framework for analysis, here is a description of exactly how it is intended to work during the forward and backward training passes:
Setup:
Forward pass:
Backward pass:
It is thus critical that the raw parameters are part of the optimized set. If they were not, the expected behavior would be that we never learn anything, since the raw weights would not be updated and we'd continue to copy whatever value they were initialized with into the underlying LSTM on each forward pass.
The above analysis does highlight one subtle point, which is that any weight initialization you intend to apply to the LSTM needs to be applied BEFORE the LSTM is wrapped inside a WeightDrop wrapper (else you'll be initializing weights that end up not actually being used, and the effective initialization will be zeros). I also think I might have a bug in the gradient normalization, since the mask produced by Dropout is weighted (so the non-0 elements have a weight that normalizes the mean), but because I reuse the same mask to mask the gradients on the back-pass I'm probably double-counting the normalization (I'll need to look into that more).
Sorry I cannot explain your exact findings, but hopefully the above explanation will help your analysis of what is happening in your case.
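To make the normalization concern concrete, the sketch below inspects the values that F.dropout produces when applied to a tensor of ones, using an assumed p = 0.5 so the numbers are exact. Kept entries are scaled to 1 / (1 - p), so applying the same mask twice to one quantity would scale the kept entries by 1 / (1 - p)^2.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

p = 0.5                                   # assumed dropout probability
mask = F.dropout(torch.ones(1000, 1), p=p, training=True)

kept_value = 1.0 / (1.0 - p)              # scale of surviving entries
mask_values = set(mask.unique().tolist())
twice_values = set((mask * mask).unique().tolist())

print("mask values:", sorted(mask_values))
print("mask applied twice:", sorted(twice_values))
```

Note that for w = mask * raw the chain rule calls for exactly one factor of the mask in the gradient with respect to raw, so one multiplication on the forward pass and one on the gradient is not by itself a double count. Whether the posted class ends up with an extra factor depends on what rnn_w.grad contains when the hook runs, which this sketch does not check.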
Has anyone checked the fastai implementation for PyTorch 1.0? https://github.com/fastai/fastai/blob/master/fastai/text/models/awd_lstm.py
@NProkoptsev You probably already know this by now, but just for everyone else who sees this: the fastai implementation works for PyTorch 1.0.
@daemon You are right, it works, but it cannot reproduce the numbers in the paper either. I think that boat has sailed with Pytorch 0.4; at least until someone does a full hyperparameter search for 1.0.
Hi,
When running
python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save PTB.pt
I get the following error:
I'm using PyTorch 1.0. Any idea why this is happening?
Thanks!