salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch
BSD 3-Clause "New" or "Revised" License

RuntimeError: shape '[5290000, 1]' is invalid for input of size 4600 #86

Open beneyal opened 5 years ago

beneyal commented 5 years ago

Hi,

When running python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save PTB.pt I get the following error:

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:179: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  self.dropout, self.training, self.bidirectional, self.batch_first)
Traceback (most recent call last):
  File "main.py", line 240, in <module>
    train()
  File "main.py", line 196, in train
    output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/awd-lstm-lm/model.py", line 81, in forward
    raw_output, new_h = rnn(raw_output, hidden[l])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/awd-lstm-lm/weight_drop.py", line 47, in forward
    return self.module.forward(*args)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 179, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: shape '[5290000, 1]' is invalid for input of size 4600

I'm using PyTorch 1.0. Any idea why this is happening?

Thanks!

sdraper-CS commented 5 years ago

Also hitting this. It's related to the magic around flatten_parameters, which apparently changed in PyTorch 1.0.

I have not yet had time to look into this in detail, but I will probably try to dig deeper.

sdraper-CS commented 5 years ago

[EDIT - corrected non-train-time behavior (oops!)]

Here's my (very) hacky solution, which I think is working ok (and should work with both PyTorch 1.0 and earlier versions). It does a little more tensor copying, but in practice they tend not to be huge tensors and it's a once-per-minibatch thing, so not much overall impact (at least in my usage):

import torch
from torch.nn import Parameter


class BackHook(torch.nn.Module):
    def __init__(self, hook):
        super(BackHook, self).__init__()
        self._hook = hook
        self.register_backward_hook(self._backward)

    def forward(self, *inp):
        return inp

    @staticmethod
    def _backward(self, grad_in, grad_out):
        # register_backward_hook passes the module itself as the first argument,
        # so this staticmethod receives it as `self` and just fires the
        # user-supplied callback.
        self._hook()
        return None

class WeightDrop(torch.nn.Module):
    """
    Implements drop-connect, as per Merity et al https://arxiv.org/abs/1708.02182
    """
    def __init__(self, module, weights, dropout=0, variational=False):
        super(WeightDrop, self).__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout
        self.variational = variational
        self._setup()
        self.hooker = BackHook(lambda: self._backward())

    def _setup(self):
        for name_w in self.weights:
            print('Applying weight drop of {} to {}'.format(self.dropout, name_w))
            w = getattr(self.module, name_w)
            self.register_parameter(name_w + '_raw', Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            if self.training:
                # One mask value per row of the weight matrix, broadcast across
                # the row; kept on the wrapper so the backward hook can apply
                # the same mask to the gradients.
                mask = raw_w.new_ones((raw_w.size(0), 1))
                mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
                w = mask.expand_as(raw_w) * raw_w
                setattr(self, name_w + "_mask", mask)
            else:
                w = raw_w
            # Copy the (masked) values into the RNN's existing weight tensor in
            # place rather than emplacing a new tensor, which is what broke
            # under PyTorch 1.0.
            rnn_w = getattr(self.module, name_w)
            rnn_w.data.copy_(w)

    def _backward(self):
        # Transfer the gradients from the embedded RNN's weights back to the
        # raw parameters, re-applying the dropout mask saved on the forward pass.
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            rnn_w = getattr(self.module, name_w)
            raw_w.grad = rnn_w.grad * getattr(self, name_w + "_mask")

    def forward(self, *args):
        self._setweights()
        return self.module(*self.hooker(*args))
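
For anyone who wants to smoke-test the wrapper, here's a minimal usage sketch (mine, not from the repo; the toy sizes are arbitrary). Note that the RNN input has to come out of some upstream op (an embedding here, embedding dropout in the real model) so that it carries a grad_fn for the BackHook to attach to:

import torch

rnn = torch.nn.LSTM(input_size=8, hidden_size=16)
wrapped = WeightDrop(rnn, ['weight_hh_l0'], dropout=0.5)
emb = torch.nn.Embedding(100, 8)

x = emb(torch.randint(0, 100, (5, 3)))   # (seq_len, batch) token ids
out, hidden = wrapped(x)
out.sum().backward()   # the back hook fills in wrapped.weight_hh_l0_raw.grad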

DavidNemeskey commented 5 years ago

@sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?

DavidNemeskey commented 5 years ago

@sdraper-CS Thanks for your solution, BTW, it seems to work (I am still running it, waiting for the results to see if they match the paper). However, I got an error on this line, because pickle didn't like the lambda:

self.hooker = BackHook(lambda: self._backward())

Changing the line to

self.hooker = BackHook(self._backward)

solved the error.
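
A minimal illustration of why the original line trips up pickling (and therefore torch.save on the whole model) - this is just a sketch of the general Python behaviour, not repo code:

import pickle

# A lambda has no importable qualified name, so pickling it fails...
pickle.dumps(lambda: None)   # raises pickle.PicklingError

# ...whereas a bound method such as self._backward pickles together with its
# instance, which is why the one-line change above fixes the error.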

ink-pad commented 5 years ago

Encountered the following issue with the solution in https://github.com/salesforce/awd-lstm-lm/issues/86#issuecomment-447910610:

AttributeError: 'WeightDrop' object has no attribute 'weight_hh_l0_mask'

However, this PR https://github.com/pytorch/pytorch/pull/15766 seems to be working perfectly. I haven't tested it completely though.

sdraper-CS commented 5 years ago

> @sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?

The issue is that the changes in PyTorch 1.0 make it difficult to emplace a new tensor for the weights on each batch, so instead the idea is to mask the elements of the existing weights tensor in-situ. However, this means that the gradients also need to be masked on the back-pass (because we didn't actually forward through F.dropout), so the BackHook hooks the backward pass and does a similar masking on the gradient tensors. I'm still not 100% sure I got this completely right, so I'd be interested in your eventual results (seems to be behaving correctly and in the way you'd expect for a regularizer for me, but that's weak evidence of actual correctness!)
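
A tiny sanity check of that chain-rule argument (a standalone sketch, not code from the repo): with w = mask * w_raw, the gradient that belongs on w_raw is the mask times the gradient autograd computes for the masked w, which is exactly what the back hook copies across by hand:

import torch

w_raw = torch.randn(4, 3, requires_grad=True)
mask = torch.nn.functional.dropout(torch.ones(4, 1), p=0.5, training=True)

w = mask.expand_as(w_raw) * w_raw      # what _setweights writes into the RNN
loss = (w ** 2).sum()
loss.backward()

# Autograd's gradient for w_raw equals mask * dL/dw (= mask * 2w here),
# matching what WeightDrop._backward reconstructs manually.
assert torch.allclose(w_raw.grad, mask.expand_as(w_raw) * (2 * w))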

DavidNemeskey commented 5 years ago

I have run the word-PTB LSTM model, and reached 74.54 PPL at the point where the code changes the optimizer to ASGD (and then it broke with KeyError: 'ax' on prm.data = optimizer.state[prm]['ax'].clone()). That is close to what I got earlier (around 70-72), though it does not really agree with the ablation analysis in the paper, which reports 66 without ASGD (but maybe with fine-tuning, so who knows).
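
If anyone just wants to get past that crash: the KeyError appears to come from the block in main.py that swaps ASGD's averaged weights ('ax') into the model before evaluation, which loops over model.parameters() and does prm.data = optimizer.state[prm]['ax'].clone(). A hedged workaround (my own tweak, not the repo's code; `model`, `optimizer` and `tmp` are assumed to match the names in main.py) is to skip any parameter the optimizer holds no 'ax' state for:

# Guarded version of the averaged-weight swap before evaluation.
tmp = {}
for prm in model.parameters():
    state = optimizer.state.get(prm, {})
    if 'ax' in state:
        tmp[prm] = prm.data.clone()
        prm.data = state['ax'].clone()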

BTW QRNN stops at around 770 PPL, so that also needs to be properly updated to 1.0...

I guess I'll just go back to 0.4 for now to be on the safe side.

sdraper-CS commented 5 years ago

@DavidNemeskey I am now pretty confident that the approach is working correctly. I have retrained an NER model based on the Lample paper from 2017 with my modified version of this class, and am able to recover the same model performance as before.

DavidNemeskey commented 5 years ago

@sdraper-CS I ran both the original and your code under PyTorch 0.4, and found the following:

So I guess it works, it's just that the hyperparameters might need recalibration.

sdraper-CS commented 5 years ago

@DavidNemeskey That's odd. I'm not sure why the _raw parameters would not be in the optimizer (as you say, anything that model.parameters() enumerates should be in the optimized set). They will always receive 0 gradient directly, since they are not part of the forward-prop'd graph - that's why we have to copy values from them on the forward pass and copy to their gradients on the backward pass. However, because we perform this copying to the gradient in the back hook, they SHOULD have gradients by the time the optimizer sees them (and it should have them in its optimization set). The presence of the 'real' parameters there is actually redundant, since any updates to THOSE weights are discarded by the value copy on the forward pass (removing them from the optimized set is an optimization I really should make sometime - see the sketch below).

I'll run my code through and take a look at the optimizer set in the debugger (at least for CPU, though at that level it shouldn't matter) to see if I can reproduce your issue. It's POSSIBLE that this manifests with some optimizers but not others - I have not experimented widely (my model just uses SGD with Nesterov momentum + an annealing schedule). I'll see what I can find and get back to you when I have more information.
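
For completeness, a sketch of that "remove the real parameters from the optimized set" optimization (my own hedged sketch, not something in the repo; `model` and `weight_drop_layers`, a list of the WeightDrop instances in the model, are assumed names):

# Hand the optimizer everything except the wrapped RNNs' own weight tensors,
# whose updates are overwritten from the *_raw copies on the next forward
# pass anyway.
skip = set()
for wd in weight_drop_layers:
    for name_w in wd.weights:
        skip.add(id(getattr(wd.module, name_w)))

params = [p for p in model.parameters() if id(p) not in skip]
optimizer = torch.optim.SGD(params, lr=0.1)   # lr is just a placeholder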

DavidNemeskey commented 5 years ago

@sdraper-CS I did another experiment and replaced the line

self.register_parameter(name_w + '_raw', Parameter(w.data))

with just

setattr(self, name_w + '_raw', w.data)

i.e. the _raw things are now not parameters at all. Consequently, ASGD doesn't blow up (as the _raw tensors are not returned by parameters()), AND the code works (i.e. I get similar results to when I manually excluded _raw parameters in main.py). I am still trying to understand why...

sdraper-CS commented 5 years ago

@DavidNemeskey That really doesn't make sense to me! Stepping through in the debugger I AM getting the _raw variants in the optimizer params (for both SGD and Adam), and it SHOULD be necessary to register the raw variants as Parameters (so I cannot explain your observations). To provide some framework for analysis, here is a description of exactly how it is intended to work during the forward and backward training passes:

Setup:

  1. Raw parameters and underlying RNN parameters are included in the overall model parameters (provided we register the raw parameters)
  2. Optimizer is initialized with the model parameters

Forward pass:

  1. Dropout mask is constructed (and preserved) as the forward pass goes through the DropConnect wrapper
  2. Raw parameters values are multiplied by the mask and the result (masked values) are copied into the underlying LSTMs weights tensor
  3. Forward pass through the underlying LSTM (which now has masked weights) occurs

Backward pass:

  1. Back hook is invoked and copies the gradients from the underlying LSTM weights and masks them according to the dropout mask, copying the result to the raw_parameter gradient
  2. Optimizer step updates the parameters according to the gradients. This will update the raw_parameters according to the copied gradients. It will also (but actually redundantly) update the underlying LSTM weights according to their (unmasked) gradients, but this is actually irrelevant (apart from optimizer performance, so we could improve by removing these from the model parameters reported to the optimizer) because on next forward pass we will anyway overwrite the LSTM weights with the raw weights

It is thus critical that the raw parameters are part of the optimized set. If they were not the expected behavior would be that we never learn anything, since the raw weights would not be updated and we'd continue to copy whatever value they were initialized with into the underlying LSTM on each forward pass.

The above analysis does highlight one subtle point, which is that any weight initialization you intend to apply to the LSTM needs to be applied BEFORE the LSTM is wrapped inside a WeightDrop wrapper (else you'll be initializing weights that end up not actually being used, and the effective initialization will be zeros). I also think I might have a bug in the gradient normalization: the mask produced by dropout is weighted (the non-zero elements carry a weight that normalizes the mean), but because I reuse the same mask to mask the gradients on the back-pass I'm probably double-counting that normalization (I'll need to look into that more).
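
To make the initialization caveat concrete, a hedged sketch (arbitrary sizes, not the repo's code):

import torch

lstm = torch.nn.LSTM(input_size=400, hidden_size=1150)

# Apply any custom initialization BEFORE wrapping; per the caveat above,
# anything written to lstm.weight_hh_l0 after the wrap is overwritten from
# the raw copy on the next training forward pass.
torch.nn.init.orthogonal_(lstm.weight_hh_l0)
wrapped = WeightDrop(lstm, ['weight_hh_l0'], dropout=0.5)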

Sorry I cannot explain your exact findings, but hopefully the above explanation will help your analysis of what is happening in your case.

NProkoptsev commented 5 years ago

Has anyone checked the fastai implementation for PyTorch 1.0? https://github.com/fastai/fastai/blob/master/fastai/text/models/awd_lstm.py

daemon commented 5 years ago

@NProkoptsev You probably already know this by now, but just for everyone else who sees this: the fastai implementation works for PyTorch 1.0.

DavidNemeskey commented 5 years ago

@daemon You are right, it works, but it cannot reproduce the numbers in the paper either. I think that boat has sailed with PyTorch 0.4; at least until someone does a full hyperparameter search for 1.0.