salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch
BSD 3-Clause "New" or "Revised" License

Confused regarding motivation of randomized BPTT #33

Closed. PetrochukM closed this issue 6 years ago.

PetrochukM commented 6 years ago

Why does this exist?

bptt = args.bptt if np.random.random() < 0.95 else args.bptt / 2.

Y'all already have...

seq_len = max(5, int(np.random.normal(bptt, 5)))

ccarter-cs commented 6 years ago

This is done to help with spreading the start/stop positions of a token such that it ends up in different positions relative to the BPTT window each epoch. See "4.1 Variable length backpropagation sequences" Merity et al. 2017. https://arxiv.org/abs/1708.02182

The max is to prevent < 5 sequence lengths.
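For reference, here is a minimal runnable sketch (not the repository's exact training loop; names such as next_seq_len, args_bptt, and num_steps are stand-ins chosen for illustration) of how the two draws combine so that successive passes over the data slice the corpus at different offsets:

    import numpy as np

    args_bptt = 70  # stand-in for args.bptt

    def next_seq_len(base_bptt):
        # 95% of the time keep the full BPTT window, 5% of the time halve it
        bptt = base_bptt if np.random.random() < 0.95 else base_bptt / 2.
        # jitter the chosen window with a normal draw, never going below 5
        return max(5, int(np.random.normal(bptt, 5)))

    num_steps = 10_000  # pretend length of the batched corpus
    for epoch in range(2):
        i, starts = 0, []
        while i < num_steps - 2:
            seq_len = next_seq_len(args_bptt)
            starts.append(i)
            i += seq_len
        print(f"epoch {epoch}: first window starts -> {starts[:5]}")

Because each window's length is random, the positions where later windows begin differ between epochs, which is the "spreading" described above.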

PetrochukM commented 6 years ago

Spreading the BPTT length with some distribution makes sense. It's not clear why you need to employ both distributions on seq_len.

The first code snippet is supported by this statement:

"We use a random BPTT length which is N(70, 5) with probability 0.95 and N(35, 5) with probability 0.05."

The second code snippet is not supported in the paper.

ccarter-cs commented 6 years ago

I'm confused, so I'm going to try to describe the code as I read it, in pseudocode.

let args.bptt = 70
for each minibatch:
    let bptt = 70 95% of the time and 35 5% of the time
    let seq_len = the max of 5 and one random draw from N(bptt, 5)
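As a quick sanity check of that reading (an illustrative sketch, not code from the repository), sampling the combined draw shows the bimodal shape: roughly 95% of lengths cluster near 70 and roughly 5% near 35, each with a standard deviation of about 5.

    import numpy as np

    np.random.seed(0)
    base_bptt = 70
    draws = np.array([
        max(5, int(np.random.normal(base_bptt if np.random.random() < 0.95 else base_bptt / 2., 5)))
        for _ in range(100_000)
    ])
    print("mean seq_len:", draws.mean())                     # around 68
    print("share of draws above 50:", (draws > 50).mean())   # around 0.95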

PetrochukM commented 6 years ago

That is similar to how I read it.

It seems a bit arbitrary to pick the 95/5% split and THEN, on top of it, do a normal-distribution random draw. That's a very specific combination of distributions!

Do you know the thinking here?

Smerity commented 6 years ago

Sorry for the slow reply :) I changed the title from "Confused" as I wanted it to be more informative - hopefully that's alright.

Don't read too much into the selected values (other than reproducibility) - I primarily want to highlight the importance of different starting points for batches rather than them remaining static due to a fixed BPTT value.

To explain the reasoning, imagine you were training the enwik8 model, where the BPTT value is 200. If we only "fuzz" that value with normal(200, 5) as we walk along the data, the window starting points will almost always end up near the same positions from epoch to epoch, as the deviation introduced by the "fuzz" has zero mean.

We could increase that normal "fuzz" from 5 to 50, but then the size of each batch would not be dependable and the model may not take full advantage of the GPU.

As such, we can stack two "fuzzes" on top of each other: a small variation (normal(0, 5)) and a larger one (occasionally using a far smaller window, which may not use the GPU effectively but is reasonably infrequent).

At least for the problems and BPTT values I'm interested in, the above tactic fairly strongly encourages diversity of starting points and prevents the walk from staying close to the original starting positions, whilst also ensuring the GPU has strong utilization.
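To make that concrete, here is a small illustrative simulation (not part of the repository; start_of_kth_window is a name invented here) comparing how far the start of a later window drifts under the plain normal(bptt, 5) fuzz versus the stacked fuzz with the occasional halving:

    import numpy as np

    np.random.seed(0)
    bptt = 200  # the enwik8-style BPTT from the example above

    def start_of_kth_window(k, stacked):
        # position at which the k-th BPTT window begins in one pass over the data
        pos = 0
        for _ in range(k):
            base = bptt if (not stacked or np.random.random() < 0.95) else bptt / 2.
            pos += max(5, int(np.random.normal(base, 5)))
        return pos

    for stacked in (False, True):
        starts = [start_of_kth_window(50, stacked) for _ in range(2000)]
        label = "normal fuzz + 5% halving" if stacked else "normal(bptt, 5) fuzz only"
        print(f"{label}: std of the 50th window's start is about {np.std(starts):.0f} tokens")

With the single fuzz the deviation of the k-th start only grows like 5 * sqrt(k) tokens, which stays small relative to a 200-token window, whereas each halving event shifts every subsequent start by roughly 100 tokens, so the stacked version spreads the start positions far more.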

Also, pro-tip: if you hit random out-of-memory issues, the line below these two caps the BPTT to prevent the occasional larger BPTT when normal(0, 5) rolls a large number =]
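For anyone reading along in the training script, that cap might look roughly like the following (an illustrative sketch; the exact line and constant in the repository may differ):

    import numpy as np

    args_bptt = 70  # stand-in for args.bptt
    bptt = args_bptt if np.random.random() < 0.95 else args_bptt / 2.
    seq_len = max(5, int(np.random.normal(bptt, 5)))
    # cap the draw so a large roll of the normal fuzz cannot produce a window
    # long enough to run the GPU out of memory (the actual cap may differ)
    seq_len = min(seq_len, args_bptt + 10)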

keskarnitish commented 6 years ago

Closing this issue now, please feel free to re-open if we didn't sufficiently answer your questions.

PetrochukM commented 6 years ago

@Smerity This was very helpful to read. I get it now. It's a trade-off between "starting points" and GPU utilization. To the naive eye, it looked like two distributions stacked on top of each other that could be replaced by something like increasing the "fuzz" from 5 to 50.

@keskarnitish Thanks for following up on this.