openai / generating-reviews-discovering-sentiment

Code for "Learning to Generate Reviews and Discovering Sentiment"
https://arxiv.org/abs/1704.01444
MIT License

Weight Regularization/hidden state clipping parameters #39

Closed raulpuric closed 6 years ago

raulpuric commented 6 years ago

Are there any plans to release the hyperparameters that were used to regularize the training process?

I've been trying to retrain these weights on Amazon reviews and a different dataset using guillitte's implementation, as suggested in this repo's README; however, because of the multiplicative nature of the mLSTM, the weights tend to overfit and reach very high norms. The input->hidden weights tend to be fine and keep roughly constant values throughout, but the hidden->hidden weights continually grow in norm over the course of training as the model unearths the patterns of the training corpus.
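For context, here's a minimal sketch of the mLSTM cell I mean (after Krause et al.; the layer names are my own and the exact gate layout may differ from this repo's model), just to show why the hidden->hidden norm matters so much:

```python
import torch
import torch.nn as nn

class MLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Gate projections: input -> gates and (multiplicative term) -> gates.
        self.wx = nn.Linear(input_size, 4 * hidden_size)
        self.wm = nn.Linear(hidden_size, 4 * hidden_size)
        # Multiplicative interaction: the input and the previous hidden state
        # are each projected, then multiplied elementwise.
        self.wmx = nn.Linear(input_size, hidden_size)
        self.wmh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, state):
        h, c = state
        # Any growth in wmh's norm is fed back through this product at every
        # step, which is what makes the hidden->hidden weights explosion-prone.
        m = self.wmx(x) * self.wmh(h)
        i, f, o, g = (self.wx(x) + self.wm(m)).chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```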

This is problematic in scenarios where a rare character or sequence of characters, such as a Finnish name with UTF-8 accents/diaereses (e.g. Väinämö), comes up frequently in otherwise English text. If several of these names appear in a batch, it causes massive gradient spikes and can lead to gradient explosion in the network; even if the gradients recover, the net can't get back to its previous performance level if the spike pushes the weights too far from their local optimum.

Obviously I could make an effort to preprocess/drop this data and clip activation outputs and their associated gradients (and I have; a sketch of what I mean is below), but it's inconvenient to rely on data processing and hope I've thought of every possible data transformation, or to have to extensively tune clipping hyperparameters.
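Roughly the kind of guards I've been adding (the clip values here are my own guesses, not the hyperparameters used for the released weights):

```python
import torch

GRAD_CLIP = 1.0     # max global gradient norm (illustrative value)
HIDDEN_CLIP = 10.0  # clamp carried-over hidden/cell state (illustrative value)

def guarded_step(model, loss, optimizer, hidden):
    optimizer.zero_grad()
    loss.backward()
    # Blunt the gradient spikes caused by batches full of rare characters.
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimizer.step()
    # Clip the recurrent state carried into the next batch so a single spike
    # can't push the activations far outside their usual range.
    h, c = hidden
    return (h.detach().clamp(-HIDDEN_CLIP, HIDDEN_CLIP),
            c.detach().clamp(-HIDDEN_CLIP, HIDDEN_CLIP))
```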

After extensive testing, this explosion doesn't happen with an LSTM model (since its state update is additive), even though the LSTM doesn't do as well without preprocessed data.

TL;DR Please release the regularization hyperparameters. The network is so prone to overfitting and training instability that I can't even guarantee a stable training run on Amazon reviews (even with your saved weights as initialization); it's roughly 1 failed run for every 5 that succeed. An LSTM model does worse, but doesn't have these training instabilities.

UPDATE: Never mind, I see that weight normalization is mentioned in the paper, even though it isn't used in the PyTorch implementation.
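For anyone else who lands here, adding it in PyTorch is a one-liner per layer (a sketch; I'm guessing at which matrices it was applied to, and 4096 is just the hidden size of the released model):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# weight_norm reparameterizes W as g * v / ||v||, so the norm of each row is a
# separately learned scalar rather than something that can grow unchecked.
wmh = weight_norm(nn.Linear(4096, 4096, bias=False))
y = wmh(torch.randn(8, 4096))  # forward pass uses the reparameterized weight
```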