keras-team / keras


Reducing overfitting with embedding weights?? #4170

Closed ishank26 closed 7 years ago

ishank26 commented 8 years ago

Hi, I'm working on a text prediction task using my pretrained word embeddings. The current model is severely overfitting, with the validation loss increasing. I need your advice on how to mitigate this and train the network properly. I'm using the embedding weights in the input layer. The embeddings were trained on a larger corpus; the model's training corpus is a subset of it. For out-of-vocabulary words I'm using glorot_uniform(270,) to get random embeddings.
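
Roughly, the embedding matrix is built like this (simplified sketch; w2v_model and vocab are placeholder names for the pretrained word2vec model and the combined corpus vocabulary):

import numpy as np

w2v_dim = 270
limit = np.sqrt(6.0 / (w2v_dim + w2v_dim))  # glorot_uniform-style bound for a 270-dim vector

embed_weight = np.zeros((len(vocab), w2v_dim), dtype='float32')
for i, word in enumerate(vocab):
    if word in w2v_model:
        embed_weight[i] = w2v_model[word]                            # pretrained vector
    else:
        embed_weight[i] = np.random.uniform(-limit, limit, w2v_dim)  # random vector for OOV word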

Q1- Is it because the number of network parameters is far greater than the training set size?

Q2- Am I using the embeddings correctly? Is there a problem with my embedding weights?

Q3- What else can I try?

Things I've tried:

Corpus size: 46265, Corpus vocab size: 22120
Out of vocab words:  15248
Vocab size of Word2vec model + oov words: 34444
Train corpus size:  46169
Test corpus size:  96
X_train.shape:  (46137, 32)
y_train.shape:  (46137, 22120)
embed_weight.shape:  (34444, 270)

# Keras 1.x API (init / W_regularizer arguments), as used throughout this thread
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import l2

w2v_dim = 270
seq_length = 32
corpus_vocab_size = 22120  # number of unique words in the corpus
memory_units = 128
l2_emb = l2_lstm = l2_dense = 0.7  # regularization strengths (logged below)

model = Sequential()
model.add(Embedding(embed_weight.shape[0], embed_weight.shape[1], mask_zero=False, weights=[embed_weight], input_length=seq_length, W_regularizer=l2(l2_emb)))
model.add(LSTM(memory_units, return_sequences=False, init="orthogonal", W_regularizer=l2(l2_lstm)))
model.add(Dropout(0.5))
model.add(Dense(corpus_vocab_size, activation='softmax', init="orthogonal", W_regularizer=l2(l2_dense)))

Compiling Model
l2_emb:  0.7  l2_lstm:  0.7  l2_dense:  0.7  Dropout:  0.5
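
The compile step itself isn't shown above; per the edit at the bottom of this post, it is roughly:

from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=0.001),  # lr matches the values logged below
              loss='categorical_crossentropy',
              metrics=['accuracy'])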

Fitting model
Train on 36909 samples, validate on 9228 samples
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 1/40
36909/36909 [==============================] - 135s - loss: 300351.6275 - acc: 0.0223 - val_loss: 9.8617 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 2/40
36909/36909 [==============================] - 134s - loss: 35422.2196 - acc: 0.0231 - val_loss: 9.9594 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 3/40
36909/36909 [==============================] - 135s - loss: 3996.1297 - acc: 0.0231 - val_loss: 10.1249 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
Epoch 4/40
36909/36909 [==============================] - 134s - loss: 254.4328 - acc: 0.0229 - val_loss: 10.3708 - val_acc: 0.0237
('lr:', array(0.0010000000474974513, dtype=float32))
.
.
.
Epoch 38/40
36909/36909 [==============================] - 134s - loss: 9.0427 - acc: 0.0231 - val_loss: 11.7586 - val_acc: 0.0237
('lr:', array(0.0005000000237487257, dtype=float32))
Epoch 39/40
36909/36909 [==============================] - 134s - loss: 9.0425 - acc: 0.0231 - val_loss: 11.7566 - val_acc: 0.0237
('lr:', array(0.0005000000237487257, dtype=float32))
Epoch 40/40
36909/36909 [==============================] - 134s - loss: 9.0423 - acc: 0.0231 - val_loss: 11.7524 - val_acc: 0.0237

@braingineer @farizrahman4u @carlthome your two cents??

Need advice. Thanks!

Edit: optimizer=adam, loss='categorical_crossentropy'

farizrahman4u commented 8 years ago

Mention optimizer and loss too.

ishank26 commented 8 years ago

@farizrahman4u optimizer=adam, loss='categorical_crossentropy'

braingineer commented 8 years ago

why aren't you using masking @ishank26?

braingineer commented 8 years ago

also, your l2 is insanely high. I usually use 1e-6 through 1e-8. https://github.com/braingineer/neural_tree_grammar/blob/master/fergus/configs/premade_confs/language_model.conf#L44 which is used at https://github.com/braingineer/neural_tree_grammar/blob/master/fergus/models/language_model/model.py#L84
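
e.g. the same stack rebuilt with values in that ballpark (illustrative only, and with masking switched on as suggested above, which requires reserving index 0 for padding):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import l2

l2_emb = l2_lstm = l2_dense = 1e-6   # orders of magnitude weaker than 0.7

model = Sequential()
model.add(Embedding(embed_weight.shape[0], embed_weight.shape[1],
                    weights=[embed_weight], input_length=seq_length,
                    mask_zero=True,  # index 0 must be reserved for padding
                    W_regularizer=l2(l2_emb)))
model.add(LSTM(memory_units, return_sequences=False, init="orthogonal",
               W_regularizer=l2(l2_lstm)))
model.add(Dropout(0.5))
model.add(Dense(corpus_vocab_size, activation='softmax', init="orthogonal",
                W_regularizer=l2(l2_dense)))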

jerheff commented 8 years ago

Have you tried first setting the embedding layer to not be trainable, training the higher layers for a few epochs, and then unfreezing the embeddings?
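
Something along these lines (a rough sketch in the Keras 1.x API, assuming the Embedding is model.layers[0]; the epoch counts are arbitrary):

# Phase 1: freeze the pretrained embeddings while the LSTM/Dense layers settle down
model.layers[0].trainable = False
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_split=0.2, nb_epoch=3)

# Phase 2: unfreeze the embeddings and recompile (needed for the change to take effect),
# then continue training end to end
model.layers[0].trainable = True
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_split=0.2, nb_epoch=10)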

ishank26 commented 8 years ago

@braingineer I'll drop my l2 and report back. I have a few questions:

1) When should l1 be used in the context of RNNs? My understanding is that l1 acts as a feature selector and induces sparsity.
2) When should U_regularizer be used? (an instance of WeightRegularizer, e.g. L1 or L2 regularization, applied to the recurrent weight matrices)

Could you please clarify these doubts? Thanks
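
For reference, this is the usage I'm asking about (Keras 1.x recurrent API):

from keras.regularizers import l2

model.add(LSTM(memory_units,
               return_sequences=False,
               init="orthogonal",
               W_regularizer=l2(1e-6),    # penalizes the input-to-hidden weights
               U_regularizer=l2(1e-6)))   # penalizes the recurrent (hidden-to-hidden) weights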

ishank26 commented 8 years ago

@jerheff Thanks for the suggestion. What is the intuition behind this procedure?

Edit after @carlthome's comment: I have used transfer learning/fine-tuning with CNNs, where only the bottom layers are trained and the top layers are kept non-trainable. However, I'm unable to understand why freezing the embedding layer and then unfreezing it after a few epochs will help. Correct me if I'm wrong, but AFAIK the embedding layer just gives a dense vector representation of the input words, and further training it will fine-tune these representations for my task.

carlthome commented 8 years ago

@ishank26, freezing pretrained weights during transfer learning is very common; perhaps @jerheff was thinking of something like that and didn't realize that your word2vec data is static when it is input to the sequence learner?

ishank26 commented 8 years ago

@carlthome Should I train with the embedding layer frozen? Does this make sense for my task?

jerheff commented 8 years ago

@carlthome I could be missing something, but isn't the Embedding layer, as specified in the comment, learned in the model? It seems to me that attaching it directly to randomly initialized layers would be a bad idea until those layers settle down.

On the other hand, if the word2vec transform is happening outside of the model (and is not learnable), then it is not something to discuss.

MaratZakirov commented 7 years ago

I must say that, in my opinion, using a trainable Embedding layer always leads to overfitting, simply because the Embedding DRAMATICALLY increases the number of free parameters to learn. Just suppose you have a 500K-word vocabulary with 100 floats for each word. In the other case, when you use a pretrained and FIXED word2vec representation, the number of free parameters is just equal to the number of free parameters of your NN, which is often quite small.
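
Rough arithmetic behind that claim:

vocab_size, embed_dim = 500000, 100
trainable_embedding_params = vocab_size * embed_dim   # 50,000,000 extra free parameters

# With a pretrained and FIXED (trainable=False) embedding, those 50M weights are
# excluded from training, leaving only the comparatively small LSTM/Dense weights.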