Mention optimizer and loss too.
@farizrahman4u optimizer=adam, loss='categorical_crossentropy'
why aren't you using masking @ishank26?
also, your l2 is insanely high. I usually use 1e-6 through 1e-8. https://github.com/braingineer/neural_tree_grammar/blob/master/fergus/configs/premade_confs/language_model.conf#L44 which is used at https://github.com/braingineer/neural_tree_grammar/blob/master/fergus/models/language_model/model.py#L84
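For what it's worth, a minimal sketch of those two points in Keras 1.x-style code (mask_zero on the Embedding, l2 in the 1e-6 range on the recurrent weights); the sizes and the random embedding matrix are placeholders, not values from this thread:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.regularizers import l2

# Placeholder sizes and a stand-in embedding matrix -- substitute your own.
vocab_size, embedding_dim, maxlen = 20000, 270, 50
embedding_matrix = np.random.randn(vocab_size, embedding_dim)

model = Sequential()
# mask_zero=True makes downstream layers skip timesteps padded with index 0
# (so index 0 must be reserved for padding in the vocabulary).
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen,
                    weights=[embedding_matrix], mask_zero=True))
# l2 in the 1e-6 to 1e-8 range suggested above, on both input and recurrent weights
model.add(LSTM(128, W_regularizer=l2(1e-6), U_regularizer=l2(1e-6)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```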
Have you tried first setting the embedding layer to not be trainable, training the higher layers for a few epochs, and then unfreezing the embeddings?
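A minimal sketch of that two-stage schedule in Keras 1.x-style code; the layer sizes, toy data, and epoch counts below are placeholders, not taken from the thread:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Placeholder sizes and toy data -- substitute your vocabulary, pretrained matrix and sequences.
vocab_size, embedding_dim, maxlen, n = 5000, 270, 50, 256
embedding_matrix = np.random.randn(vocab_size, embedding_dim)
X = np.random.randint(1, vocab_size, size=(n, maxlen))
y = np.zeros((n, vocab_size))
y[np.arange(n), np.random.randint(vocab_size, size=n)] = 1  # one-hot next-word targets

# Stage 1: keep the pretrained embeddings frozen while the randomly initialised layers settle.
embedding = Embedding(vocab_size, embedding_dim, input_length=maxlen,
                      weights=[embedding_matrix], trainable=False)
model = Sequential([embedding, LSTM(128), Dense(vocab_size, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, nb_epoch=3, batch_size=32)

# Stage 2: unfreeze the embeddings, recompile so the change takes effect, and keep training.
embedding.trainable = True
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, nb_epoch=10, batch_size=32)
```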
@braingineer I'll drop my l2 and get back to you. I have a few questions:
1) When should I use l1 in the context of RNNs? My understanding is that l1 acts as a feature selector and induces sparsity.
2) And when should I use U_regularizer ("instance of WeightRegularizer (eg. L1 or L2 regularization), applied to the recurrent weights matrices")? A sketch of where each argument attaches follows below.
Could you please clarify these doubts? Thanks
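For reference, in the Keras 1.x API the two arguments penalize different weight matrices of the recurrent layer; a hedged illustration (the coefficients here are arbitrary):

```python
from keras.layers import LSTM
from keras.regularizers import l1, l2

# W_regularizer penalizes the input-to-hidden weight matrices of the recurrent layer;
# U_regularizer penalizes the hidden-to-hidden (recurrent) weight matrices.
# l1 drives weights toward exact zeros (sparsity), l2 only shrinks them.
layer = LSTM(128,
             W_regularizer=l1(1e-6),   # coefficients are arbitrary, for illustration only
             U_regularizer=l2(1e-6))
```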
@jerheff Thanks for the suggestion. What is the intuition behind this type of procedure?
Edit after @carlthome's comment: I have used transfer learning/fine-tuning in CNNs, where only the bottom layers are trained while the top layers are kept non-trainable. However, I'm unable to understand why freezing the embedding layer and then unfreezing it after a few epochs would help. Correct me if I'm wrong, but AFAIK the embedding layer just gives a dense vector representation of the input words, and further training it fine-tunes these representations for my task.
@ishank26, freezing pretrained weights during transfer learning is very common; perhaps @jerheff had something like that in mind and didn't realize that your word2vec data is static when input to the sequence learner?
@carlthome Should I train with the embedding layer frozen? Does this make sense for my task?
@carlthome I could be missing something, but isn't the Embedding layer, as specified in the comment, learned in the model? It seems to me that attaching it directly to randomly initialized layers would be a bad idea until those layers settle down.
On the other hand, if the word2vec transform is happening outside of the model (and not learnable) then it is not something to discuss.
I must say that in my opinion using an Embedding layer always leads to over-fitting, simply because it DRAMATICALLY increases the number of free parameters to learn. Just suppose you have a 500K vocabulary with 100 floats per word: that is 50 million parameters in the embedding alone. In the other case, when you use a pretrained and FIXED word2vec representation, the number of free parameters is just the number of free parameters of your NN, which is often quite small.
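A quick way to check that comparison with Keras itself; the vocabulary and dimension follow the figures above, while the LSTM/Dense sizes and input_length are arbitrary placeholders:

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def free_param_count(freeze_embedding):
    """Free (trainable) parameters of a small LM-style net; the embedding sizes
    follow the 500K-vocabulary / 100-float example above, the rest are placeholders."""
    m = Sequential()
    m.add(Embedding(500000, 100, input_length=50, trainable=not freeze_embedding))
    m.add(LSTM(64))
    m.add(Dense(1000, activation='softmax'))
    return sum(K.count_params(w) for w in m.trainable_weights)

print(free_param_count(freeze_embedding=False))  # 50M embedding weights + the LSTM/Dense weights
print(free_param_count(freeze_embedding=True))   # only the LSTM/Dense weights stay free
```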
Hi, I'm working on a text prediction task using my pretrained word embeddings. The current model is severely overfitting, with increasing validation loss. I need your advice on how to mitigate this and train the network properly. I'm using the embedding weights in the input layer. The embeddings were trained on a larger corpus; the model's training corpus is a subset of it. For out-of-vocabulary words I'm using glorot_uniform(270,) to get random embeddings.
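For context, a hedged sketch of how that kind of embedding matrix might be assembled; the word_index dict, the w2v lookup, and the Glorot-style uniform range for OOV words are placeholders and approximations, not the actual code used here:

```python
import numpy as np

def build_embedding_matrix(word_index, w2v, embedding_dim=270):
    """Assemble an Embedding weight matrix from pretrained vectors, drawing
    Glorot-uniform-style random vectors for out-of-vocabulary words.
    word_index: {word: integer index >= 1}; w2v: {word: vector of length embedding_dim}."""
    limit = np.sqrt(6.0 / (embedding_dim + embedding_dim))   # rough Glorot-uniform range
    matrix = np.zeros((len(word_index) + 1, embedding_dim))  # row 0 left free for padding/masking
    for word, idx in word_index.items():
        if word in w2v:
            matrix[idx] = w2v[word]                                          # pretrained vector
        else:
            matrix[idx] = np.random.uniform(-limit, limit, embedding_dim)    # OOV fallback
    return matrix

# The result is what gets passed as weights=[matrix] to the input Embedding layer.
```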
Q1- Is it because my network has far more parameters than training examples?
Q2- Am I using the embeddings correctly? Is there a problem with my embedding weights?
Q3- What else can I try?
Things I've tried:
My model
@braingineer @farizrahman4u @carlthome your two cents??
Need advice. Thanks!
Edit: optimizer=adam, loss='categorical_crossentropy'