tlatkowski / multihead-siamese-nets

Implementation of Siamese Neural Networks built upon multihead attention mechanism for text semantic similarity task.
MIT License
182 stars 43 forks

Overfitting with CNN model? #13

Open datistiquo opened 4 years ago

datistiquo commented 4 years ago

Hey,

I tried the CNN model on my own data and I don't understand what is going on. I really hope you can give me some advice.

I use the model for sentence matching in IR. I get good results on the training data, but for out-of-scope inputs I get very high confidences for unrelated sentences. Even for an empty string I get a confidence of 1 for several sentences!

I don't have much data, so I use augmentation. Do you have a recipe for the augmentation?

Thank you!

tlatkowski commented 4 years ago

Hi @datistiquo,

Sorry for the late response. Have you tried any regularization techniques? And have you seen overfitting only with CNNs?

Looking at the model configuration, you can see that dropout is disabled by default for CNNs. During the implementation I was not sure whether dropout is a good regularization technique for this kind of model (siamese nets), so I disabled it by default. It is also possible that dropout is useful only for specific layers, but I haven't investigated that.

The second important thing that comes to my mind is the maximum length of the training sequences. Imagine a situation where you have a small training dataset and one or a few sentences are very long, say 50 tokens, while the rest (including the test sentences) are short. In this case the short sentences are padded with a lot of placeholder tokens, and that padding can become a strong signal in the final decision. This area is also worth investigating.
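
To check whether padding dominates the inputs, one option is to measure how many padded positions hold placeholder tokens. The snippet below is a minimal sketch, not code from this repository: `padding_report` and the toy corpus are hypothetical, and it assumes sentences are already tokenized into lists.

```python
from statistics import median


def padding_report(sentences, max_len=None):
    """Summarize how much of the padded input would be placeholder tokens.

    `sentences` is a list of token lists. `max_len` defaults to the longest
    sentence, which mimics padding everything to the longest training example.
    """
    lengths = [len(tokens) for tokens in sentences]
    if max_len is None:
        max_len = max(lengths)
    # Sentences longer than max_len would be truncated, so cap their length.
    real_tokens = sum(min(n, max_len) for n in lengths)
    total_slots = max_len * len(lengths)
    return {
        "max_len": max_len,
        "median_len": median(lengths),
        "padding_fraction": 1.0 - real_tokens / total_slots,
    }


# Hypothetical toy corpus: nine 5-token sentences and one 50-token outlier.
corpus = [["tok"] * 5 for _ in range(9)] + [["tok"] * 50]

report = padding_report(corpus)              # pad to the longest sentence
capped = padding_report(corpus, max_len=10)  # truncate and pad to 10 tokens

print(report)
print(capped)
```

If the padding fraction is high, capping the maximum length near a high percentile of the length distribution (and truncating the few outliers) is one thing to try.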

I hope this helps. BR, Tomasz

datistiquo commented 4 years ago

I will check this.

I also think that the margin plays a huge role with contrastive loss.
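
To illustrate why the margin matters, here is a minimal sketch of one common form of contrastive loss for a single pair. It is an assumption that this matches the convention used in any particular implementation (some variants add a factor of 1/2 or flip the label); `contrastive_loss` is a hypothetical helper.

```python
def contrastive_loss(distance, is_similar, margin=1.0):
    """One common per-pair form of contrastive loss.

    Similar pairs are pulled together (loss grows with distance).
    Dissimilar pairs are only pushed apart until they reach `margin`;
    beyond the margin they contribute no loss at all.
    """
    if is_similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# A dissimilar pair at distance 0.8:
loss_small_margin = contrastive_loss(0.8, is_similar=False, margin=0.5)
loss_large_margin = contrastive_loss(0.8, is_similar=False, margin=1.0)

print(loss_small_margin)  # zero: the pair is already beyond the margin
print(loss_large_margin)  # positive: the pair is still inside the margin
```

With a small margin, unrelated pairs stop producing gradient while they are still fairly close, which could be one reason unrelated sentences end up with high similarity.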

Actually, did you normalize your word vectors before feeding them in? Maybe that is my issue too, since I have not normalized mine. I might try this out.
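
For reference, L2-normalizing a word vector could look like the sketch below. It assumes vectors are plain lists of floats; `l2_normalize` is a hypothetical helper, and leaving all-zero vectors untouched (for example a padding embedding) is a design choice, not something taken from this repository.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length.

    All-zero vectors (norm below `eps`) are returned unchanged to avoid
    dividing by zero.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm < eps:
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
zero = l2_normalize([0.0, 0.0])

print(unit)
print(zero)
```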

Frank-Sin99 commented 4 years ago

Right now I use a simple MSE or a simple contrastive loss. But I feel that I need a pairwise, triplet, or even listwise loss to do better?
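
For comparison, a triplet loss works on (anchor, positive, negative) triples rather than labeled pairs. The following is a minimal sketch assuming Euclidean distance between sentence embeddings given as lists of floats; `euclidean`, `triplet_loss`, and the example vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

satisfied = triplet_loss(anchor, positive=near, negative=far)
violated = triplet_loss(anchor, positive=far, negative=near)

print(satisfied)
print(violated)
```

Because the loss depends only on the relative ordering of the positive and the negative, it is closer to a ranking objective than a per-pair MSE.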

Also, my evaluation metric is just precision, but I think a ranking metric like precision at k is more reasonable for IR!
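
Precision at k for a single query can be computed as in this minimal sketch. It assumes the model's output has already been turned into a ranked list of candidate IDs; `precision_at_k` and the IDs are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked candidates that are relevant.

    Divides by k even when fewer than k candidates were returned, which is
    the usual convention for precision at k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking for one query, best match first.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

p_at_1 = precision_at_k(ranked, relevant, k=1)
p_at_2 = precision_at_k(ranked, relevant, k=2)
p_at_4 = precision_at_k(ranked, relevant, k=4)

print(p_at_1, p_at_2, p_at_4)
```

Averaging this value over all evaluation queries gives a single ranking score for the model.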

Frank-Sin99 commented 4 years ago

Hey @tlatkowski, why does your CNN network use just the distance as output? Have you tried feeding the distance to a sigmoid layer? Or using a sigmoid layer directly instead of the distance?
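
To make the idea concrete, feeding the distance to a sigmoid could look like the sketch below. This is only an illustration of the question, not how the repository computes its output; `distance_to_probability`, `weight`, and `bias` are hypothetical, and in a real model the two parameters would be learned.

```python
import math


def distance_to_probability(distance, weight=1.0, bias=0.0):
    """Map a non-negative distance to a match probability in (0, 1).

    Computes sigmoid(bias - weight * distance). With a positive weight,
    larger distances give lower probabilities. The bias sets the
    probability at distance zero.
    """
    return 1.0 / (1.0 + math.exp(weight * distance - bias))


p_zero = distance_to_probability(0.0)                      # 0.5 when bias is 0
p_close = distance_to_probability(0.2, weight=4.0, bias=2.0)
p_far = distance_to_probability(2.0, weight=4.0, bias=2.0)

print(p_zero, p_close, p_far)
```

Note that with `bias=0.0` a distance of zero maps to 0.5, not 1, so a learned bias is what lets identical sentences reach a probability close to 1.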