tuzhucheng / MP-CNN-Variants

Variants of Multi-Perspective Convolutional Neural Networks

How to get similarity score with 2 sentences test #2

Open briancannon opened 6 years ago

briancannon commented 6 years ago

The model's output is a torch.cuda.FloatTensor. How can I get the actual similarity score between two sentences?

tuzhucheng commented 6 years ago

Check out this line: https://github.com/tuzhucheng/MP-CNN-Variants/blob/002db7a77bf780b531cf68945c1f073d45185f04/evaluators/sick_evaluator.py#L33
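That line converts the model's log-probability output over the five SICK relatedness classes into a single expected score. Here is a minimal sketch of the conversion; the dummy `output` below just stands in for the model's forward pass so the snippet runs on its own:

```python
import torch
import torch.nn.functional as F

# In practice `output` is the model's forward pass for a batch of sentence
# pairs: log-probabilities over the 5 SICK relatedness classes
# (shape: batch_size x 5). Random logits stand in here so the snippet
# runs on its own.
batch_size = 4
output = F.log_softmax(torch.randn(batch_size, 5), dim=1)

# Class values 1..5 correspond to the SICK relatedness scale.
predict_classes = torch.arange(1, 6, dtype=torch.float)

# Expected score = sum over classes of (class value * probability),
# which is what the linked evaluator line computes.
scores = (predict_classes * output.exp()).sum(dim=1)

print(scores.tolist())  # one real-valued similarity score per sentence pair
```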

briancannon commented 6 years ago

I tried it out and got the score. But when I split the test data set into smaller sets (64 sentence pairs each) and evaluate each of them, I get different results:

INFO - Evaluation metrics for test
INFO - pearson_r  spearman_r  KL-divergence loss
INFO - test  0.587159  0.65102088053  1.398514747619629

INFO - Evaluation metrics for test
INFO - pearson_r  spearman_r  KL-divergence loss
INFO - test  -0.0634823  -0.0976152631988  1.9832178354263306

INFO - Evaluation metrics for test
INFO - pearson_r  spearman_r  KL-divergence loss
INFO - test  0.680005  0.517980672901  1.0506935119628906

Why is that? Is the model correct?

tuzhucheng commented 6 years ago

You mean evaluating each batch of test set sentences consisting of 64 sentence pairs?

briancannon commented 6 years ago

Yes. I just want to evaluate on different, smaller test data sets; the order doesn't matter.

tuzhucheng commented 6 years ago

You showed three different sets of "Evaluation metrics for test". I'm guessing you are wondering why the results differ so much.

Do you mind explaining what you did to get the pearson_r, spearman_r, etc. for those three sets of data?

briancannon commented 6 years ago

You're right, and that's why I'm wondering. The test data set has more than 4000 sentence pairs. I tried evaluating on 3 smaller data sets, each with 64 sentence pairs, and got different pearson_r and spearman_r results.

Could you explain this to me? Thanks.

tuzhucheng commented 6 years ago

How many epochs did you train for?

If the model is not trained very well (high bias on the training set), then we can expect poor results on the smaller test sets, and those results will vary wildly because of the variation across the small test sets you created. However, once the model is trained properly (low bias on the training and dev sets), I think you can expect better test set metrics and more consistent performance across different test sets. Note that the hyperparameters also play an extremely important role in getting the model trained well.
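To illustrate the sampling-variance point with purely synthetic numbers (not from this model): Pearson's r computed on 64-pair slices of a ~4000-pair test set can swing quite a bit even when the full-set value is stable. A quick sketch:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic "gold" scores and noisy "predictions" for ~4000 sentence pairs;
# the noise level is arbitrary and only meant to mimic an imperfect model.
n = 4000
gold = rng.uniform(1, 5, size=n)
pred = gold + rng.normal(0, 1.0, size=n)

print("full test set: r =", pearsonr(gold, pred)[0])

# Pearson r on disjoint 64-pair chunks (like `split -l 64`) fluctuates a lot.
for i in range(3):
    chunk = slice(i * 64, (i + 1) * 64)
    print(f"chunk {i}: r =", pearsonr(gold[chunk], pred[chunk])[0])
```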

briancannon commented 6 years ago

I trained with:

python main.py mpcnn.sick.model --dataset sick --epochs 19 --epsilon 1e-7 --dropout 0

And got (full test data set):

INFO - Evaluation metrics for test
INFO - pearson_r  spearman_r  KL-divergence loss
INFO - test  0.867389  0.808621796372  0.46649816802241434

You can use split -l 64 a.txt split_a.txt, then randomly select one of the resulting files to evaluate and look at the result. I tried this because when printing predictions.append((predict_classes * output.data.exp()).sum(dim=1)), I found the similarity scores were quite different from the expected results.

tuzhucheng commented 6 years ago

Hmm, sorry I missed the notification.

Doing some error analysis is on my TODO list.