briancannon opened this issue 7 years ago
I tried it out and got the score. But when I split the test data set into smaller sets (64 sentence pairs each) and evaluate each of them, I get different results:
INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.587159 0.65102088053 1.398514747619629
INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test -0.0634823 -0.0976152631988 1.9832178354263306
INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.680005 0.517980672901 1.0506935119628906
Why is that? Is the model correct?
You mean evaluating batches of the test set, each consisting of 64 sentence pairs?
Yes. I just want to evaluate with different, smaller test data sets, not in any particular order or anything like that.
You showed three different sets of "Evaluation metrics for test". I'm guessing you are wondering why the results differ so much.
Do you mind explaining what you did to get the pearson_r, spearman_r, etc. for those three sets of data?
You're right, that's what I'm wondering about. The test data set has more than 4,000 sentence pairs. I tried to evaluate with 3 smaller data sets, each containing 64 sentence pairs, and got different pearson_r and spearman_r results.
Could you explain this to me? Thanks.
How many epochs did you train for?
If the model is not trained very well (high bias on the training set), then we can expect poor results on the smaller test sets. They vary wildly because there is variation across the different small test sets you created. However, once the model is trained properly (low bias on the training and dev sets), I think you can expect better test-set metrics and more consistent performance across different test sets. Note that for the model to be trained well, the hyperparameters also play an extremely important role.
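To make the sampling-variance point concrete, here is a minimal sketch (not from this repo; the gold scores and predictions are simulated) showing how much Pearson and Spearman correlations can swing when they are computed on random 64-pair subsets of a roughly 4,000-pair test set, even with a fixed model:

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

n = 4000                                  # roughly the size of the SICK test set
gold = rng.uniform(1, 5, size=n)          # gold relatedness scores in [1, 5]
pred = gold + rng.normal(0, 0.7, size=n)  # stand-in for model predictions

print("full set: pearson=%.3f spearman=%.3f"
      % (pearsonr(gold, pred)[0], spearmanr(gold, pred)[0]))

# The same metrics on random 64-pair subsets fluctuate much more.
for i in range(3):
    idx = rng.choice(n, size=64, replace=False)
    print("subset %d: pearson=%.3f spearman=%.3f"
          % (i, pearsonr(gold[idx], pred[idx])[0], spearmanr(gold[idx], pred[idx])[0]))

The point is only that 64 pairs is a small sample for estimating a correlation; a poorly trained model makes the swings worse, but they never disappear entirely.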
I trained with:
python main.py mpcnn.sick.model --dataset sick --epochs 19 --epsilon 1e-7 --dropout 0
And got (full test data set):
INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.867389 0.808621796372 0.46649816802241434
You can use split -l 64 a.txt split_a.txt to break the test file into 64-line chunks, then randomly select one of the resulting files to evaluate and see the result.
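The same idea in Python, as a minimal sketch (file names here are hypothetical; if the test set is stored across several parallel files, e.g. sentences and similarity scores in separate files, the same chunk indices would need to be applied to each of them):

import random

with open("a.txt") as f:
    lines = f.readlines()

# Break the file into consecutive 64-line chunks and pick one at random.
chunks = [lines[i:i + 64] for i in range(0, len(lines), 64)]
chunk = random.choice(chunks)

with open("split_a_random.txt", "w") as f:
    f.writelines(chunk)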
I tried this because when printing predictions.append((predict_classes * output.data.exp()).sum(dim=1)), I found the similarity scores were quite different from the expected results.
Hmm, sorry I missed the notification.
Doing some error analysis is on my TODO list.
The model's output is a torch.cuda.FloatTensor. How can I get the real score between 2 sentences?
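For that last question, here is a minimal sketch of one way to do it, assuming (as the predictions.append((predict_classes * output.data.exp()).sum(dim=1)) line above suggests) that the model outputs log-probabilities over discrete relatedness classes 1 to 5; the real-valued score is then the expected class value, and .cpu() plus .tolist() (or .item() for a single pair) turns the CUDA tensor into plain Python numbers. The shapes and class range below are assumptions, not taken from the repo:

import torch
import torch.nn.functional as F

num_classes = 5  # SICK relatedness classes 1..5 (assumption)

# Stand-in for a real batch of model outputs: shape (batch_size, num_classes),
# holding log-probabilities over the relatedness classes.
output = F.log_softmax(torch.randn(3, num_classes), dim=1)

predict_classes = torch.arange(1, num_classes + 1, dtype=torch.float)

# Expected relatedness per sentence pair: sum over k of k * P(class = k).
scores = (output.exp() * predict_classes).sum(dim=1)

# Move off the GPU (if applicable) and convert to ordinary Python floats.
for score in scores.cpu().tolist():
    print(round(score, 3))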