hexiangnan / neural_collaborative_filtering

Neural Collaborative Filtering

Your evaluation method is unreasonable #39

Open · HaoZhang534 opened this issue 5 years ago

HaoZhang534 commented 5 years ago

It's unreasonable to blend the items in the test data with negative samples. It contradicts the rule that your evaluation input shouldn't have knowledge of the test data. I think your method is a cheat that sharply narrows down the scope of the ground truth, and it's UNFAIR to compare your results with those of eALS and BPR.

hexiangnan commented 5 years ago

You misunderstood the evaluation protocol. We randomly sample X negatives and blend them with the positive item. The model then ranks all X+1 examples, and we evaluate the position of the positive example. This evaluation procedure has nothing to do with training; there is no leak of test-data knowledge during training.
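
A minimal sketch of the leave-one-out protocol described above (illustrative, not the repository's exact code): `score` stands in for the trained model's prediction function, and the counts `num_neg=99` and `k=10` reflect the commonly used setting of ranking 100 candidates with a top-10 cutoff.

```python
import math
import random

def evaluate_one_user(score, user, pos_item, all_items, train_items,
                      num_neg=99, k=10):
    """Rank one held-out positive against num_neg sampled negatives."""
    # Sample num_neg items the user has not interacted with in training.
    negatives = set()
    while len(negatives) < num_neg:
        cand = random.choice(all_items)
        if cand != pos_item and cand not in train_items:
            negatives.add(cand)
    candidates = list(negatives) + [pos_item]
    # Rank all num_neg + 1 candidates by model score, best first.
    ranked = sorted(candidates, key=lambda i: score(user, i), reverse=True)
    rank = ranked.index(pos_item)    # 0-based position of the positive
    hit = 1.0 if rank < k else 0.0   # HR@k for this user
    # NDCG@k with a single relevant item: 1 / log2(rank + 2) if it made the cut.
    ndcg = 1.0 / math.log2(rank + 2) if rank < k else 0.0
    return hit, ndcg
```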

hexiangnan commented 5 years ago

Moreover, during the negative sampling stage of training, there is no access to the test data.

HaoZhang534 commented 5 years ago

But if you only use 4 negative samples, you can get a hit rate of 100% in top-5 ranking.

hexiangnan commented 5 years ago

Feel free to try, if you think you can.

xfflzl commented 5 years ago

Actually, evaluating the model on all items requires huge computational resources, especially for neural networks. Sampling a smaller item pool accelerates this process, though it may lead to more "inflated" results. So pay more attention to the relative comparison between different methods than to the absolute values of HR and NDCG.
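
To put rough numbers on that cost gap: for a dataset around the size of MovieLens-1M, one of the datasets used here (about 6,040 users and 3,706 items), ranking every item per user means on the order of 6,040 × 3,706 ≈ 22.4 million model scores per evaluation pass, while ranking 99 sampled negatives plus the positive costs 6,040 × 100 = 604,000, roughly a 37× reduction.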

beopst commented 4 years ago

@hexiangnan Your evaluation protocol does leak information about the negative samples in the test set. Given the total number of items, your code randomly samples from all item indices except the positive ones to obtain negative samples (see https://github.com/hexiangnan/neural_collaborative_filtering/blob/master/NeuMF.py#L144-L147). So the model can SEE observations (i.e., that a particular user did not rate an item) that may also exist in the test set. This should be carefully considered.
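
A minimal sketch of the training-time sampling pattern under discussion (illustrative, not the repository's exact code; `train` is assumed to be a dict mapping each user to the set of items they interacted with in training):

```python
import random

def get_train_instances_sketch(train, num_items, num_negatives=4):
    users, items, labels = [], [], []
    for user, pos_items in train.items():
        for pos in pos_items:
            users.append(user); items.append(pos); labels.append(1)
            # Negatives are drawn from every item lacking a *training*
            # interaction; nothing stops a pair that appears among the
            # test set's sampled negatives from also being drawn here.
            for _ in range(num_negatives):
                neg = random.randrange(num_items)
                while neg in pos_items:
                    neg = random.randrange(num_items)
                users.append(user); items.append(neg); labels.append(0)
    return users, items, labels
```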

GuoshenLi commented 3 years ago

> But if you only use 4 negative samples, you can get a hit rate of 100% in top-5 ranking.

Bro, you totally misunderstand the code and the negative samples used in training and in evaluation. The author is right.
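
For concreteness, a minimal sketch of the arithmetic behind that distinction (illustrative, not the repository's code): training draws a few negatives per positive only to build binary labels, while evaluation ranks the held-out positive against a separately sampled list of negatives, 99 in the commonly used setting, so a 100% hit rate is not built into the protocol.

```python
def random_hr_at_k(num_candidates: int, k: int) -> float:
    """Expected HR@K when the candidate list is ranked uniformly at random."""
    return min(k / num_candidates, 1.0)

# Evaluation as described above: 1 positive + 99 negatives, top-10 cutoff.
print(random_hr_at_k(100, 10))  # 0.1 -- a random scorer hits only 10% of the time

# The scenario from the earlier comment: 1 positive + 4 negatives, top-5 cutoff.
print(random_hr_at_k(5, 5))     # 1.0 -- trivial, but this is not how evaluation is run
```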