tanatosuu / GDE

SIGIR 2022 Paper: Less is More: Reweighting Important Spectral Graph Features for Recommendation

measurement difference #5

Open · huyson1810 opened this issue 1 year ago

huyson1810 commented 1 year ago

Dear authors, I'm very interested in your research and want to reproduce it to understand it better. However, I am having trouble with the evaluation results [ndcg@10, ndcg@20, recall@10, recall@20]. Normally the metrics should increase when going from top-10 to top-20, but here they decrease. Can you clarify this for me? Thank you so much.

[screenshot of evaluation results]

tanatosuu commented 1 year ago

Thank you for your interest in our research. If you look at the definitions of recall@k and ndcg@k, a larger k does not necessarily give higher accuracy. For example, suppose a user has 20 items in the test set and TP@10 = 3 (TP: true positives). Under the normalization we use (described below), recall@10 = 3/min(10, 20) = 0.3; if only one more true positive appears between ranks 10 and 20, then recall@20 = 4/20 = 0.2 < recall@10. I think the reason the metrics decrease as k increases has to do with the dataset split: in our paper we chose a sparse setting with training = 20%. I later also ran experiments with training = 80%, and there the accuracy increases as k increases.

One place where our evaluation differs from the original recall@k and ndcg@k is that we set k = min(k, number of test samples) for each user, to ensure the maximum of recall@k and ndcg@k is 1, via this code:

```python
all10 = 10 if test_lens > 10 else test_lens
all20 = 20 if test_lens > 20 else test_lens
```

You can simply change this to `all10 = 10` and `all20 = 20` if you want to compare the results against other methods under the original recall and ndcg definitions.
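For concreteness, here is a minimal sketch (mine, not the repo's evaluation code; the function and variable names are hypothetical) that reproduces the example above under both denominators:

```python
def recall_at_k(ranked_items, test_items, k, truncated=True):
    """recall@k; `truncated=True` uses min(k, #test items) as the denominator
    (this repo's variant), otherwise the standard denominator #test items."""
    hits = len(set(ranked_items[:k]) & set(test_items))  # TP@k
    denom = min(k, len(test_items)) if truncated else len(test_items)
    return hits / denom

# Toy user: 20 test items, 3 hits in the top 10, 1 more hit at rank 11.
test_items = list(range(20))
ranked = [0, 1, 2] + list(range(100, 107)) + [3] + list(range(107, 116))

print(recall_at_k(ranked, test_items, 10))                   # 3/10 = 0.30
print(recall_at_k(ranked, test_items, 20))                   # 4/20 = 0.20 < recall@10
print(recall_at_k(ranked, test_items, 10, truncated=False))  # 3/20 = 0.15
print(recall_at_k(ranked, test_items, 20, truncated=False))  # 4/20 = 0.20 (monotone)
```

Note that the standard recall@k is monotone non-decreasing in k, so the drop from recall@10 to recall@20 can only happen with the truncated denominator (ndcg@k, in contrast, can decrease in k even in its standard form, since the ideal DCG in the denominator grows with k).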

Based on my answers above, you may want to evaluate GDE under the usual training = 80% setting used extensively in many papers. Here is some information unrelated to your question: I found that GDE tends to perform worse as the training ratio increases, especially without the adaptive loss proposed in Sec. 3.4, and I think this has something to do with the training parameters. Under the sparse setting training = 20%, model training has a positive effect, whereas when the training ratio increases to 60% or 80%, training seems to hinder the model's convergence. In fact, if you simply remove the user/item embeddings when evaluating on training = 80% data, the performance is even better. You can try some unparameterized methods [1, 2]; you will find that they easily beat many so-called "SOTA" models, while they perform badly under sparse settings such as training = 20%, which is why I chose a sparse training setting in this work.

This observation only holds for the collaborative filtering task, since no side information other than interactions can be used. I just do not want you to be confused if the accuracy does not match the numbers reported in the paper under other training settings.

[1] How Powerful is Graph Convolution for Recommendation? (Shen et al., CIKM 2021)
[2] Embarrassingly Shallow Autoencoders for Sparse Data (Steck, WWW 2019)
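If you want to try [2], here is a minimal NumPy sketch of the EASE closed-form solution, which has no training loop at all. This is my sketch, not code from this repo; `X` and the default `lam` are assumptions, and `lam` should be tuned per dataset:

```python
import numpy as np

def ease(X, lam=500.0):
    """EASE [2]: closed-form item-item weight matrix B with zero diagonal.
    X: (num_users, num_items) binary interaction matrix; lam: L2 regularizer."""
    G = X.T @ X + lam * np.eye(X.shape[1])  # regularized Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)       # B = I - P * diag(1/diag(P)); off-diagonal part
    np.fill_diagonal(B, 0.0)  # enforce the zero-diagonal constraint
    return B

# Prediction scores are X @ B; mask items already seen in training before
# taking the top-k for recall/ndcg evaluation.
```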

huyson1810 commented 1 year ago

Thank you very much for your detailed explanation!!! I am going to look into these issues.