usaito / unbiased-implicit-rec-real

(WSDM2020) "Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback"
Apache License 2.0

About the evaluation process #2

Open Zziwei opened 4 years ago

Zziwei commented 4 years ago

Thank you for the nice work and for sharing your code.

I have a question about the evaluation process. I found that for the test set of the Yahoo! dataset, you treat user-item pairs with ratings larger than 3 as relevant (label 1). But after transforming the explicit data into the implicit version, you do not filter out users who have 0 relevant items. As a result, during evaluation you still count the metrics (Recall, DCG, MAP) for these users when averaging over all users. I do not think this makes sense, because we cannot evaluate the learned model on users who have 0 liked items. And I think that if you removed these users from the test set, the metric values would be higher than what you report.
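To make the point concrete, here is a minimal sketch of the averaging step I am referring to (this is not your actual code; the helper names and the DCG@k definition are my own):

```python
import numpy as np


def dcg_at_k(ranked_relevance, k=10):
    # DCG@k for one user: relevance of the top-k ranked items with a log discount.
    rel = np.asarray(ranked_relevance)[:k]
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))


def mean_dcg(per_user_relevance, k=10, drop_users_without_positives=False):
    # Average DCG@k over users. A user with zero relevant items in the test
    # set always scores exactly 0, so keeping such users can only lower the average.
    scores = []
    for rel in per_user_relevance:
        if drop_users_without_positives and np.sum(rel) == 0:
            continue
        scores.append(dcg_at_k(rel, k))
    return float(np.mean(scores)) if scores else 0.0
```

With drop_users_without_positives=True, the average can only stay the same or go up, which is why I believe the reported numbers are lower than they would be under the filtered protocol.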

Besides, you should also consider removing users from the test set who have no liked items in the training set.

The experimental setup in the paper 'Unbiased Offline Recommender Evaluation for Missing-Not-At-Random Implicit Feedback' is more reasonable.

Am I understanding this problem correctly? What's your opinion?

usaito commented 4 years ago

@Zziwei

Thank you for your comments!

> But after transforming the explicit data into the implicit version, you do not filter out users who have 0 relevant items. As a result, during evaluation you still count the metrics (Recall, DCG, MAP) for these users when averaging over all users.

Your understanding is mostly correct, but in our formulation, Y = 0 does not always mean that the corresponding user-item pair is irrelevant (R = 0), right? (This is the positive-unlabeled setting.) Therefore, even if a user does not click any item in the training data, some items may still be potentially relevant to that user. Thus, I did not filter out any users, in order to take these potentially relevant items into account during the training process.
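To illustrate the positive-unlabeled structure with a toy simulation (this is not the experiment code; the exposure and relevance probabilities below are made up), click data of the form Y = O * R records Y = 0 both when a pair is truly irrelevant and when it was simply never exposed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000

# Toy generative process: a click is recorded only when the item is both
# exposed (O = 1) and relevant (R = 1), i.e., Y = O * R.
relevance = rng.binomial(1, 0.3, n_pairs)  # R: true (unobserved) relevance
exposure = rng.binomial(1, 0.2, n_pairs)   # O: whether the user was exposed to the item
clicks = exposure * relevance              # Y: what the implicit-feedback log records

# Among unclicked pairs, a sizable fraction is still relevant,
# so Y = 0 cannot be read as R = 0.
print(relevance[clicks == 0].mean())
```

This is why a user with no clicks in the training data is not necessarily a user with no relevant items.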

> Besides, you should also consider removing users from the test set who have no liked items in the training set.

I know that this way of evaluation is used in many recsys papers. However, I do not see why this preprocessing is critical. For example, in real-world situations there can be users who no longer have any relevant items. My overall strategy in the real-world experiment was to use the original form of the real-world data as much as possible rather than applying unnecessary preprocessing, and thus I did not remove any users from the test set.

I would appreciate it if you could check whether the above discussion makes sense to you.

Thank you and best regards.