recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[BUG] BPR Model Evaluation using 'training' set and not 'test' to compute benchmarks #1061

Closed recsyslabs closed 4 years ago

recsyslabs commented 4 years ago

Description

As in any ML experimental setting, only the test dataset should be used to compute benchmarks. However, in the BPR deep dive notebook:

https://github.com/microsoft/recommenders/blob/master/notebooks/02_model/cornac_bpr_deep_dive.ipynb

In Section 3.4: Prediction and Evaluation, input cell 8, the following code uses the training set to compute the all_predictions variable, which is later used to compute the evaluation metrics:

all_predictions = predict_ranking(bpr, train, usercol='userID', itemcol='itemID', remove_seen=True)

Shouldn't the predictions be computed using the test set instead? The evaluation metrics change drastically when they are evaluated using the test set alone, as good practice dictates.

In which platform does it happen?

All platforms, since this is a potential conceptual ML 'bug'.

How do we replicate the issue?

Run the input cell 8 with the correct test data:

all_predictions = predict_ranking(bpr, test, usercol='userID', itemcol='itemID', remove_seen=True)

Expected behavior (i.e. solution)

Use the test dataset for model evaluation, never the training one.

The metrics results will change and need to be updated in the README.md of this repo.

gramhagen commented 4 years ago

@recsyslabs this section could benefit from a bit more description of what's happening. In the line you reference, the calculation of all_predictions generates predictions from the model for all possible user/item pairs (but ignores those that were observed during training). The only information taken from the train data is the unique lists of user ids and item ids.

There is actually a separate problem: if we removed all observations for a specific item or user during the train/test split, we might not generate all the necessary user-item pairs, so this line should really use the full dataset:

all_predictions = predict_ranking(bpr, data, usercol='userID', itemcol='itemID', remove_seen=True)

You will note that all the metrics used in the following section do use the test set, which is necessary to get unbiased results as you mention.
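For anyone skimming this thread, here is a rough sketch of the cross-join logic described above (not the library's actual implementation); score_fn is a hypothetical stand-in for however the trained model scores a single user-item pair:

```python
import itertools

import pandas as pd


def predict_ranking_sketch(score_fn, data, usercol="userID", itemcol="itemID", remove_seen=True):
    """Sketch: score every user-item pair that can be formed from `data`."""
    users = data[usercol].unique()
    items = data[itemcol].unique()

    # Dense grid of every user/item combination present in `data`.
    pairs = pd.DataFrame(list(itertools.product(users, items)), columns=[usercol, itemcol])

    if remove_seen:
        # Drop pairs already observed in `data` (a left anti-join).
        seen = data[[usercol, itemcol]].drop_duplicates()
        pairs = pairs.merge(seen, on=[usercol, itemcol], how="left", indicator=True)
        pairs = pairs.loc[pairs["_merge"] == "left_only"].drop(columns="_merge")

    # Score the remaining pairs with the (hypothetical) model scoring function.
    pairs["prediction"] = [score_fn(u, i) for u, i in zip(pairs[usercol], pairs[itemcol])]
    return pairs
```

The key point is that the data frame passed in only contributes the unique user and item ids (and, with remove_seen=True, the pairs to exclude); the scores themselves always come from the trained model.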

@tqtg do you think you could add a bit more to the description of this function to avoid confusion? Also, do you agree with my assessment that we should use the full dataset here?

tqtg commented 4 years ago

@gramhagen totally agree with your suggestion. I will PR an update to make things clearer as soon as I can.

tqtg commented 4 years ago

@gramhagen I revisited some of the tutorials and noticed that all_predictions is supposed to hold the prediction scores for all users and items that appear in the training data. For example: https://github.com/microsoft/recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb https://github.com/microsoft/recommenders/blob/master/notebooks/00_quick_start/ncf_movielens.ipynb

I think this behavior is expected: all_predictions = predict_ranking(bpr, train, usercol='userID', itemcol='itemID', remove_seen=True) rather than passing all the data: all_predictions = predict_ranking(bpr, data, usercol='userID', itemcol='itemID', remove_seen=True), because remove_seen=True would remove the predictions for the test set pairs as well.
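A toy illustration of the trade-off being discussed (not taken from the notebook): with only the train frame, a held-out pair can be missing because its item never appears in train, while with the combined frame the same pair is filtered out as "seen":

```python
import itertools

import pandas as pd

train = pd.DataFrame({"userID": ["u1", "u1", "u2", "u3"],
                      "itemID": ["i1", "i2", "i1", "i1"]})
test = pd.DataFrame({"userID": ["u3"], "itemID": ["i3"]})
data = pd.concat([train, test], ignore_index=True)


def unseen_pairs(df):
    """All user-item combinations formed from df, minus the pairs observed in df."""
    grid = set(itertools.product(df["userID"].unique(), df["itemID"].unique()))
    seen = set(zip(df["userID"], df["itemID"]))
    return grid - seen


print(("u3", "i3") in unseen_pairs(train))  # False: i3 never appears in train
print(("u3", "i3") in unseen_pairs(data))   # False: the test pair itself is removed as "seen"
```

Either way the held-out pair never makes it into the candidate set, which is why the choice of inputs to predict_ranking matters.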

recsyslabs commented 4 years ago

Hi @tqtg & @gramhagen, I still do not understand why it is required to compute predictions for all users and items in the training set. Given a train/test split with a ratio of 0.75/0.25, the goal as I understand it from the notebooks, considering an item prediction task, is to rank the items in the test set; that is, the ranking given by the model should be as close as possible to the true ranking observed in the test set.

Note that the all_predictions variable is passed to the metrics as the predicted labels, but is then compared with the true labels that appear only in the test set. Wouldn't the correct way be to compute all_predictions with the test set only, as predicted labels, with the flag remove_seen=False, given that the train/test split procedure already enforces the sets to be disjoint, and then use the resulting all_predictions from the test dataset to compute the metrics?

gramhagen commented 4 years ago

@recsyslabs this instance is a bit different from the typical supervised learning case because the training data is sparse.

let's consider this input:

user  item  rating
u1    i1    5
u1    i2    4
u2    i1    4
u3    i1    3
u3    i3    5

If we use an 80/20 split we will exclude one of the ratings, let's assume it is u3-i3.

so train is:

user  item  rating
u1    i1    5
u1    i2    4
u2    i1    4
u3    i1    3

and test is:

user  item  rating
u3    i3    5

When we go to make predictions we need to generate a dense set of predictions for all user-item combos. Only then can we rank the full set of possible recommended items for each user. If we are missing an item during these predictions it will not be included in the ranking and the metrics will be negatively impacted.

In this case, if we generate predictions and rankings on the train data, we would get something like:

user  item  true rating  predicted rating
u1    i1    5            4
u1    i2    4            3
u2    i1    4            4
u2    i2    -            3
u3    i1    3            4
u3    i2    -            3

So if we evaluate the ranking of the held-out test row u3-i3, we see that i3 is not in the top-1 spot, or the top-2; in fact it is not anywhere in the ranked list for u3. Nor is i3 in either of the other users' rankings, so they are impacted too.

The assumption being made here is that the recommender system knows all possible items at the time of recommendation, and we want to evaluate all of those recommendations against a known set of users. So, to me it would be clearer if the function that computes all_predictions took in a list of all users and a list of all items (or perhaps a list of test users, to be more efficient), plus the option to remove observed ratings. The reason remove_seen is still valid is that in an actual system there is no point in trying to predict ratings the user has already provided. Of course we could look at how close the predicted ratings are to the truth, but using the training data for this will likely lead to overconfident estimates, so those training user-item combos are ignored.
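Purely as a sketch of that suggestion (this is not an existing function in the repo), the interface could take the user and item lists explicitly and accept the observed interactions separately for filtering; score_fn is again a hypothetical stand-in for the model's scoring call:

```python
import itertools

import pandas as pd


def predict_ranking_explicit(score_fn, users, items, seen=None,
                             usercol="userID", itemcol="itemID"):
    """Score every (user, item) pair built from explicit lists.

    `seen` is an optional DataFrame of observed interactions to exclude.
    """
    pairs = pd.DataFrame(list(itertools.product(users, items)), columns=[usercol, itemcol])
    if seen is not None:
        # Keep only pairs that were never observed in `seen`.
        observed = set(zip(seen[usercol], seen[itemcol]))
        keep = [(u, i) not in observed for u, i in zip(pairs[usercol], pairs[itemcol])]
        pairs = pairs.loc[keep]
    pairs["prediction"] = [score_fn(u, i) for u, i in zip(pairs[usercol], pairs[itemcol])]
    return pairs


# e.g. rank over all users/items in the full dataset while filtering only training interactions:
# all_predictions = predict_ranking_explicit(model_score, data["userID"].unique(),
#                                            data["itemID"].unique(), seen=train)
```

This keeps the candidate set independent of how the split happened to land, while still excluding only the interactions the model was trained on.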

Depending on the use case you may or may not want to recommend items a user has previously rated. For example, a movie recommendation system typically does not recommend items the user has previously rated (and is assumed to have watched), but this could be different for other systems, such as recommending food, which a user may want to try again. So, based on the use case, remove_seen could be disabled to incorporate all items in the ranking.

I hope this clarifies things. @tqtg based on the logic above, I feel those other cases are also wrong (though it's possible that the random sampling has provided a full list of users and items, so it may not have an actual impact). As I mentioned, I think a clearer approach would be to remove the data input and explicitly request the set of users and items to use when generating predictions. This is something that could be changed in all notebooks.

tqtg commented 4 years ago

@gramhagen indeed, all notebooks need to be standardized in terms of evaluation protocol. The models have to give predictions for users and items in the test data even if they didn't appear in the training data. I also notice that the benchmark notebook needs to be changed so that all models can provide predictions for unknown users/items. One possible solution is that the prepare_training_data function should take both train and test as arguments.

recsyslabs commented 4 years ago

Thanks @gramhagen, your explanation helps a lot to clarify the evaluation protocol. An example like this in the notebooks would be very useful.

marchezinixd commented 4 years ago

I don't know if I should create another issue, but since this is a recommendation system it would be interesting to have a function that recommends the top k in a more efficient way than stacking all user x item pairs, sorting, and then filtering. I implemented one myself, and I think this would add value for new users.
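For reference, one common way to avoid materialising and sorting the full user x item frame is to work on a dense score matrix and take the top k per user with np.argpartition. A minimal sketch, assuming the model can produce an [n_users, n_items] array of scores:

```python
import numpy as np


def top_k_per_user(scores, k, seen_mask=None):
    """Return the indices of the top-k items per user from an [n_users, n_items] score matrix.

    `seen_mask` is an optional boolean array of the same shape marking
    already-observed pairs to exclude from the ranking.
    """
    scores = scores.copy()
    if seen_mask is not None:
        scores[seen_mask] = -np.inf  # never recommend items the user has already seen

    # argpartition finds the k largest per row in O(n_items); only those k are then sorted.
    top_unsorted = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, top_unsorted, axis=1), axis=1)
    return np.take_along_axis(top_unsorted, order, axis=1)


# Example with 3 users and 5 items:
scores = np.array([[0.1, 0.9, 0.3, 0.2, 0.8],
                   [0.5, 0.4, 0.7, 0.6, 0.1],
                   [0.2, 0.3, 0.1, 0.9, 0.4]])
print(top_k_per_user(scores, k=2))  # [[1 4], [2 3], [3 4]]
```

Whether this fits depends on being able to hold the dense score matrix in memory, but for MovieLens-sized data it is usually much cheaper than building and sorting the full stacked prediction frame.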

yueguoguo commented 4 years ago

Evaluation protocol: In the benchmark notebook, we standardized data splitting by stratifying items for each user, but this does not guarantee that all items will be present in both the training and testing data sets. In a CF recommender, many times we have to sample a subset of the whole user-item matrix for training, which may lead to the loss of some users and/or items from the original data set.

Remove seen: A model trained on the training data split in the above way does not know which items users have interacted with if only the testing data is given as input. The whole dataset should therefore be used to remove the seen items.