recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[BUG] Ranking Evaluation Metrics Exceed 1 with "by_threshold" Relevancy Method #2154

Open mnhqut opened 2 weeks ago

mnhqut commented 2 weeks ago

Description

Hello!

I encountered an issue while evaluating a BPR (Bayesian Personalized Ranking) model, using essentially the same code as the example notebook but on a different dataset. Specifically, when using the "by_threshold" relevancy method with the ranking metrics, the computed values for precision@k, ndcg@k, and map@k exceed 1, which should not be possible since all three metrics are bounded by 1. The issue does not occur when switching the relevancy method to "top_k".

How do we replicate the issue?

I use the following parameters for BPR (with the default seed):

import cornac

# BPR hyperparameters (default seed)
bpr = cornac.models.BPR(
    k=200,
    max_iter=100,
    learning_rate=0.01,
    lambda_reg=0.001,
    verbose=True
)
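
For completeness, all_predictions is generated roughly as in the BPR deep dive notebook (the userID/itemID column names below are assumptions; adjust them to the dataset):

from recommenders.models.cornac.cornac_utils import predict_ranking

# build the cornac train set from (user, item, rating) triples and fit BPR
train_set = cornac.data.Dataset.from_uir(train.itertuples(index=False))
bpr.fit(train_set)

# score all (user, item) pairs not seen during training;
# userID/itemID are assumed column names here
all_predictions = predict_ranking(bpr, train, usercol="userID", itemcol="itemID",
                                  predcol="prediction", remove_seen=True)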

Using this evaluation code:

from recommenders.evaluation.python_evaluation import (
    map_at_k,
    ndcg_at_k,
    precision_at_k,
)

TOP_K = 10
threshold = 50

eval_map = map_at_k(test, all_predictions, col_prediction="prediction",
                    relevancy_method="by_threshold", threshold=threshold, k=TOP_K)
eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction="prediction",
                      relevancy_method="by_threshold", threshold=threshold, k=TOP_K)
eval_precision = precision_at_k(test, all_predictions, col_prediction="prediction",
                                relevancy_method="by_threshold", threshold=threshold, k=TOP_K)

Here is the dataset I test on: https://github.com/mnhqut/rec_sys-dataset/blob/main/data.csv

My results: MAP: 1.417529, NDCG: 1.359902, Precision@K: 2.256466


mnhqut commented 2 weeks ago

I forgot to mention: the way I split the training and testing data was:

from recommenders.datasets.python_splitters import python_stratified_split

train, test = python_stratified_split(df, ratio=0.75)

miguelgfierro commented 2 weeks ago

We need to review this @SimonYansenZhao @anargyri @daviddavo @loomlike and even @yueguoguo

daviddavo commented 2 weeks ago

How do you get the all_predictions? Can you provide the full code to reproduce the issue? Or are you just using the deep dive notebook?

daviddavo commented 2 weeks ago

The problem seems to be that when you use by_threshold, some terms in the metric formulas still use k (the top_k value) instead of the number of items kept by the threshold.

For example, in precision@k:

https://github.com/recommenders-team/recommenders/blob/4f86e4785337d455aa3cb7e8920c3fab9a2a0140/recommenders/evaluation/python_evaluation.py#L496-L496

It divides by 10 (the default k value) instead of by 50 (the threshold specified with by_threshold).

The other metrics have similar problems.
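
To make it concrete, here is a tiny synthetic example (made-up data, not the reporter's dataset) where precision_at_k goes above 1 with by_threshold:

import pandas as pd
from recommenders.evaluation.python_evaluation import precision_at_k

# one user with 3 relevant items in the ground truth
truth = pd.DataFrame({"userID": [1, 1, 1],
                      "itemID": [10, 11, 12],
                      "rating": [5, 4, 3]})

# predictions for 30 items; all 3 relevant items rank inside the top 30
preds = pd.DataFrame({"userID": [1] * 30,
                      "itemID": list(range(10, 40)),
                      "prediction": [1.0 - 0.01 * i for i in range(30)]})

# by_threshold keeps the top-30 predictions per user, but the divisor stays k=2,
# so this prints 1.5 (3 hits / 2) even though precision@k should be <= 1
print(precision_at_k(truth, preds, col_prediction="prediction",
                     relevancy_method="by_threshold", threshold=30, k=2))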

Perhaps this is what by_threshold is intended to do. Is it a way of changing how many recommended items are considered, even though you are still calculating at k? I don't really understand how by_threshold is supposed to work, so I can't tell whether this is a bug or intended behaviour.

I can solve the bug by just using threshold instead of k when necessary, but then by_threshold and top_k would be exactly the same.

Btw, here is a notebook that replicates the issue in Google Colab