recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[BUG] Evaluation function ndcg_at_k returns incorrect values #1749

Closed: riosv closed this issue 11 months ago

riosv commented 2 years ago

Description

The function ndcg_at_k returns incorrect values.

For example, using the example described in https://en.wikipedia.org/wiki/Discounted_cumulative_gain:

In response to a search query, the system returns 6 documents, labelled from D1 to D6. The user provides the true relevance scores for those documents as well as for two more documents (D7 and D8) as follows: 3, 2, 3, 0, 1, 2, 3, 2.

As described on the wiki page, the correct value of NDCG@6 should be 0.785, but ndcg_at_k returns 1.0.

The cause appears to be that, in the function's body, DCG and NDCG are always calculated with 1 as the numerator, while the formula requires the true relevance score. For example, this is how DCG is calculated:

df_dcg["dcg"] = 1 / np.log1p(df_dcg["rank"])

In which platform does it happen?

MacOS Python 3.9.12 Recommenders 1.1.0

How do we replicate the issue?

import math

import numpy as np
import pandas as pd
from recommenders.evaluation.python_evaluation import ndcg_at_k

df_true = pd.DataFrame(
    {
        "user_id": np.full(8, 0, dtype=int),
        "item_id": np.arange(8),
        "rating": np.asarray([3, 2, 3, 0, 1, 2, 3, 2]),
    }
)

df_pred = pd.DataFrame(
    {
        "user_id": np.full(6, 0, dtype=int),
        "item_id": np.arange(6),
        "prediction": np.asarray([6, 5, 4, 3, 2, 1]),
    }
)

score = ndcg_at_k(
    rating_true=df_true,
    rating_pred=df_pred,
    col_user="user_id",
    col_item="item_id",
    col_rating="rating",
    col_prediction="prediction",
    k=6,
)

print(score)
assert math.isclose(score, 0.785, abs_tol=1e-3)

Expected behavior (i.e. solution)

For the example described above, ndcg_at_k should return 0.785.

Other Comments

deep-chakraborty commented 2 years ago

I would like to add the following points:

To understand it better, I made a small toy example based on your example notebook here and tried to test it with your functions:

import pandas as pd
from recommenders.evaluation.python_evaluation import ndcg_at_k

df_true = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "item_id": [1, 2, 3],
        "rating": [5, 4, 3]
    }
)

df_pred = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "item_id": [3, 2, 1],
        "prediction": [14, 13, 12]
    }
)

K = 3

HEADER = {
    "col_user": "user_id",
    "col_item": "item_id",
    "col_rating": "rating",
    "col_prediction": "prediction",
}

print(f"The ndcg at k={K} is {ndcg_at_k(df_true, df_pred, k=K, relevancy_method='top_k', **HEADER)}")
# output: The ndcg at k=3 is 1.0

The result looks really strange, because I intentionally reversed the ordering of the recommendations. As far as I understood, the whole point of using this metric is to penalize the recommender more when it ranks more important/relevant documents lower.
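For comparison, here is what a rating-aware nDCG gives for the same reversed toy example (my own cross-check using sklearn, not part of the repo):

from sklearn.metrics import ndcg_score

# true relevance of items 1, 2, 3 and the (reversed) predicted scores
y_true = [[5, 4, 3]]
y_score = [[12, 13, 14]]  # item 3 ranked first, item 1 last
print(ndcg_score(y_true, y_score, k=3))  # ~0.89, i.e. the reversed order is penalized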

I also tried to repeat the same activity using the SparkRankingEvaluation class:

from recommenders.evaluation.spark_evaluation import SparkRankingEvaluation
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_pred_sp = spark.createDataFrame(df_pred)
df_true_sp = spark.createDataFrame(df_true)

print(f"The ndcg at k={K} is {SparkRankingEvaluation(df_true_sp, df_pred_sp, k=K, relevancy_method='top_k', **HEADER).ndcg_at_k()}")
# output: The ndcg at k=3 is 1.0

On a separate note, the ndcgAt(k) method of pyspark.mllib.evaluation.RankingMetrics implements only the exponential form, yet uses the natural logarithm, which is also inconsistent with the wiki:

Note that Croft et al. (2010) and Burges et al. (2005) present the second DCG with a log of base e, while both versions of DCG above use a log of base 2. When computing NDCG with the first formulation of DCG, the base of the log does not matter, but the base of the log does affect the value of NDCG for the second formulation. Clearly, the base of the log affects the value of DCG in both formulations.
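For reference, the two DCG formulations being discussed could be sketched as follows (my own paraphrase of the textbook formulas, both written with log base 2 as on the wiki page):

import math
from typing import List

def dcg_linear(rels: List[float]) -> float:
    # "traditional" formulation: the gain is the raw relevance score
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def dcg_exponential(rels: List[float]) -> float:
    # alternative formulation: the gain is 2^rel - 1, emphasising highly relevant items
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))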

Please correct me if I misunderstood.


Please treat this issue with higher priority, as this is an important metric that is used to compare all the benchmarks.


Platform details: Ubuntu-20.04 Python 3.8.10 numpy==1.22.4 pandas==1.4.2 pyspark==3.3.0 recommenders==1.1.0

deep-chakraborty commented 2 years ago

Could we have an update here?

yueguoguo commented 2 years ago

@riosv I guess the example on the wiki page differs from our case in terms of how "relevance" is defined. On the wiki page, a document can have 4 different relevance scores (0, 1, 2, and 3), while in our case the relevance score is only 0 or 1, where 1 indicates a hit of the recommendation. A 0 contributes no gain, so when calculating the cumulative gain only the 1s are counted.

yueguoguo commented 2 years ago

@deep-chakraborty in your example, the NDCG is scored as 1.0 because the predictions always find a "relevant" item in the ground truth, no matter how you change the order. If you mock an irrelevant item in the predictions, say "item 4", and position it in different places in the prediction dataframe, then you should be able to see the effect of the order.
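A sketch of the experiment being suggested (my own illustration, not from the thread; the exact scores depend on the repo's hit-based implementation and are not asserted here):

import pandas as pd
from recommenders.evaluation.python_evaluation import ndcg_at_k

df_true = pd.DataFrame({"user_id": [1, 1, 1], "item_id": [1, 2, 3], "rating": [5, 4, 3]})

# item 4 is not in the ground truth, i.e. it is an irrelevant recommendation
for items in ([4, 1, 2], [1, 4, 2], [1, 2, 4]):
    df_pred = pd.DataFrame(
        {"user_id": [1, 1, 1], "item_id": items, "prediction": [3.0, 2.0, 1.0]}
    )
    score = ndcg_at_k(
        df_true, df_pred, k=3, relevancy_method="top_k",
        col_user="user_id", col_item="item_id",
        col_rating="rating", col_prediction="prediction",
    )
    # the score should rise as the irrelevant item moves down the list
    print(items, score)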

deep-chakraborty commented 2 years ago

@yueguoguo

@deep-chakraborty in your example, the NDCG is scored as 1.0 because the predictions can always find a "relevance" in the ground truth no matter how they change the order.

So if I understood you correctly, changing the order of the recommendations shouldn't affect the score? This does not sound very intuitive, tbh. Also, if the relevance being considered is always 1.0, then the ground-truth items don't have any ordering either, even though they should always be ordered according to their "relevance"; and yes, the same relevance that is used to order them should be used to calculate the DCG of the recommendations.

To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of all known relevance judgments.

Could you please cite a paper or example which you are using as a reference for your implementation, if it's not the wiki?

yueguoguo commented 2 years ago

@deep-chakraborty the implementation was from Spark MLlib as stated in the docstring.

For your question, I think that is a known limitation of NDCG. See the following from the wiki page.

Normalized DCG does not penalize for missing documents in the result. For example, if a query returns two results with scores 1,1,1 and 1,1,1,1,1 respectively, both would be considered equally good, assuming ideal DCG is computed to rank 3 for the former and rank 5 for the latter. One way to take into account this limitation is to enforce fixed set size for the result set and use minimum scores for the missing documents. In previous example, we would use the scores 1,1,1,0,0 and 1,1,1,1,1 and quote nDCG as nDCG@5.

That is to say, in your example all the prediction results get relevance scores of 1, 1, 1, meaning there is no point in ranking them unless there is a missing or false prediction in the results, i.e., a 0. I think the meaning behind this is that, when we recommend k items to users, we only care about the order when irrelevant items are recommended.
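To make this concrete, a small sketch of the arithmetic (mine, not the repo's): with constant hit-based gains the DCG of the top-k is the same for every ordering, so it equals the ideal DCG whenever all top-k items are hits, whereas rating-based gains do change with the order.

import math

discounts = [math.log2(i + 2) for i in range(3)]  # discounts for ranks 1..3

# hit-based gains: any ordering of three hits yields the same DCG, hence nDCG = 1.0
print(sum(g / d for g, d in zip([1, 1, 1], discounts)))

# rating-based gains: the order changes the DCG
print(sum(g / d for g, d in zip([5, 4, 3], discounts)))  # ideal ordering, ~9.02
print(sum(g / d for g, d in zip([3, 4, 5], discounts)))  # reversed ordering, ~8.02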

deep-chakraborty commented 2 years ago

@yueguoguo

Thanks for citing the wiki; I am indeed aware of the limitations of nDCG.

I am not sure we understand each other completely. My concern was not about "missing" documents but rather about the "order" of the documents. I would like to ask again: how do we "sort" the ground truth? I hope we are not confusing "hit" and "ranking" metrics here; they are correctly defined in the example notebook.

[screenshot from the example notebook showing the definitions of the hit and ranking metrics]

The "relevance" should be always defined using the scores/ratings that the user has given.

Secondly, about the reference: I don't think you are completely following Spark MLlib's implementation here, since they use the exponential form. In addition, I also pointed out an inconsistency in Spark's implementation regarding the choice of the base of the logarithm.

yueguoguo commented 2 years ago

Thanks @deep-chakraborty. Here are my thoughts.

If the top-k results contain the irrelevant items then order matters because we want to know how "high" the position is with regard to the predicted relevant item in the top-k; otherwise we don't need to consider the order because the user will anyway interact with all of them.

Hope it makes sense.

deep-chakraborty commented 2 years ago

@yueguoguo

I am struggling to understand these two statements

If the top-k results contain the irrelevant items then order matters because we want to know how "high" the position is with regard to the predicted relevant item in the top-k;

If I understood you correctly, you are saying "order only matters in case of irrelevant predicted items". Is that correct?

otherwise we don't need to consider the order because the user will anyway interact with all of them

This sounds like a "version" of nDCG implemented only for this repo, and it does not align with many use cases out there. I think we should take a step back and revisit why we are trying to measure the goodness of the model with this metric in the first place.

For example, if a user has seen 10 movies out of 1000 movies on Netflix, it is not fair to assume that the user "is going to interact anyway" with all those 990 movies that he has not watched. In that case, it is very important that the sorting of the recommended movies is "correct".

For evaluation purposes, we sort the ground truth, say 200 of the remaining 990 movies, according to the ratings (explicit feedback) given by the user; that is how the "actual relevance" of each movie is taken into account. We then ask any model/algorithm to tell us which of these 200 movies the user should watch today (first), next week (second), and so on, based on its own internal representation of importance expressed as "predicted ratings". Since we already know in what order the user would "actually" watch the movies, we want to compare the two ordered lists and quantify the difference. There are many ways to quantify it, of which nDCG is one.

So if we are unable to measure this "goodness" with the implementation of nDCG in this repo, then I can't think of any good reason why this implementation should be used at all, especially considering that the "hit"-related metrics (as shown in the example notebook and in the screenshot) do exactly the same thing.

In light of these points, I would strongly suggest taking sklearn.metrics.ndcg_score as the reference instead of Spark MLlib, since they have also taken the relevant papers into account (follow the discussion thread here), OR using it directly under the hood with a wrapper that handles the Spark/pandas dataframes.
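As a quick sanity check (mine, not from the thread), sklearn.metrics.ndcg_score reproduces the 0.785 from the wiki example in the original report:

from sklearn.metrics import ndcg_score

# true relevance of D1..D8; the system ranks D1..D6 highest (decreasing scores)
y_true = [[3, 2, 3, 0, 1, 2, 3, 2]]
y_score = [[8, 7, 6, 5, 4, 3, 2, 1]]
print(ndcg_score(y_true, y_score, k=6))  # ~0.785, matching the wiki example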

deep-chakraborty commented 2 years ago

A sample implementation from my end would be the following

from math import log2
from typing import List
from typing import Literal
from typing import Optional
from typing import Union

import pandas as pd
from sklearn.metrics import ndcg_score

def _ndcg(
    true: List[Union[str, int]],
    pred: List[Union[str, int]],
    ratings: Optional[List[Union[int, float]]] = None,
    k: int = 3,
    form: Literal["linear", "exp", "identity"] = "linear",
) -> float:

    k_ = min(len(true), len(pred), k)
    if ratings:
        assert len(ratings) >= k_, "All true labels must have a rating"
        assert ratings == sorted(ratings, reverse=True), "Ratings must be sorted in descending order"
        rel_ = lambda x: ratings[true.index(x)] if x in true else 0  # noqa
    else:
        # make "rank" as the rating - useful in case of implicit feedback cases
        rel_ = (
            lambda x: (len(true) - true.index(x) + 1) if x in true else 0
        )  # noqa

    if form == "exp":
        rel = lambda x: 2 ** rel_(x) - 1  # noqa
    elif form == "identity":
        # this is the current implementation
        rel = lambda x: 1.0 if x in true else 0.0  # noqa
    else:
        rel = rel_
    disc = lambda i: log2(i + 2)  # noqa

    dcg_at_k_ = ideal_dcg_at_k_ = 0
    for i, (ele, ele2) in enumerate(zip(pred[:k_], true[:k_])):
        dcg_at_k_ += rel(ele) / disc(i)  # type: ignore
        ideal_dcg_at_k_ += rel(ele2) / disc(i)  # type: ignore

    return dcg_at_k_ / ideal_dcg_at_k_

def ndcg(
    df_true: pd.DataFrame,
    df_pred: pd.DataFrame,
    user_col_name: str = "user_id",
    item_col_name: str = "item_id",
    rating_col_name: str = "rating",
    prediction_col_name: str = "prediction",
    top_k: int = 3,
    form: Literal["linear", "exp", "identity"] = "linear",
    consider_actual_ratings: bool = False,
    show_debug_output: bool = False,
) -> float:
    def agg_sort(df: pd.DataFrame, sort_col_name: str):
        return (
            df.sort_values(by=[user_col_name, sort_col_name], ascending=False)
            .groupby(user_col_name)
            .agg({item_col_name: list, sort_col_name: list})
            .rename(
                columns={
                    item_col_name: "pred" if "prediction" in sort_col_name else "true"
                }
            )
        )

    df_pred_ = agg_sort(df_pred, prediction_col_name)
    df_true_ = agg_sort(df_true, rating_col_name)
    df_final_ = df_true_.join(df_pred_, on=user_col_name, how="inner")

    if consider_actual_ratings:
        df_final_["ndcg"] = df_final_.apply(
            lambda x: _ndcg(x["true"], x["pred"], x["rating"], k=top_k, form=form),
            axis=1,
        )
    else:
        df_final_["ndcg"] = df_final_.apply(
            lambda x: _ndcg(x["true"], x["pred"], k=top_k, form=form), axis=1
        )

    if show_debug_output:
        print(df_final_)
    return df_final_["ndcg"].mean()

if __name__ == "__main__":
    df_true = pd.DataFrame(
        {
            "user_id": [1, 1, 1],
            "item_id": ["a", "b", "c"], 
            "rating": [5, 4, 3],
        }
    )
    df_pred = pd.DataFrame(
        {
            "user_id": [1, 1, 1],
            "item_id": ["a", "b", "c"],
            "prediction": [13.1, 14.2, 12.1],
        }
    )

    assert round(ndcg(
            df_true,
            df_pred,
            form="linear",
            consider_actual_ratings=True
        ), 6) == round(
            ndcg_score(
                y_true = [[5, 4, 3]], 
                y_score= [[13.1, 14.2, 12.1]]
            ),
            6
        )

Of course, there are many corner cases that have not been taken care of in the above implementation, but it is quite simple to understand and is consistent with other benchmark implementations out there (sklearn in this case). I hope it makes things clearer; I would also be happy to contribute if needed.
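For illustration, assuming the _ndcg sketch above, the form parameter is what switches between the rating-aware and the current hit-based behaviour:

# rating-aware: the reversed ordering is penalized (roughly 0.89)
print(_ndcg(true=["a", "b", "c"], pred=["c", "b", "a"], ratings=[5, 4, 3], k=3, form="linear"))

# hit-based (current behaviour): every hit counts as 1, so the score stays 1.0
print(_ndcg(true=["a", "b", "c"], pred=["c", "b", "a"], ratings=[5, 4, 3], k=3, form="identity"))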

yueguoguo commented 2 years ago

[quotes @deep-chakraborty's previous comment in full]

Quick response: I agree, and this is in line with my points in the earlier response. That is, if ratings are considered as relevance, we should involve them in calculating nDCG, which gives us deeper insight into the recommendation quality w.r.t. user preferences, i.e., ratings. NOTE: we still want to keep the hits-based approach, because explicit ratings are not always obtainable or reliable.

I am thinking of an extension of the existing implementation. Thanks for your reply with code proposing a possible implementation 😄 Something we can consider here is making it generic, e.g., the ndcg function offers a parameter that lets the user apply either the hits-based calculation or the rating-based calculation, depending on whether ratings are available.

@miguelgfierro @gramhagen @anargyri you may want to take a look at the thread.

deep-chakraborty commented 2 years ago

@yueguoguo

Thank you for your response. I would like to add one point regarding

e.g., the function ndcg offers a parameter that allows the user to apply either hits-based calculation or rating-based calculation, depending on whether rating is available or not.

For that we can simply pass form="identity" in my implementation. Looking forward to more feedback. :)

anargyri commented 2 years ago

As @yueguoguo said, we use the Spark MLlib definitions; a similar question arose with other metrics like MAP, see #1702. The main thing is that, in the absence of any relevance information, one needs to use the constant value 1. Also, it is not known how the data scientist will compute relevance in their specific scenario: it could be based on ratings, or it could be something different. In fact, ratings are not available in all recommendation scenarios. I view this as a potential feature to add, i.e. we may want to include an optional dataframe argument for the relevance values in the metric definitions, with the default being the current computation.
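A rough sketch of what such an optional relevance argument could look like (hypothetical function and column names, mine, not the repo's API):

from typing import Optional

import numpy as np
import pandas as pd

def ndcg_at_k_sketch(
    rating_true: pd.DataFrame,
    rating_pred: pd.DataFrame,
    relevance: Optional[pd.DataFrame] = None,  # optional (user, item, relevance) frame
    col_user: str = "user_id",
    col_item: str = "item_id",
    col_prediction: str = "prediction",
    col_relevance: str = "relevance",
    k: int = 10,
) -> float:
    def per_user(user, group):
        # top-k predicted items for this user, best prediction first
        top_k = group.nlargest(k, col_prediction)[col_item].tolist()
        true_items = rating_true.loc[rating_true[col_user] == user, col_item]
        if relevance is None:
            # current hit-based behaviour: every matched item gets gain 1
            gains = {item: 1.0 for item in true_items}
        else:
            rel_u = relevance[relevance[col_user] == user]
            gains = dict(zip(rel_u[col_item], rel_u[col_relevance]))
        discounts = 1.0 / np.log2(np.arange(2, k + 2))
        dcg = sum(gains.get(item, 0.0) * d for item, d in zip(top_k, discounts))
        ideal_gains = sorted(gains.values(), reverse=True)[:k]
        idcg = sum(g * d for g, d in zip(ideal_gains, discounts))
        return dcg / idcg if idcg > 0 else 0.0

    return float(np.mean([per_user(user, group) for user, group in rating_pred.groupby(col_user)]))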

deep-chakraborty commented 2 years ago

@anargyri

Thanks a lot for participating in this thread. Regarding the point that you raised:

And it is not known how the data scientist will compute relevance in their specific scenario, it could be based on ratings or it could be something different

I think for the datasets (MovieLens in most cases) where we are not in an "implicit feedback" scenario and already have ratings from the user, it is quite fair to use these ratings to sort the ground-truth list of items, all the more so when we are using this dataset to benchmark most SOTA reco methods.

In fact, ratings are not available in all recommendation scenarios.

I totally agree; this is indeed the case in my current projects at work.

anargyri commented 11 months ago

If there are no objections, I will close this issue since the above PR has been merged. @miguelgfierro @loomlike

miguelgfierro commented 11 months ago

Yeah