Closed riosv closed 11 months ago
I would like to add the following points:
log1p
is being used when as per theory it should be log2
?In order to understand it better I made a small toy example out of your example notebook here and tried to test them with your functions
import pandas as pd
from recommenders.evaluation.python_evaluation import ndcg_at_k
df_true = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": [1, 2, 3],
"rating": [5, 4, 3]
}
)
df_pred = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": [3, 2, 1],
"prediction": [14, 13, 12]
}
)
K = 3
HEADER = {
"col_user": "user_id",
"col_item": "item_id",
"col_rating": "rating",
"col_prediction": "prediction",
}
print(f"The ndcg at k={K} is {ndcg_at_k(df_true, df_pred, k=K, relevancy_method='top_k', **HEADER)}")
# output: The ndcg at k=3 is 1.0
The results look really strange because I intentionally reversed the ordering of the recos and as far as I understood the whole point of using this metric was to catch if more important/relevant documents were ranked lower by the recommender then it should be penalized more.
I also tried to repeat the same activity using the SparkRankingEvaluation
class:
from recommenders.evaluation.spark_evaluation import SparkRankingEvaluation
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_pred_sp = spark.createDataFrame(df_pred)
df_true_sp = spark.createDataFrame(df_true)
print(f"The ndcg at k={K} is {SparkRankingEvaluation(df_true_sp, df_pred_sp, k=K, relevancy_method='top_k', **HEADER).ndcg_at_k()}")
# output: The ndcg at k=3 is 1.0
On a separate note the ndcgAt(k)
method of pyspark.mllib.evaluation.RankingMetrics
only implements the exponential form but also uses the natural logarithm which is also not consistent as per the wiki
Note that Croft et al. (2010) and Burges et al. (2005) present the second DCG with a log of base e, while both versions of DCG above use a log of base 2. When computing NDCG with the first formulation of DCG, the base of the log does not matter, but the base of the log does affect the value of NDCG for the second formulation. Clearly, the base of the log affects the value of DCG in both formulations.
Please correct me if I misunderstood.
A few references:
scikit-learn
under sklearn.metrics.ndcg_score
but it takes slightly different inputs.Please treat this issue with a higher priority as this is an important metric that is being used to compare all the benchmarks.
OFF TOPIC FEATURE REQUEST
Platform details: Ubuntu-20.04 Python 3.8.10 numpy==1.22.4 pandas==1.4.2 pyspark==3.3.0 recommenders==1.1.0
Could we have an update here?
@riosv I guess the example on the wiki page is different from our case, in terms of how "relevance" is defined. That is, on the wiki page, the document has 4 different types of relevance scores, which are 0, 1, 2, and 3, while in our case, the relevant score has only 0 and 1 where 1 indicates a hit of the recommendation. 0 leads to no gain here so when calculating the cumulative gain, only 1 is counted.
@deep-chakraborty in your example, the NDCG is scored as 1.0 because the predictions can always find a "relevance" in the ground truth no matter how they change the order. If you can mock an irrelevant item in the predictions, say "item 4", and position it into different places in the prediction dataframe, then you should be able to see the matter of order.
@yueguoguo
@deep-chakraborty in your example, the NDCG is scored as 1.0 because the predictions can always find a "relevance" in the ground truth no matter how they change the order.
So if I understood you correctly - changing the order of the recommendations shouldn't affect the score? This does not sound very intuitive tbh. Also if relevance being considered is always 1.0 then that means that the ground-truth items also don't have any ordering as they should always be ordered according to their "relevance" and yes the same relevance that is used to order them should be used to calculate the DCG of the recommendations.
To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of all known relevance judgments.
Could you please cite a paper or example which you are using as reference for your implementation if its not the wiki?
@deep-chakraborty the implementation was from Spark MLlib as stated in the docstring.
For your question, I think that is a known limitation of NDCG. See the following from wiki page.
Normalized DCG does not penalize for missing documents in the result. For example, if a query returns two results with scores 1,1,1 and 1,1,1,1,1 respectively, both would be considered equally good, assuming ideal DCG is computed to rank 3 for the former and rank 5 for the latter. One way to take into account this limitation is to enforce fixed set size for the result set and use minimum scores for the missing documents. In previous example, we would use the scores 1,1,1,0,0 and 1,1,1,1,1 and quote nDCG as nDCG@5.
That is to say, in your example, all the prediction results get the relevance scores of 1, 1, 1, meaning that there is no point to rank them unless there is a missing or false prediction in the results, i.e., 0. I think the meaning behind this is that, when we recommend k items to users, we only care about the order when there are irrelevant items recommended.
@yueguoguo
Thanks for citing the wiki I am indeed aware of the limitations of the nDCG.
I am not sure we understand each other completely. My concern was not regarding "missing" documents but rather the "order" of the documents. I would like to ask again - how do we "sort" the ground truth? I hope we are not confusing between "hit" and "ranking" metrics here. It has been correctly defined in the example notebook.
The "relevance" should be always defined using the scores/ratings that the user has given.
Secondly, about the reference - I don't think you are completely following the Spark MLlib's implementation here, they are using the exponential form. In addition to this, I also pointed out an inconsistency in the Spark's implementation as well regarding the selection of the base of the logarithm.
Thanks @deep-chakraborty. Here are my thoughts.
log1p
. Hope I it makes sense.
@yueguoguo
I am struggling to understand these two statements
If the top-k results contain the irrelevant items then order matters because we want to know how "high" the position is with regard to the predicted relevant item in the top-k;
If I understood you correctly you say "order only matters in case of irrelevant predicted items". Is it correct?
otherwise we don't need to consider the order because the user will anyway interact with all of them
This definitely is probably something that is a "version" of nDCG implemented only for this repo and does not align with many use cases out there. I think we should take a step back and revisit why are we trying to measure the goodness of the model with this metric anyway.
For example, if a user has seen 10 movies out of 1000 movies on Netflix then it is not fair to assume that the user "is going to interact anyway" with all those 900 movies that he has not watched. In that case, it is very important that the sorting of the recommended movies are "correct".
For evaluation purposes, we sort the ground truth, let's say, 200 of the remaining 900 movies according to the ratings (explicit feedback) given by the user and that is how the "actual relevance" of each movie is taken into account. We ask, any general model/algorithm, to tell us which movies should the user watch today (first), next week (second), and so on for these 200 movies based on its own internal interpretation/representation of importance expressed as "predicted ratings". Now as we already know in what sequence/order the user would "actually" watch the movies we wish to compare the two ordered lists/sets and our goal is to quantify the difference. There are many ways to quantify it of which nDCG is one.
So if we are unable to measure this "goodness" with the implementation of the nDCG done in this repo then I can't think of any good reason why this implementation should be used anyway especially considering the fact that the "hit" related ratings (as shown in the example notebook and in the screenshot) do exactly the same thing.
In the light of these points I would strongly suggest you please take the sklearn.metrics.ndcg_score as reference instead of Spark MLlib as they have also taken into account the relevant papers (follow the discussion thread here) OR use it directly under the hood and just have a wrapper that handles the spark/pandas dataframes.
A sample implementation from my end would be the following
from math import log2
from typing import List
from typing import Literal
from typing import Optional
from typing import Union
import pandas as pd
from sklearn.metrics import ndcg_score
def _ndcg(
true: List[Union[str, int]],
pred: List[Union[str, int]],
ratings: Optional[List[Union[int, float]]] = None,
k: int = 3,
form: Literal["linear", "exp", "identity"] = "linear",
) -> float:
k_ = min(len(true), len(pred), k)
if ratings:
assert len(ratings) >= k_, "All true labels must have a rating"
assert ratings == sorted(ratings, reverse=True), "Ratings must be sorted in descending order"
rel_ = lambda x: ratings[true.index(x)] if x in true else 0 # noqa
else:
# make "rank" as the rating - useful in case of implicit feedback cases
rel_ = (
lambda x: (len(true) - true.index(x) + 1) if x in true else 0
) # noqa
if form == "exp":
rel = lambda x: 2 ** rel_(x) - 1 # noqa
elif form == "identity":
# this is the current implementation
rel = lambda x: 1.0 if x in true else 0.0 # noqa
else:
rel = rel_
disc = lambda i: log2(i + 2) # noqa
dcg_at_k_ = ideal_dcg_at_k_ = 0
for i, (ele, ele2) in enumerate(zip(pred[:k_], true[:k_])):
dcg_at_k_ += rel(ele) / disc(i) # type: ignore
ideal_dcg_at_k_ += rel(ele2) / disc(i) # type: ignore
return dcg_at_k_ / ideal_dcg_at_k_
def ndcg(
df_true: pd.DataFrame,
df_pred: pd.DataFrame,
user_col_name: str = "user_id",
item_col_name: str = "item_id",
rating_col_name: str = "rating",
prediction_col_name: str = "prediction",
top_k: int = 3,
form: Literal["linear", "exp", "identity"] = "linear",
consider_actual_ratings: bool = False,
show_debug_output: bool = False,
) -> float:
def agg_sort(df: pd.DataFrame, sort_col_name: str):
return (
df.sort_values(by=[user_col_name, sort_col_name], ascending=False)
.groupby(user_col_name)
.agg({item_col_name: list, sort_col_name: list})
.rename(
columns={
item_col_name: "pred" if "prediction" in sort_col_name else "true"
}
)
)
df_pred_ = agg_sort(df_pred, prediction_col_name)
df_true_ = agg_sort(df_true, rating_col_name)
df_final_ = df_true_.join(df_pred_, on=user_col_name, how="inner")
if consider_actual_ratings:
df_final_["ndcg"] = df_final_.apply(
lambda x: _ndcg(x["true"], x["pred"], x["rating"], k=top_k, form=form),
axis=1,
)
else:
df_final_["ndcg"] = df_final_.apply(
lambda x: _ndcg(x["true"], x["pred"], k=top_k, form=form), axis=1
)
if show_debug_output:
print(df_final_)
return df_final_["ndcg"].mean()
if __name__ == "__main__":
df_true = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": ["a", "b", "c"],
"rating": [5, 4, 3],
}
)
df_pred = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": ["a", "b", "c"],
"prediction": [13.1, 14.2, 12.1],
}
)
assert round(ndcg(
df_true,
df_pred,
form="linear",
consider_actual_ratings=True
), 6) == round(
ndcg_score(
y_true = [[5, 4, 3]],
y_score= [[13.1, 14.2, 12.1]]
),
6
)
Ofcourse, there are many corner cases that have not been taken care of in the above implementation but it is quite simple to understand and is consistent with other benchmark implementations out there (sklearn in this case). Hope it makes things clearer also would be happy to contribute if needed.
@yueguoguo
I am struggling to understand these two statements
If the top-k results contain the irrelevant items then order matters because we want to know how "high" the position is with regard to the predicted relevant item in the top-k;
If I understood you correctly you say "order only matters in case of irrelevant predicted items". Is it correct?
otherwise we don't need to consider the order because the user will anyway interact with all of them
This definitely is probably something that is a "version" of nDCG implemented only for this repo and does not align with many use cases out there. I think we should take a step back and revisit why are we trying to measure the goodness of the model with this metric anyway.
For example, if a user has seen 10 movies out of 1000 movies on Netflix then it is not fair to assume that the user "is going to interact anyway" with all those 900 movies that he has not watched. In that case, it is very important that the sorting of the recommended movies are "correct".
For evaluation purposes, we sort the ground truth, let's say, 200 of the remaining 900 movies according to the ratings (explicit feedback) given by the user and that is how the "actual relevance" of each movie is taken into account. We ask, any general model/algorithm, to tell us which movies should the user watch today (first), next week (second), and so on for these 200 movies based on its own internal interpretation/representation of importance expressed as "predicted ratings". Now as we already know in what sequence/order the user would "actually" watch the movies we wish to compare the two ordered lists/sets and our goal is to quantify the difference. There are many ways to quantify it of which nDCG is one.
So if we are unable to measure this "goodness" with the implementation of the nDCG done in this repo then I can't think of any good reason why this implementation should be used anyway especially considering the fact that the "hit" related ratings (as shown in the example notebook and in the screenshot) do exactly the same thing.
In the light of these points I would strongly suggest you please take the sklearn.metrics.ndcg_score as reference instead of Spark MLlib as they have also taken into account the relevant papers (follow the discussion thread here) OR use it directly under the hood and just have a wrapper that handles the spark/pandas dataframes.
Quick response - I agree - this is in line with my points in the earlier response. That is, if ratings are considered in relevance, we should involve them in calculating nDCG and that gives us a more in-depth insight about the recommendation quality w.r.t user preferences, i.e., ratings. NOTE we still want to keep hits-based approach because not all the times explicit ratings are obtainable and reliable.
I am thinking of an extension of the existing implementation. Gratitude to your reply with codes where a possible implementation is proposed 😄 I guess something we can consider here is to make it generic, e.g., the function ndcg
offers a parameter that allows the user to apply either hits-based calculation or rating-based calculation, depending on whether rating is available or not.
@miguelgfierro @gramhagen @anargyri you may want to take a look at the thread.
@yueguoguo
Thank you for your response. I would like to add one point regarding
e.g., the function
ndcg
offers a parameter that allows the user to apply either hits-based calculation or rating-based calculation, depending on whether rating is available or not.
For that we can simply pass form="identity"
in my implementation. Looking forward to more feedback. :)
As @yueguoguo said, we use Spark MLLib definitions, a similar question arose with other metrics like MAP, see #1702. The main thing is that in the absence of any relevance information one needs to use the constant value 1. And it is not known how the data scientist will compute relevance in their specific scenario, it could be based on ratings or it could be something different. In fact, ratings are not available in all recommendation scenarios. I view this as a potential feature to add i.e. we may want to include an optional dataframe argument for the relevance values in the metrics definitions, with the default being the current computation.
@anargyri
Thanks a lot for participating in this thread. Regarding the point that you raised -
And it is not known how the data scientist will compute relevance in their specific scenario, it could be based on ratings or it could be something different
I think for the datasets (that is the Movielens in most cases) where we are not in an "implicit feedback" scenario and already have the ratings from the user it is quite fair to assume that we use these ratings to sort the ground truth list of items. More so when we are using this dataset to benchmark most SOTA reco methods.
In fact, ratings are not available in all recommendation scenarios.
I totally agree and this is indeed the case with my current projects in my job right now.
If no objections, I will close this issue since the above PR has been merged. @miguelgfierro @loomlike
Yeah
Description
The function
ndcg_at_k
returns incorrect values.For example, using the example described in https://en.wikipedia.org/wiki/Discounted_cumulative_gain:
As described in the wiki page, the correct value of NDCG@6 should be 0.785 but
ndcg_at_k
returns 1.0The cause could be that in the function's body, DCG and NDCG are calculated always using
1
as numerator while the formula requires to use the true relevance score. For example this is how DCG is calculated:df_dcg["dcg"] = 1 / np.log1p(df_dcg["rank"])
In which platform does it happen?
MacOS Python 3.9.12 Recommenders 1.1.0
How do we replicate the issue?
Expected behavior (i.e. solution)
For the example described above
ndcg_at_k
should return 0.785Other Comments