recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

Is the 'mrr_score' implementation correct? #2141

Open johanneskruse opened 1 month ago

johanneskruse commented 1 month ago

Hi,

I was recently using the mrr_score implementation (link):

def mrr_score(y_true, y_score):
    """Computing mrr score metric.

    Args:
        y_true (np.ndarray): Ground-truth labels.
        y_score (np.ndarray): Predicted labels.

    Returns:
        numpy.ndarray: mrr scores.
    """
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order)
    rr_score = y_true / (np.arange(len(y_true)) + 1)
    return np.sum(rr_score) / np.sum(y_true)

I am not sure if I've misunderstood the current implementation, but as far as I can see, it does not account for situations where there are multiple positive examples in one sample:

>>> mrr_score([1, 0, 0], [1, 0, 0])
1.0
>>> mrr_score([1, 1, 0], [1, 1, 0])
0.75
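
To spell out where the 0.75 comes from: with y_true = [1, 1, 0] already in ranked order, rr_score works out to [1/1, 1/2, 0], and dividing its sum by the number of positives averages the reciprocal ranks of all positives instead of taking only the rank of the first relevant item. A step-by-step reproduction of the function body on that example:

import numpy as np

# Reproduce mrr_score([1, 1, 0], [1, 1, 0]) step by step.
y_true = np.array([1, 1, 0])
y_score = np.array([1, 1, 0])

order = np.argsort(y_score)[::-1]       # indices sorted by descending score
y_true_ranked = np.take(y_true, order)  # labels in ranked order -> [1, 1, 0]
rr_score = y_true_ranked / (np.arange(len(y_true_ranked)) + 1)  # [1.0, 0.5, 0.0]

print(np.sum(rr_score) / np.sum(y_true_ranked))  # 0.75, whereas MRR for this sample is 1.0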

Furthermore, according to the documentation the input y_score should be the "Predicted labels"; however, what we are really interested in is the rank of the positive item within a given sample (MRR-wiki).
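
For reference, the Wikipedia definition averages the reciprocal rank of the first relevant item over the query set:

MRR = (1 / |Q|) * sum_{i=1}^{|Q|} 1 / rank_i

where rank_i is the position of the first relevant item for query i.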

My suggestion is:

import numpy as np

def reciprocal_rank_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Reciprocal rank of the first relevant item in a single sample."""
    # Rank items by descending predicted score.
    order = np.argsort(y_pred)[::-1]
    y_true = np.take(y_true, order)
    # 1-based position of the highest-ranked positive item.
    first_positive_rank = np.argmax(y_true) + 1
    return 1.0 / first_positive_rank

>>> y_true_1 = np.array([0, 0, 1])
>>> y_pred_1 = np.array([0.5, 0.2, 0.1])
>>> reciprocal_rank_score(y_true_1, y_pred_1)
0.3333333333333333

>>> y_true_2 = np.array([0, 1, 1])
>>> y_pred_2 = np.array([0.5, 0.2, 0.1])
>>> reciprocal_rank_score(y_true_2, y_pred_2)
0.5

>>> y_true_3 = np.array([1, 1, 0])
>>> y_pred_3 = np.array([0.5, 0.2, 0.1])
>>> reciprocal_rank_score(y_true_3, y_pred_3)
1.0

>>> np.mean([reciprocal_rank_score(y_true, y_pred) for y_true, y_pred in zip([y_true_1, y_true_2, y_true_3], [y_pred_1, y_pred_2, y_pred_3])])
0.611111111111111
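
For completeness, here is a minimal sketch of how the per-sample scores could be rolled up into a dataset-level MRR; the helper name mean_reciprocal_rank and its list-of-arrays signature are my own suggestion, not part of the current API:

from typing import Sequence

import numpy as np

def mean_reciprocal_rank(y_trues: Sequence[np.ndarray], y_preds: Sequence[np.ndarray]) -> float:
    # Average the per-sample reciprocal ranks (reciprocal_rank_score as defined above),
    # which matches the manual np.mean call in the example.
    per_sample = [reciprocal_rank_score(t, p) for t, p in zip(y_trues, y_preds)]
    return float(np.mean(per_sample))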

The original implementation seems correct if it is given the rankings rather than the predicted labels, and all items are assumed to be positive, as in my example:

>>> mrr_score([1, 1, 1], [3, 2, 1])
0.611111111111111

But then, y_true is not a needed input.
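
Indeed, with an all-ones y_true the computation reduces to (1/1 + 1/2 + ... + 1/n) / n regardless of the scores, so the labels add nothing. A quick check with scores I picked arbitrarily:

>>> mrr_score([1, 1, 1], [0.9, 0.1, 0.5])
0.611111111111111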

If I haven't misunderstood and you agree, I would be happy to make a PR with the suggested improvements.

I followed the example used in the Medium post: MRR vs MAP vs NDCG: Rank-Aware Evaluation Metrics And When To Use Them (behind paywall).

Thanks for the awesome repo!

miguelgfierro commented 1 month ago

@Leavingseason for visibility