
Refactor ranking metric `map` to be the same as Spark's #2004

Closed (loomlike closed this 12 months ago)

loomlike commented 1 year ago

Description

  1. Refactor the ranking metric `map` to be the same as Spark's, in both the Python and Spark utilities (see the sketch below).

    • Rename `map_at_k` to `map`, which is basically the same as Spark's `meanAveragePrecision`.
    • Add a new MAP@K metric, `map_at_k`, to be the same as Spark's `meanAveragePrecisionAt`.
  2. Refactor the evaluation tests to use the same data fixtures between the Python and Spark tests.

Note: With these changes, some notebook tests are failing. I'll need to update the expected values of the `map_at_k` metric in those notebooks, or simply use `map`, which is our previous `map_at_k`.
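For context, here is a minimal, self-contained sketch of the intended per-user semantics, assuming the usual reading of Spark's `meanAveragePrecision` (no cutoff, normalized by the number of relevant items) and `meanAveragePrecisionAt` (truncated at `k`, normalized by `min(#relevant, k)`). The function names and signatures below are illustrative only and are not the actual utility signatures in this repo.

```python
from typing import Iterable, Sequence, Set


def average_precision(recommended: Sequence[str], relevant: Set[str]) -> float:
    """AP over the full list: sum precision at each hit rank, divide by #relevant.
    Mirrors the usual reading of Spark's meanAveragePrecision, i.e. what the
    renamed `map` metric is meant to match per the description above."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this hit rank
    return score / len(relevant)


def average_precision_at_k(recommended: Sequence[str], relevant: Set[str], k: int) -> float:
    """AP@k: same accumulation, but truncated at k and normalized by min(#relevant, k).
    Mirrors the usual reading of Spark's meanAveragePrecisionAt, i.e. what the
    new `map_at_k` metric is meant to match per the description above."""
    if not relevant or k <= 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)


def mean_over_users(per_user_scores: Iterable[float]) -> float:
    """MAP / MAP@K is just the mean of the per-user scores."""
    scores = list(per_user_scores)
    return sum(scores) / len(scores) if scores else 0.0
```

Since the previous `map_at_k` was effectively the untruncated version (see the discussion below), it corresponds to `average_precision` in this sketch, which is why renaming it to `map` lines up with Spark's naming.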

Related Issues

• #2002
• #1990
• #1702


loomlike commented 1 year ago

btw, why does a draft PR trigger tests? @miguelgfierro @anargyri Can we disable that?

miguelgfierro commented 1 year ago

> btw, why does a draft PR trigger tests? @miguelgfierro @anargyri Can we disable that?

This might be a solution: https://github.com/orgs/community/discussions/25722. We need to try it. Issue: #2005

anargyri commented 1 year ago

@loomlike are the test failures related to the values of the metrics? I vaguely recall that we had found some bugs in the Spark implementation (or they were not matching the definitions), hence our own implementations for some of the metrics.

loomlike commented 1 year ago

@anargyri I think the error was coming from the test group names (as I changed some of the test names, e.g. test_python_precision -> test_python_precision_at_k). I added a new commit to fix them in the test group lists. Let's see if that works.

Regarding MAP: Spark added `meanAveragePrecisionAt` in v3.0.0. So I renamed our `map_at_k` to just `map` and added a new `map_at_k` that implements the "at k" behavior in both the Python and Spark utils.

Regarding the definition of MAP vs. MAP@K: what we had before was actually just MAP, based on the Wikipedia definition as well as Spark's implementation.

The community seems to follow this definition:

> AP@k is typically calculated for one user from all recommendations by averaging P@1, P@2, P@3, ... P@k

which Spark's `meanAveragePrecisionAt` also follows. The code in this PR follows the same definition.
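As a concrete illustration (my own toy example, assuming Spark's normalization by min(#relevant, k) as I understand it): with relevant items {A, C} and the ranked list [B, A, D, C], the untruncated AP counts both hits and gives (1/2 + 2/4) / 2 = 0.5, while AP@3 only sees the hit at rank 2 and gives (1/2) / min(2, 3) = 0.25. Differences of this kind are what shift the notebook expected values mentioned in the description.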

anargyri commented 1 year ago

Good, as long as the evaluation notebooks give the same metrics as before we should be fine.

miguelgfierro commented 1 year ago

@loomlike if you get an error on the DCO check, you can go to the details and approve it. Only admins can do this. I've approved this one.

miguelgfierro commented 1 year ago

@loomlike the issue is fixed: https://github.com/recommenders-team/recommenders/pull/2017

Can you please merge the latest staging into your branch before committing?