houghtonweihu opened 1 year ago
We know that the retrieval model is trained with in-batch negative sampling, which treats other users' positive items in the same batch as the current user's negatives, so these negatives are only an approximation of true negatives. The ranking model is trained with MSE to predict ratings. Neither metric directly measures the quality of the movie recommendations, yet the recommendations are what actually matter. Is it possible for the training metrics to improve while the actual movie recommendations get worse?
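For concreteness, here is a minimal NumPy sketch of what in-batch negative sampling computes; this is my own illustration with random embeddings, not TFRS's actual implementation:

```python
import numpy as np

# Sketch of in-batch softmax loss: each query's positive item is scored
# against the other positives in the same batch, which act as approximate
# negatives. Random embeddings, illustration only.
rng = np.random.default_rng(0)
batch_size, dim = 4, 8
query_emb = rng.normal(size=(batch_size, dim))  # user/query tower output
item_emb = rng.normal(size=(batch_size, dim))   # item tower output

# Score matrix: entry (i, j) scores query i against item j.
scores = query_emb @ item_emb.T                 # (batch_size, batch_size)

# The true item for query i is item i, so the label matrix is the identity;
# off-diagonal entries are the in-batch negatives.
labels = np.eye(batch_size)

# Softmax cross-entropy per row, with the usual max-shift for stability.
shifted = scores - scores.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -(labels * log_probs).sum(axis=1).mean()
print(loss)
```

The approximation the question refers to is visible here: the negatives are whatever items happen to share the batch, not items the current user has actually rejected.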
In the TensorFlow Recommenders tutorials, top_k_categorical_accuracy is used to evaluate the retrieval model, and MSE to evaluate the ranking model. Are there examples showing that better values of these evaluation metrics translate into better movie recommendations on the MovieLens dataset?
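To make the retrieval metric concrete, here is a tiny sketch of top-k categorical accuracy: the fraction of queries whose true item appears among the k highest-scored candidates. The scores and labels below are made up for illustration:

```python
import numpy as np

scores = np.array([
    [0.9, 0.1, 0.0],   # query 0's scores over 3 candidate items
    [0.2, 0.8, 0.5],   # query 1
    [0.1, 0.3, 0.6],   # query 2
])
true_item = np.array([0, 0, 2])  # index of each query's true item
k = 2

# Indices of the k top-scoring candidates per query.
top_k = np.argsort(-scores, axis=1)[:, :k]

# A query counts as a hit if its true item is among those k.
hits = (top_k == true_item[:, None]).any(axis=1)
accuracy = hits.mean()
print(accuracy)  # 2 of 3 queries hit -> 0.666...
```

This measures whether held-out positives are ranked highly, which is related to, but not the same as, users actually liking the top-k list served in production.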