tensorflow/recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

[Question] Difference between top_k_categorical_accuracy_at_100 and recall at 100 #617

Open OmarMAmin opened 1 year ago

OmarMAmin commented 1 year ago

Hi Team,

Thanks for the awesome discussions. I've been learning a lot from the issues here, and many of them have improved the accuracy of my work. I'm wondering whether there's a big difference in how these two metrics are calculated: I get around 0.9 for top_k_categorical_accuracy_at_100 during training, but only around 0.35 when I run an offline evaluation using recall at 100.

From my understanding, top_k_categorical_accuracy_at_100 is the fraction of interactions for which the ground-truth candidate is ranked in the top k of the in-batch candidates.

If my batch size is 1028, I assume that a top_k_categorical_accuracy_at_100 of 0.9 means that, out of 1028 random candidates, the right one lands in the top 100 in 90% of cases. Am I missing something?
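A minimal NumPy sketch of that in-batch computation, assuming a `scores` matrix of query-candidate similarities where the positive pair for query i sits on the diagonal (the function name and setup are illustrative, not the library's internals):

```python
import numpy as np

def in_batch_top_k_accuracy(scores: np.ndarray, k: int = 100) -> float:
    """scores[i, j] = similarity of query i to in-batch candidate j;
    the true candidate for query i sits on the diagonal (j == i)."""
    # Rank of the true candidate = number of candidates scoring strictly higher.
    true_scores = np.diag(scores)[:, None]
    ranks = (scores > true_scores).sum(axis=1)
    # A hit when the true candidate lands inside the top k.
    return float((ranks < k).mean())

# 1028 queries scored against the same 1028 in-batch candidates.
rng = np.random.default_rng(0)
scores = rng.normal(size=(1028, 1028))
print(in_batch_top_k_accuracy(scores, k=100))  # ~100/1028 ~ 0.10 for random scores
```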

This is very close to the definition of recall I'm calculating: recall at 100 = |intersection(user future interactions, top 100 candidates for this user embedding)| / |user future interactions|.
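A sketch of that recall computation (a hypothetical helper; the top-100 list would come from scoring the user embedding against all unique candidates):

```python
def recall_at_k(future_interactions: set, top_k_candidates: list) -> float:
    """recall@k = |future interactions ∩ top-k candidates| / |future interactions|"""
    if not future_interactions:
        return 0.0
    hits = future_interactions & set(top_k_candidates)
    return len(hits) / len(future_interactions)

# e.g. the user later interacts with 4 restaurants and 1 appears in the top 100
print(recall_at_k({"r1", "r2", "r3", "r4"}, ["r2", "r9", "r17"]))  # 0.25
```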

The key differences are, for top_k_categorical_accuracy_at_100:

  1. Candidates are selected randomly and I have a total of 3000 candidates, so many duplicates will appear in a batch of 1028, and the results will be optimistic (the number of unique candidates is much smaller; see the sketch after this list).
  2. Not all candidates (restaurants) can deliver to all queries (users), so the right candidates are a subset of all candidates. If the model has learnt that, it can easily rule out the candidates that can't interact with the query, which makes the classifier's job easier.
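To get a feel for how much duplication point 1 implies, here's a quick sketch comparing uniform sampling with a hypothetical popularity-skewed distribution over 3000 candidates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, batch_size = 3000, 1028

# Uniform sampling with replacement: E[uniques] = n * (1 - (1 - 1/n)^batch)
expected_uniques = n_candidates * (1 - (1 - 1 / n_candidates) ** batch_size)
print(round(expected_uniques))  # ~870 unique candidates in 1028 slots

# Popularity-skewed (Zipf-like) sampling shrinks the unique set much further.
popularity = 1 / np.arange(1, n_candidates + 1)
popularity /= popularity.sum()
batch = rng.choice(n_candidates, size=batch_size, p=popularity)
print(len(np.unique(batch)))  # typically only a few hundred uniques
```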

On the other hand, for the recall at 100 calculation:

  1. I only consider the candidates (restaurants) that can deliver to the user, so it's a harder problem.
  2. There are no duplicates, since I only deal with unique candidates.

But I wouldn't expect the results to be that different: around 90% for top_k_categorical_accuracy_at_100 versus 35% for recall at 100.

patrickorlando commented 1 year ago

Hi @OmarMAmin,

Without knowing the frequency distribution of your candidates, and given your delivery constraints, it's hard for me to have any intuition on this.

I would perhaps first compare a single metric (top-k accuracy) in-batch vs. over unique candidates. If these differ significantly, I'd look at some examples and work from there.
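For reference, one way to run that comparison in TFRS is to score queries against the full set of unique candidates with a brute-force index. This is a sketch where `user_model`, `restaurant_model`, `restaurants`, `test_queries`, and `true_ids` are placeholders for your own models and data:

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

# Build a brute-force index over ALL unique candidates (no in-batch duplicates).
index = tfrs.layers.factorized_top_k.BruteForce(user_model, k=100)
index.index_from_dataset(
    restaurants.batch(512).map(lambda r: (r["restaurant_id"], restaurant_model(r)))
)

# Retrieve the top 100 for each test query and check whether the true
# restaurant made the cut -- top-100 accuracy over unique candidates.
_, top_ids = index(test_queries)                      # [num_queries, 100]
hits = tf.reduce_any(tf.equal(top_ids, true_ids[:, None]), axis=1)
print(tf.reduce_mean(tf.cast(hits, tf.float32)).numpy())
```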