OmarMAmin opened 1 year ago
Hi @OmarMAmin,
Not knowing the frequency distribution of your candidates, along with the fact that you have delivery constraints, makes it hard for me to have any intuition on this.
I would perhaps first compare a single metric (top_k accuracy) computed in-batch vs. over the unique candidates. If the two differ significantly, I'd look at some examples and work from there.
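For instance, a minimal NumPy sketch of that comparison could look like this (toy random embeddings; `query_emb`, `pos_emb`, `corpus_emb` and the sizes are placeholders, not anything from your actual setup):

```python
import numpy as np

# Toy setup: one (query, true-item) pair per interaction, plus a larger
# corpus standing in for the full set of unique candidates.
rng = np.random.default_rng(0)
batch, corpus_size, dim, k = 1028, 10_000, 64, 100

query_emb = rng.normal(size=(batch, dim))
pos_emb = rng.normal(size=(batch, dim))        # embedding of each interaction's true item
corpus_emb = rng.normal(size=(corpus_size, dim))

def top_k_hit_rate(queries, positives, candidates, k):
    """Fraction of queries whose true item would rank in the top k of `candidates`."""
    pos_scores = np.sum(queries * positives, axis=1)      # score of the true item
    cand_scores = queries @ candidates.T                  # scores of all candidates
    ranks = np.sum(cand_scores > pos_scores[:, None], axis=1)
    return float(np.mean(ranks < k))

in_batch = top_k_hit_rate(query_emb, pos_emb, pos_emb, k)        # ~1k in-batch candidates
full_corpus = top_k_hit_rate(query_emb, pos_emb, corpus_emb, k)  # all unique candidates
print(f"top-{k} accuracy in-batch: {in_batch:.3f}  vs. full corpus: {full_corpus:.3f}")
```

With random embeddings both numbers are just chance level, but on real embeddings the gap between the two usually tells you how much the small in-batch candidate pool inflates the metric.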
Hi Team,
Thanks for the awesome discussions; I've been learning a lot from the issues here, and many of them have improved the accuracy of the work I'm doing. I'm wondering whether there is a big difference in how these two metrics are calculated: I'm able to get around 0.9 top_k_categorical_accuracy_at_100, but when I do offline evaluation using recall at k I get around 0.35.
From my understanding, top_k_categorical_accuracy_at_100 is count(interactions where the ground truth was ranked in the top k) / count(interactions).
If my batch size is 1028, I assume a top_k_categorical_accuracy_at_100 of 0.9 means that, out of 1028 random candidates, the right one lands in the top 100 in 90% of cases. Am I missing something?
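For concreteness, here is roughly what I mean (a toy sketch with random embeddings, assuming the metric is computed against in-batch candidates; this is not the library's internal code):

```python
import tensorflow as tf

batch_size, dim, k = 1028, 64, 100
query_emb = tf.random.normal((batch_size, dim))
item_emb = tf.random.normal((batch_size, dim))   # true item for each row

# Score every query against every in-batch item; the "label" for row i is item i.
scores = tf.matmul(query_emb, item_emb, transpose_b=True)   # (batch, batch)
labels = tf.eye(batch_size)

hits = tf.keras.metrics.top_k_categorical_accuracy(labels, scores, k=k)
print(float(tf.reduce_mean(hits)))  # share of rows whose true item is in the top 100 of 1028
```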
This is very close to the definition of recall I'm calculating: recall at 100 = len(intersection(user future interactions, top 100 candidates for this user embedding)) / len(user future interactions).
The key differences I see are:
- For top_k_categorical_accuracy_at_100, each interaction has a single ground-truth item, and it only has to beat the other candidates that happen to be in the same batch (~1028 random items).
- For the recall at 100 calculation, the top 100 candidates are retrieved from the full set of unique items, and the result is averaged over all of a user's future interactions (as sketched below).
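Here is roughly how I compute the offline recall at 100 (again a toy sketch; `user_emb`, `item_emb`, and `future_interactions` are placeholders for my own data, and I brute-force the scoring instead of using an ANN index):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim, k = 500, 10_000, 64, 100

user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))
# future_interactions[u] = ids of items user u interacted with after the training cutoff
future_interactions = [set(rng.choice(n_items, size=20, replace=False)) for _ in range(n_users)]

scores = user_emb @ item_emb.T                        # score every user against every unique item
top_k = np.argpartition(-scores, k, axis=1)[:, :k]    # (unordered) top-100 item ids per user

recall_at_k = [
    len(set(top_k[u]) & future_interactions[u]) / len(future_interactions[u])
    for u in range(n_users)
]
print("mean recall@100:", float(np.mean(recall_at_k)))
```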
But I wouldn't expect the results to be that different: 95% for top_k_categorical_accuracy_at_100 vs. 35% for recall at 100.