tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

[Question] Loss computation in movielens examples #643

Open dibya-pati opened 1 year ago

dibya-pati commented 1 year ago

Hi, I'm trying to understand the loss computation for the MovieLens retrieval example. In the MovieLens dataset there are ~900 users and ~1600 movies. When we train the two-tower model on user (U_A) - item (I_A) pairs, we treat only the current U_A-I_A pair as positive (using `tf.eye()` for the labels; a minimal sketch of this setup follows the list below) and penalize every other U_A-I_B combination (B != A) in the batch. My questions are:

  1. A single user has typically interacted with multiple items, so by penalizing all other user-item pairs in the batch we are also penalizing some U_A-I_B pairs that are actually positive.
  2. As the batch size increases, the contribution from the positive pairs shrinks significantly relative to the negatives.
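
For reference, here is a minimal sketch of the in-batch setup the question refers to: identity labels from `tf.eye()` and a softmax cross-entropy over the batch. The embeddings below are random placeholders; in the actual example this computation is handled inside `tfrs.tasks.Retrieval`.

```python
import tensorflow as tf

batch_size, dim = 4, 32

# Placeholder tower outputs for a batch of observed (user, movie) pairs.
user_embeddings = tf.random.normal([batch_size, dim])   # query tower
movie_embeddings = tf.random.normal([batch_size, dim])  # candidate tower

# logits[i, j]: affinity of user i with the movie that user j interacted with.
logits = tf.matmul(user_embeddings, movie_embeddings, transpose_b=True)

# Only the diagonal entries (the observed pairs) are labelled positive;
# every off-diagonal pair in the batch is treated as a negative.
labels = tf.eye(batch_size)

loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.SUM
)(labels, logits)
```

With this construction, every off-diagonal pair in the batch is treated as a negative regardless of whether that user ever interacted with that movie, which is exactly the concern raised in point 1.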
OmarMAmin commented 1 year ago

Hi @dibya-pati, I encourage you to read through the other questions; you'll find answers there.

  1. It acts as a regularization effect. You can also add features to the user tower (e.g. previously watched movies and other context features) so the input differs per prediction; otherwise the model would see the same input and be expected to predict movie A in one example and movie B in another. Note that users similar to you will also push your positive item away when it appears as their in-batch negative. If you want to avoid that, you can implement your own negative sampling (a rough sketch of one way to mask such false negatives follows this list), but it won't be as efficient as the current in-batch implementation. Many papers also report that this sampling approach works well across different datasets.
  2. It's a hyperparameter: more negatives make the prediction task harder, and you need to balance it so the task is neither too hard nor too easy.
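
As a rough illustration of the custom negative sampling / masking mentioned in point 1, here is a sketch that drops in-batch "negatives" which are known positives for that user. The helper `masked_in_batch_loss`, its arguments, and `known_positives` are hypothetical and not part of the TFRS API; this is just one way it could be done.

```python
import tensorflow as tf

def masked_in_batch_loss(user_embeddings, movie_embeddings,
                         user_ids, movie_ids, known_positives):
    """Hypothetical helper: in-batch softmax loss that ignores "negatives"
    which are actually known positives for that user.

    user_ids, movie_ids: Python lists of ids for the current batch.
    known_positives: set of (user_id, movie_id) pairs from the training data.
    """
    batch_size = len(user_ids)
    logits = tf.matmul(user_embeddings, movie_embeddings, transpose_b=True)
    labels = tf.eye(batch_size)

    # True where user i has interacted with the movie in column j,
    # excluding the diagonal positives themselves.
    false_negative = tf.constant(
        [[(u, m) in known_positives for m in movie_ids] for u in user_ids]
    ) & ~tf.cast(labels, tf.bool)

    # Push masked logits towards -inf so they drop out of the softmax.
    logits = tf.where(false_negative, tf.fill(tf.shape(logits), -1e9), logits)
    return tf.keras.losses.CategoricalCrossentropy(from_logits=True)(labels, logits)
```

In practice you would precompute `known_positives` from the training interactions; building the mask in Python per batch is only illustrative and would be slower than the vectorised in-batch loss TFRS uses.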
OmarMAmin commented 1 year ago

This figure compares different loss functions: some use a pointwise loss, others a pairwise loss (each positive has a corresponding negative), and sampled softmax appears to perform better across tasks than these other losses. [attached figure: comparison of loss functions across tasks]
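
For context, here is a minimal sketch of sampled softmax in TensorFlow using `tf.nn.sampled_softmax_loss`; the catalogue size, sample count, and variable names are illustrative assumptions, not taken from the example.

```python
import tensorflow as tf

num_movies, dim, batch_size = 1682, 32, 4

# Output embedding matrix and biases, one row per movie in the catalogue.
movie_weights = tf.Variable(tf.random.normal([num_movies, dim]))
movie_biases = tf.Variable(tf.zeros([num_movies]))

# User-tower outputs and the id of the movie each user actually watched.
user_embeddings = tf.random.normal([batch_size, dim])
watched_movie_ids = tf.constant([[10], [42], [7], [1000]], dtype=tf.int64)

# Instead of normalising over all movies, sample a handful of negatives
# per example and compute the softmax only over those.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=movie_weights,
        biases=movie_biases,
        labels=watched_movie_ids,
        inputs=user_embeddings,
        num_sampled=64,
        num_classes=num_movies,
    )
)
```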