tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

Define model with explicit positive feedback and explicit negative feedback #428

Open MNMaqsood opened 2 years ago

MNMaqsood commented 2 years ago

Hey, I am working on a problem where I have 4 different feedback signals. I am trying to recommend new videos in a playlist to a user.

My dataset is such that I have pre-defined playlist categories, with each category having a predefined set of possible videos. The problem is to recommend new videos to a user. A user can watch a video, like a video after watching it, skip a video in the middle, or abandon the playlist and move on to a new one.

My goal is to recommend the videos that the user watched or liked, and to avoid the videos that the user skipped or after which the user left the playlist. So I have four feedback signals available at hand:

  1. Implicit positive feedback, given by watched videos
  2. Explicit positive feedback, given at the end of a video
  3. Explicit negative feedback, given by skipping a video
  4. Explicit negative feedback, given by abandonment of the playlist

Now, how should I go about structuring the model?

Following #139, it's clear that I can define two retrieval tasks: the first for the implicit positive feedback and the second for the explicit positive feedback.

But how should I use explicit negative feedback? I looked into #232 but that didn't provide an answer. I was thinking of doing
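For the two positive signals, a minimal sketch of what a two-task retrieval model could look like (the user_model/video_model towers, the feature names, and the is_liked flag are assumptions for illustration, not code from this thread):

import tensorflow as tf
import tensorflow_recommenders as tfrs

class TwoSignalRetrievalModel(tfrs.Model):
  """Sketch: one retrieval task per positive signal, joined in a weighted loss."""

  def __init__(self, user_model, video_model, watch_weight=1.0, like_weight=1.0):
    super().__init__()
    self.user_model = user_model          # assumed query tower
    self.video_model = video_model        # assumed candidate tower
    self.watch_weight = watch_weight
    self.like_weight = like_weight
    self.watch_task = tfrs.tasks.Retrieval()  # implicit positives (watches)
    self.like_task = tfrs.tasks.Retrieval()   # explicit positives (likes)

  def compute_loss(self, features, training=False):
    user_embeddings = self.user_model(features["user_id"])
    video_embeddings = self.video_model(features["video_id"])
    # Every row is a watch; the 0/1 `is_liked` weight restricts the like
    # task to the rows that carry the explicit positive signal.
    watch_loss = self.watch_task(user_embeddings, video_embeddings)
    like_loss = self.like_task(
        user_embeddings, video_embeddings,
        sample_weight=features["is_liked"])
    return self.watch_weight * watch_loss + self.like_weight * like_loss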

MNMaqsood commented 2 years ago

@maciejkula any feedback on this issue?

patrickorlando commented 2 years ago

The general approach is to train the retrieval model on only positive interactions, and to train a ranking model that predicts satisfaction based on positive and negative feedback.

In the case of the YouTube recommender this ranking model predicts expected watch time.

You can try to follow the multi-task tutorial, or failing that you can just train two separate models.

At inference time you retrieve a set of candidates then score them with the ranking model.
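As a rough sketch of that inference flow (the model and dataset names here are placeholders, not code from this issue): index the candidate embeddings for retrieval, pull a shortlist, then score it with the ranking model.

import tensorflow_recommenders as tfrs

# Stage 1: retrieval. Build a brute-force index over candidate embeddings.
index = tfrs.layers.factorized_top_k.BruteForce(retrieval_model.user_model)
index.index_from_dataset(
    videos.batch(128).map(lambda v: (v["video_id"], retrieval_model.video_model(v))))
_, shortlist_ids = index(user_features, k=100)

# Stage 2: ranking. Score the shortlist and sort by the predicted score.
scores = ranking_model({"user": user_features, "video_id": shortlist_ids})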

rlcauvin commented 2 years ago

@patrickorlando You wrote that the general approach is to train the retrieval model on only positive interactions, and the ranking model on both positive and negative feedback. You also mentioned the possibility of doing both with a model such as the one in the multi-task tutorial.

What if the positive interaction I am predicting is the user clicking an item, and the negative interaction is the user choosing not to click an item when given an option to do so?

In that case, is it possible to train a multi-task model, or do I need to separately train a retrieval model with just the positive interactions (clicks), and a ranking model with all the interactions (both clicks and non-clicks)?

Roger

patrickorlando commented 2 years ago

Hey @rlcauvin,

I think you are right in the sense that a multitask model won't be suitable and you'd need to train two separate models.

The crux of the problem is that we are using a categorical cross-entropy loss for the retrieval loss. If you pass a negative sample and set the label to 0, then all labels for that row will be zero, and the categorical cross-entropy will also be 0.

import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
y_pred = tf.constant([[2.105886, 1.3653879, 2.0193396, 1.5547794, 0.83698046]])

# A row with one positive label produces a normal loss.
y_true = tf.constant([[1, 0, 0, 0, 0]])
loss_fn(y_true, y_pred)
# 1.1790919

# A row with no positive label contributes nothing: the loss is exactly 0.
y_true = tf.constant([[0, 0, 0, 0, 0]])
loss_fn(y_true, y_pred)
# 0

So, for every row in your batch you need at least 1 positive. In some cases this may be possible. For example, the user sees two items at the same time, they click on one, and you can use the other as an explicit negative. Otherwise, I'm not sure how you could achieve it.
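To illustrate that two-items case (a hypothetical construction; user_emb, clicked_emb, and skipped_emb stand in for the relevant embedding tensors): each row's logits contain the positive and the explicit negative, so the label row still has exactly one 1 and the loss no longer collapses to 0.

import tensorflow as tf

# Per-row scores for the clicked item and the explicitly skipped item.
pos_scores = tf.reduce_sum(user_emb * clicked_emb, axis=1, keepdims=True)
neg_scores = tf.reduce_sum(user_emb * skipped_emb, axis=1, keepdims=True)

# Logits have shape (batch, 2); the positive is always column 0.
logits = tf.concat([pos_scores, neg_scores], axis=1)
labels = tf.concat(
    [tf.ones_like(pos_scores), tf.zeros_like(neg_scores)], axis=1)

# Every row now has exactly one positive label.
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(labels, logits)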

rlcauvin commented 2 years ago

Thanks, @patrickorlando.

My current recommender project is a binary classification problem: the user is presented with a single item and decides whether to click. I've been unable to get the retrieval model to learn.

I train the model using only the positive interactions (clicks), and I have set aside 20% of the positive interactions in a test dataset for validation and evaluation purposes. During training, the AUC and top-k categorical accuracy metrics appear to increase nicely, but the validation metrics computed at the end of each epoch stay flat.

I've tried different learning rates, optimizers (Adam and Adagrad), and loss functions (BinaryCrossentropy and CategoricalCrossentropy, with both True and False for from_logits), and the model just never learns anything when evaluated against the test data.

Roger

MNMaqsood commented 2 years ago

Hey @rlcauvin,

Can you elaborate on your dataset and how you are training your model? Things like the batch size also affect the performance of recommendation systems.

Also, I think it'll be better if you can add contextual features like user interests, age, time, etc., as they tend to have an effect on recommendations, and maybe you can try a session-based recommendation system.

Regarding your binary classification, I think you can train two separate models. The first will be a retrieval model based only on positive interactions, and in the second you can train the model to jointly optimize for both the positive and negative classes.
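For the second model, a minimal sketch of a binary ranking head could look like this (encode_features is a placeholder for your own user/item feature processing, not something defined in this thread):

import tensorflow as tf

class ClickRankingModel(tf.keras.Model):
  """Sketch: sigmoid head over a shared encoder, trained on clicks and non-clicks."""

  def __init__(self, encode_features):
    super().__init__()
    self.encode_features = encode_features  # assumed user+item feature encoder
    self.head = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1)])  # logit for P(click)

  def call(self, features):
    return self.head(self.encode_features(features))

model = ClickRankingModel(encode_features)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.AUC(from_logits=True)])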

rlcauvin commented 2 years ago

@MNMaqsood Thanks for the advice regarding context features and separate retrieval and ranking models. I've done both of those things, and the retrieval model just doesn't learn when evaluated against the test data.

I prepare the train and test data samples as follows:

# Keep only the positive interactions (clicks); the retrieval model is
# trained on positives alone.
positive_train_ds = train_ds.filter(lambda x: x["rating"] > 0)
cached_positive_train_ds = positive_train_ds.batch(8192).cache()

positive_test_ds = test_ds.filter(lambda x: x["rating"] > 0)
cached_positive_test_ds = positive_test_ds.batch(4096).cache()

The model training code is:

num_epochs = 50
validation_freq = 1

retrieval_model = RetrievalModel(
  vocabularies = vocabularies,
  query_feature_names = query_feature_names,
  candidate_feature_names = candidate_feature_names)

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
  initial_learning_rate = 0.0005,
  decay_steps = 544,
  decay_rate = .5,
  staircase = False)

retrieval_model.compile(optimizer = tf.keras.optimizers.Adam(lr_schedule))

retrieval_model_history = retrieval_model.fit(
    cached_positive_train_ds,
    validation_data = cached_positive_test_ds,
    validation_freq = validation_freq,
    epochs = num_epochs,
    verbose = 1)

FYI, I saw another question https://github.com/tensorflow/recommenders/issues/486#issuecomment-1115251396 where the advice mentioned including the candidate_ids parameter when invoking the retrieval task. I'm not quite sure how that works.
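For what it's worth, my understanding of that suggestion (a sketch, not confirmed code from #486): you pass the in-batch candidate ids to the task so it can mask "accidental" negatives, i.e. other rows in the batch whose candidate is the same item as this row's positive.

import tensorflow as tf
import tensorflow_recommenders as tfrs

class RetrievalModelWithIds(tfrs.Model):
  """Sketch of passing candidate_ids so in-batch duplicates aren't penalized."""

  def __init__(self, query_model, candidate_model):
    super().__init__()
    self.query_model = query_model
    self.candidate_model = candidate_model
    # remove_accidental_hits masks in-batch negatives that share the
    # positive item's id.
    self.task = tfrs.tasks.Retrieval(remove_accidental_hits=True)

  def compute_loss(self, features, training=False):
    query_embeddings = self.query_model(features)
    candidate_embeddings = self.candidate_model(features)
    return self.task(
        query_embeddings,
        candidate_embeddings,
        candidate_ids=features["item_id"])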

rlcauvin commented 1 year ago

At long last I'm following up on the topic of getting a retrieval model with query and candidate features to learn.

The keys were using tf.keras.losses.CategoricalCrossentropy(from_logits = True) for the loss function, and passing a candidates mapping from item_id to candidate_model in the tfrs.metrics.FactorizedTopK metric.

    # Use from_logits=True so the raw affinity scores are handled correctly.
    loss_calculator = tf.keras.losses.CategoricalCrossentropy(from_logits = True)
    # Map each candidate id to its embedding for the FactorizedTopK metric.
    candidates = unique_candidate_ds.batch(128).map(lambda c: (c['item_id'], self.candidate_model(c)))
    metrics = tfrs.metrics.FactorizedTopK(candidates = candidates)
    batch_metrics = [tf.keras.metrics.AUC(from_logits = True, name = "retrieval_auc")]
    self.task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
      loss = loss_calculator,
      metrics = metrics,
      batch_metrics = batch_metrics)

I still wonder whether there is a handy way to have the retrieval model learn from negatives in a binary classification scenario (e.g., where we present a single item to a user, and the user either clicks or does not click).

JV-Nunes commented 10 months ago

> The general approach is to train the retrieval model on only positive interactions, and to train a ranking model that predicts satisfaction based on positive and negative feedback.
>
> In the case of the YouTube recommender this ranking model predicts expected watch time.
>
> You can try to follow the multi-task tutorial, or failing that you can just train two separate models.
>
> At inference time you retrieve a set of candidates then score them with the ranking model.

@patrickorlando I have data from implicit ecommerce interactions, with click, view item, add to cart, and purchase events. I'm thinking about creating a multi-task model as follows: a retrieval task trained on the interactions, plus a ranking task that regresses a continuous rating built by summing each user-item pair's interaction events.

Does this type of architecture make sense to you? Or would two separate models perform better?

patrickorlando commented 10 months ago

Hi @JV-Nunes,

With tf-recommenders it should be easy to try both the multi-task and the separate models, but if you can't try both, I would initially develop each model separately. How you formulate the ranking objective is most important. If the ranking task isn't working well and the model isn't learning the correct signals, then training it as a multi-task alongside the retrieval will negatively affect the retrieval task.

I can't say for sure how you should design the task. You could try your continuous variable approach above, but it has drawbacks. Firstly, your data will skew heavily towards lower ratings, and MSE will not encourage the model to learn the high ratings (which are the ones you care about). It will be skewed far more than the MovieLens example, in which a user selects a single rating in one action, because in your case a high rating requires the user to complete a sequence of actions. You could try sample weighting.
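One possible way to implement that sample weighting (an assumption about your pipeline, assuming the dataset yields (features, rating) pairs and that ratings at or above some threshold are the rare high-value rows):

import tensorflow as tf

# Upweight the rare high-rating rows, e.g. 5x for anything at or above
# the threshold; 3.0 here is an illustrative cutoff, not a recommendation.
weighted_train_ds = train_ds.map(
    lambda features, rating: (
        features, rating, tf.where(rating >= 3.0, 5.0, 1.0)))

# Keras reads the third tuple element as the per-example sample weight.
ranking_model.fit(weighted_train_ds.batch(4096), epochs=5)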

Secondly, you need to work out how to evaluate the model. MSE doesn't clearly align with what's important: the probability that the user will purchase an item given that they viewed it.

Typically for ranking problems, you would like to predict the likelihood of the user performing an action: P(click | impression), P(bookmark | impression), or P(purchase | add_to_cart).

So alternatively you could handle this as a multi-task binary classification problem. You have a core model which is shared between all tasks, then you have a separate prediction head for each action. These heads are effectively predicting conditional probabilities such as P(click | impression), P(add_to_cart | click), and P(purchase | add_to_cart).

At inference time, you score the candidate items, obtain these predictions, and combine them into a score by which you then rank your candidates. Whilst this is intuitively clearer, in practice it is quite challenging to train, for two reasons:

  1. Each action depends on the previous action: in order to purchase, you first have to add to cart. Because of this, not all examples can be used to train all heads. If an example contains a click but no add-to-cart, it can only be used to train the click and add-to-cart heads; you would need to mask the loss and metrics for the purchase head. This adds complexity to your model (see the masking sketch after this list).
  2. You need to tune how to weight the loss from each head in the combined loss.
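A rough sketch of the masking in point 1 (all tensor names are illustrative; labels, logits, and masks are assumed to be float tensors of shape (batch,), with masks taking values 0 or 1):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction="none")

def masked_head_loss(labels, logits, mask):
  # Zero out examples where the prerequisite action never happened,
  # then average over the examples that remain.
  per_example = bce(labels[:, None], logits[:, None]) * mask
  return tf.reduce_sum(per_example) / tf.maximum(tf.reduce_sum(mask), 1.0)

# Click head: every impression is a valid example.
click_loss = masked_head_loss(clicked, click_logits, tf.ones_like(clicked))
# Add-to-cart head: only examples with a click.
cart_loss = masked_head_loss(added_to_cart, cart_logits, clicked)
# Purchase head: only examples with an add-to-cart.
purchase_loss = masked_head_loss(purchased, purchase_logits, added_to_cart)

# Point 2: the per-head weights below are hyperparameters you have to tune.
total_loss = click_loss + 0.5 * cart_loss + 0.5 * purchase_loss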

You could also train a separate model for each of these tasks, but data sparsity will be an issue.

In short, I can't say how successful your continuous ranking approach will be, or how hard it will be to successfully train the multi-task binary classification model.

JV-Nunes commented 10 months ago

@patrickorlando Thank you very much for the clarifications. I decided to pursue the idea of predicting the propensity to purchase, keeping separate models. What I did was add up the interactions between the user and the item on a specific day, and the label is whether or not they bought the item. What you mentioned makes perfect sense, and there is even an implementation of a multi-task model that provides click-through rate and post-click conversion rate, made by @caesarjuly: https://happystrongcoder.substack.com/p/entire-space-multi-task-model-an. I'll try this as well.
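For reference, the core of the ESMM idea from that post, as a minimal sketch under assumed names (shared_encoder, features, and the label tensors are placeholders, not @caesarjuly's actual code): two sigmoid heads over a shared encoder, with the conversion head supervised through the product pCTCVR = pCTR * pCVR over the full impression space.

import tensorflow as tf

# Assumed: `shared_encoder` embeds (user, item) features; labels are 0/1.
shared = shared_encoder(features)
p_ctr = tf.keras.layers.Dense(1, activation="sigmoid", name="ctr")(shared)
p_cvr = tf.keras.layers.Dense(1, activation="sigmoid", name="cvr")(shared)
p_ctcvr = p_ctr * p_cvr  # P(click & purchase | impression)

bce = tf.keras.losses.BinaryCrossentropy()
# Both losses are computed over all impressions, so the CVR head is trained
# without restricting the data to clicked examples (the "entire space" trick).
loss = bce(click_labels, p_ctr) + bce(purchase_labels, p_ctcvr)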