tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

[Question]: Using other metrics such as `AUC`. #486

Open ydennisy opened 2 years ago

ydennisy commented 2 years ago

In much of the literature and guides outside of this project, AUC seems to be a popular metric for recommender systems. TensorFlow/Keras specifically has an implementation, tf.keras.metrics.AUC.

However, I am not sure on:

Thanks in advance!

entylop commented 2 years ago

AUC is one of the examples given in the documentation for the batch_metrics argument:

Metrics measuring how good the model is at picking out the true candidate for a query from other candidates in the batch. For example, a batch AUC metric would measure the probability that the true candidate is scored higher than the other candidates in the batch.

https://www.tensorflow.org/recommenders/api_docs/python/tfrs/tasks/Retrieval

You can use it this way (in 0.6.0):

import tensorflow as tf
import tensorflow_recommenders as tfrs

batch_metrics = [
    tf.keras.metrics.AUC(
        num_thresholds=200,
        curve="ROC",
        summation_method="interpolation",
        name="auc_metric",
        from_logits=True,
    )
]

task = tfrs.tasks.Retrieval(
    loss=entropy_loss,  # your own loss instance, defined elsewhere
    metrics=None,
    batch_metrics=batch_metrics,
    num_hard_negatives=None,
)

Also make sure you set compute_metrics=True and provide the candidate_ids parameter when calling the task:
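
For example, a call might look something like this (a sketch; query_embeddings, candidate_embeddings, and candidate_ids stand in for your own model outputs and the ids of the in-batch positive items):

loss = task(
    query_embeddings,
    candidate_embeddings,
    candidate_ids=candidate_ids,  # ids of the positive candidates in this batch
    compute_metrics=True,         # make sure metrics are actually computed
)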

5/5 [==============================] - 12s 131ms/step - auc: 0.8142
rlcauvin commented 2 years ago

The advice to provide the candidate IDs is very interesting, @entylop. I haven't seen examples that use it.

In my retrieval model, which has context features for the query and candidates, I have this task initialization code:

candidates = unique_candidate_ds.batch(128).map(lambda c: (c['item_id'], self.candidate_model(c)))
metrics = tfrs.metrics.FactorizedTopK(candidates = candidates)
batch_metrics = [tf.keras.metrics.AUC(from_logits = True, name = "retrieval_auc")]
self.task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
  loss = loss_calculator,
  metrics = metrics,
  batch_metrics = batch_metrics)

The model invokes the task as follows:

  def compute_loss(
    self,
    features: Dict[str, tf.Tensor],
    training: bool = False):

    query_output = self.query_model(features)
    candidate_output = self.candidate_model(features)
    loss = self.task(query_output, candidate_output)

    return loss

Based on your advice, it seems I should be passing the candidate IDs. How should I go about doing so, assuming I have access to the same candidates and unique_candidate_ds variables in the task initialization code?

Also, does including AUC or other batch metrics affect the loss?

ydennisy commented 2 years ago

@rlcauvin to use candidate ids you just change your line:

loss = self.task(query_output, candidate_output, candidate_ids=candidate_ids)
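
For example, inside compute_loss it could look something like this (a sketch, assuming the raw id of the positive item is available in the batch features under a key such as "item_id"; adjust the key to your dataset):

def compute_loss(self, features, training=False):
    query_output = self.query_model(features)
    candidate_output = self.candidate_model(features)
    # ids of the positive (clicked) items in this batch; "item_id" is an assumed key
    candidate_ids = features["item_id"]
    return self.task(query_output, candidate_output, candidate_ids=candidate_ids)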

Also, no, metrics do not affect the loss in any way.

ydennisy commented 2 years ago

@entylop sorry but another question for you!

I have used your suggestion to add AUC (thank you!). Now the strange thing is that AUC is going down over epochs, whilst top-k is improving alongside a decreasing loss. Any idea what could be causing such an effect?

Also I am not setting any num_hard_negatives at this stage.

rlcauvin commented 2 years ago

@rlcauvin to use candidate ids you just change your line:

loss = self.task(query_output, candidate_output, candidate_ids=candidate_ids)

Also, no, metrics do not affect the loss in any way.

Thanks. The description of the candidate_ids parameter is:

Optional tensor containing candidate ids. When given, enables removing accidental hits of examples used as negatives. An accidental hit is defined as a candidate that is used as an in-batch negative but has the same id as the positive candidate.

I don't understand that description and under what circumstances it is beneficial to pass candidate IDs to self.task. I know the unique item IDs, the item IDs associated with positive samples, and the item IDs associated with negative samples. Would I benefit from passing one of them to self.task?

My particular model has a binary target (either the user clicked the item or not). I don't know if that makes a difference in whether the candidate_ids parameter is useful. My understanding is that I have to use just positives for training a retrieval model for a binary target.

patrickorlando commented 2 years ago

@rlcauvin, you pass the positive candidate_ids. If you construct the Retrieval task with remove_accidental_hits=True, then the loss calculation will ensure that any sampled negatives that are the same item as the positive for that example will be ignored.

If this is set to False, then it doesn't affect training, and the ids are only used for calculating metrics.
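
Roughly, the wiring might look like this (a sketch, assuming a TFRS version that includes the remove_accidental_hits argument; metrics, batch_metrics, and the embedding/id variables are placeholders for your own):

task = tfrs.tasks.Retrieval(
    metrics=metrics,
    batch_metrics=batch_metrics,
    remove_accidental_hits=True,  # ignore in-batch negatives that share an id with the positive
)
loss = task(query_embeddings, candidate_embeddings, candidate_ids=candidate_ids)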

entylop commented 2 years ago

Note that the parameter remove_accidental_hits was added in this pull request https://github.com/tensorflow/recommenders/pull/381 and is not available in the latest released version (0.6.0).

ydennisy commented 2 years ago

Ohh good spot @entylop!

@maciejkula when will the latest changes be released?

rlcauvin commented 2 years ago

@rlcauvin, you pass the positive candidate_ids. If you construct the Retrieval task with remove_accidental_hits=True, then the loss calculation will ensure that any sampled negatives that are the same item as the positive for that example will be ignored.

If this is set to False, then it doesn't affect training, and the ids are only used for calculating metrics.

Thanks, @patrickorlando. Do I need to pass the candidate_ids that are positive just for the given user? It seems from your very helpful post that

loss = self.task(query_output, candidate_output, candidate_ids=candidate_ids)

is intended to prevent situations where the negatives in the matrix are in fact positives (a.k.a. "accidental hits"). A positive interaction is between a user and an item. Thus "positive candidates" would seem to be the set of candidates with which a single user has interacted positively, and I shouldn't pass all of the candidate IDs that have been in positive interactions across any of the users in the data set.

Correct?

Roger

patrickorlando commented 2 years ago

Just the positives for that batch, @rlcauvin. In practice, sampling a negative that is a positive in another batch doesn't affect performance and provides some mild regularisation.

You have candidate_ids of shape (batch_size, 1) and scores of shape (batch_size, batch_size). Essentially we are just creating a mask: tf.cast(candidate_ids == tf.transpose(candidate_ids), dtype=tf.float32) - tf.eye(batch_size). For example:

candidate_ids = [[0], [1], [2], [0], [3], [4], [3], [5]]  # shape (batch_size, 1)
mask = [
  [0, 0, 0, 1, 0, 0, 0, 0], # 0
  [0, 0, 0, 0, 0, 0, 0, 0], # 1
  [0, 0, 0, 0, 0, 0, 0, 0], # 2
  [1, 0, 0, 0, 0, 0, 0, 0], # 0 
  [0, 0, 0, 0, 0, 0, 1, 0], # 3
  [0, 0, 0, 0, 0, 0, 0, 0], # 4
  [0, 0, 0, 0, 1, 0, 0, 0], # 3
  [0, 0, 0, 0, 0, 0, 0, 0]  # 5
]
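
As a small runnable illustration of that mask (not the library's exact code):

import tensorflow as tf

candidate_ids = tf.constant([[0], [1], [2], [0], [3], [4], [3], [5]])  # (batch_size, 1)
batch_size = tf.shape(candidate_ids)[0]
# 1.0 wherever two different rows share the same candidate id, 0.0 elsewhere
duplicates = tf.cast(candidate_ids == tf.transpose(candidate_ids), tf.float32)
mask = duplicates - tf.eye(batch_size)
print(mask)  # matches the matrix above
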
rlcauvin commented 2 years ago

Thanks again, @patrickorlando. Your explanations and examples help a lot.

My larger challenge is that I have a retrieval model (of user clicks on items) that just doesn't learn, seemingly no matter what I try. (The ranking model on the same training and validation data learns and performs strikingly well.) For the retrieval model, I've tried:

  1. Just using user IDs and item IDs.
  2. Incorporating context features, such as the user's age and gender and the item's categories and word vector.
  3. Using tf.keras.losses.CategoricalCrossentropy.
  4. Using tf.keras.losses.BinaryCrossentropy.
  5. Using tf.keras.optimizers.Adam and tf.keras.optimizers.Adagrad with various learning rates and schedules.

I thought maybe if I tried passing the candidate_ids in the task invocation, it might improve the learning. So far, I haven't figured out how to do it, because I can't access the values of features["item_id"] or features["rating"] in the compute_loss method of the Keras model. I assume I need to know the item IDs and labels in the batch to know which candidate IDs are positives.

For all of what I've tried so far, the top K categorical accuracy metric and AUC both show the retrieval model performs no better than random. Loss, however, does decrease substantially during training.

patrickorlando commented 2 years ago

Hey @rlcauvin, I would start with

just using user IDs and item IDs

The model should learn based on just this.

When you experimented with the ranking model, is everything else in your code kept the same? Same lookup layers, embedding layers, tf.data pipelines?

I would:

  1. Ensure that the lookups are working as expected: take a few examples and manually pass them through. Are any items being mapped to the [UNK] (out-of-vocabulary) token? Is the shape correct? They should be only 1-dimensional, (batch_size,).
  2. Pass them through the embedding layers. Is each row different? Is the shape correct, (batch_size, n_dim)?
  3. Do the matrix multiplication. Are the scores different? Do you get a shape that is (batch_size, batch_size)?

The shape is important, because if the query and candidate tensors have an extra dimension, the matrix multiplication will produce an incorrect result. Your loss will decrease but your model will be junk.
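
A rough sketch of those three checks with toy data (the layer names, vocabularies, and embedding dimension here are made up; substitute your own lookup and embedding layers):

import tensorflow as tf

user_lookup = tf.keras.layers.StringLookup(vocabulary=["u1", "u2", "u3"])
item_lookup = tf.keras.layers.StringLookup(vocabulary=["i7", "i8", "i9"])
user_embedding = tf.keras.layers.Embedding(user_lookup.vocabulary_size(), 32)
item_embedding = tf.keras.layers.Embedding(item_lookup.vocabulary_size(), 32)

user_ids = tf.constant(["u1", "u2", "u3"])
item_ids = tf.constant(["i9", "i7", "i9"])

# 1. Lookups: indices should be shape (batch_size,) and not all mapped to the OOV index
user_idx = user_lookup(user_ids)
item_idx = item_lookup(item_ids)
print(user_idx.shape, item_idx.shape)  # expect (3,) and (3,)

# 2. Embeddings: each row should differ, shape (batch_size, n_dim)
query_emb = user_embedding(user_idx)
cand_emb = item_embedding(item_idx)
print(query_emb.shape, cand_emb.shape)  # expect (3, 32) and (3, 32)

# 3. Scores: shape (batch_size, batch_size), with different values per row
scores = tf.matmul(query_emb, cand_emb, transpose_b=True)
print(scores.shape)  # expect (3, 3)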

josealbertof commented 2 years ago

I would suggest taking a look at how AUC and top-k accuracy are being calculated. If the loss is decreasing but you see no improvement, there are three possible problems:

I hope this helps you to identify the root cause of your problem

rlcauvin commented 2 years ago

A belated thank-you to @patrickorlando and @josealbertof. I set this conundrum aside for a few weeks but got back to it a few days ago. Here are my conclusions.

First, the items simply don't matter much in my training and test data. I computed the SHAP importances of the dozen or so features in the ranking model, and the item ID and other features associated with the items just didn't have much importance. If the items just don't matter much, then a retrieval model is probably going to behave only slightly better than randomly retrieving candidate items.

I tested this hypothesis by synthesizing training and test data in which users consistently preferred items in certain categories. When I retrained the retrieval model with this synthetic data, the AUC and the top K categorical accuracies improved dramatically, confirming my hypothesis.

Second, it appears that tf.keras.losses.CategoricalCrossentropy(from_logits = True) is the correct loss calculator to use for my retrieval model. While tf.keras.losses.BinaryCrossentropy(from_logits = True) works great as a loss calculator for the ranking model, the retrieval model doesn't learn using that loss calculator. I'm grasping at straws here for an explanation, but perhaps the retrieval task treats each candidate as a class, whereas the ranking task is optimizing for a binary outcome (user clicks the item or doesn't)?
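
To illustrate my reading of it (a sketch, not the library's exact internals): in the retrieval setup the scores form a (batch_size, batch_size) matrix and the "label" for row i is the i-th in-batch candidate, i.e. a one-hot row, which is what categorical cross-entropy expects; binary cross-entropy instead expects a single 0/1 label per example, as in the ranking model.

import tensorflow as tf

batch_size = 4
scores = tf.random.normal((batch_size, batch_size))  # query-vs-candidate scores
labels = tf.eye(batch_size)  # one-hot rows: row i's positive is the i-th candidate
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(labels, scores)
print(loss)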

patrickorlando commented 2 years ago

@rlcauvin,

I'm not quite sure how to help in this case. It sounds quite odd. A few thoughts come to mind:

  1. How are items presented to users in your product? If it's a retail service where stock availability changes frequently, it could affect the Retrieval model.
  2. In your ranking model, what are you considering as negatives?
  3. Have you validated the recommendations qualitatively, or just using metrics? When inspecting the recommendations from your Ranking model, do they look good or make sense?

It is possible that details in your dataset or product make the problem better suited to ranking, but I'm not sure the effect would be as pronounced as you describe. Wish I could be of more help.