tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

SampledSoftmax Loss in Retrieval #140

Open tangzhy opened 3 years ago

tangzhy commented 3 years ago

Hi, as shown in the basic_retrieval tutorial, we seem to use the tf.keras.losses.CategoricalCrossentropy loss by default.

  1. I wonder if there is any difference between that and tf.nn.sampled_softmax_loss. In my view, which is also taken in the YouTube DNN paper (Google, 2016), it might be better to use sampled softmax (treating retrieval as multi-class classification) since we are at the retrieval stage.

  2. If so, how can we incorporate sampled softmax into the model, given that we are using Keras as the high-level API? Any example code?

maciejkula commented 3 years ago

The Retrieval task uses a form of sampled softmax ("in-batch softmax") that uses the other elements of the batch as negatives.

The interesting questions here revolve around what sampling strategy for negatives we should adopt: sampled softmax by default samples uniformly from all candidates; in-batch softmax is biased towards the positives distribution.

You can certainly use either with TFRS: the built-in Retrieval task currently only offers in-batch softmax, but you should be able to use sampled softmax and re-use much of the TFRS machinery.
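
For intuition, here is a minimal sketch of what the in-batch softmax computes (a simplification of what the Retrieval task does, not its actual implementation; no temperature, bias correction, or hard-negative mining):

import tensorflow as tf

def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
  # Score every query against every candidate in the batch: [batch, batch].
  # The diagonal holds the true (positive) pairs; every off-diagonal entry
  # is treated as a negative.
  scores = tf.matmul(query_embeddings, candidate_embeddings, transpose_b=True)
  labels = tf.eye(tf.shape(scores)[0], tf.shape(scores)[1])
  loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
  return loss_fn(labels, scores)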

tangzhy commented 3 years ago

@maciejkula Thanks for your patient answer.

I've seen many papers and experiments stating that in-batch softmax often achieves better performance than the alternatives. Some combinations, such as in-batch softmax plus one global random negative per example, go even further, e.g. Dense Passage Retrieval for Open-Domain Question Answering.

  1. I wonder whether your choice of in-batch softmax as the default implementation supports my observations above, and whether it has perhaps already been verified in large-scale scenarios at Google.

  2. If I want the flexibility to combine in-batch softmax with global negative sampling, is modifying the built-in Retrieval task all I need to do?

maciejkula commented 3 years ago

In-batch softmax is definitely a very successful strategy; you can have a look at this paper for details and extensions.

There is actually a simpler way of adding global negative sampling: simply append additional rows to the end of the candidate embeddings matrix you pass to the existing Retrieval task. For example, right now you have 10 rows of user embeddings and 10 rows of candidate embeddings; if you append 10 additional rows to the candidate embeddings, those 10 rows will act as global negatives. You can sample those any way you want.

We plan on documenting this functionality/making it easier, but for now it should have the effect you want!
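
For illustration, a rough sketch of what that could look like inside a custom compute_loss (the negative_movie_titles feature, a 1-D vector of globally sampled titles, is an assumption for this example, not an existing tutorial feature):

def compute_loss(self, features, training=False):
  query_embeddings = self.user_model(features["user_id"])
  positive_embeddings = self.movie_model(features["movie_title"])

  # Embed extra candidates sampled globally (e.g. uniformly from the corpus).
  # They are appended after the in-batch positives, so they only ever act
  # as negatives.
  global_negative_embeddings = self.movie_model(features["negative_movie_titles"])
  candidate_embeddings = tf.concat(
      [positive_embeddings, global_negative_embeddings], axis=0)

  return self.task(query_embeddings, candidate_embeddings)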

tangzhy commented 3 years ago

@maciejkula I find your answer really interesting after taking a close look at the Retrieval source code.

I notice that you've already taken the sampling-bias-correction paper into account, especially in the code I paste below:

class Retrieval(tf.keras.layers.Layer, base.Task):
  def __init__(self,
               loss: Optional[tf.keras.losses.Loss] = None,
               metrics: Optional[tfrs_metrics.FactorizedTopK] = None,
               temperature: Optional[float] = None,
               num_hard_negatives: Optional[int] = None,
               name: Optional[Text] = None) -> None:
    """Initializes the task.
    Args:
      loss: Loss function. Defaults to
        `tf.keras.losses.CategoricalCrossentropy`.
      metrics: Object for evaluating top-K metrics over a
       corpus of candidates. These metrics measure how good the model is at
       picking the true candidate out of all possible candidates in the system.
       Note, because the metrics range over the entire candidate set, they are
       usually much slower to compute. Consider setting `compute_metrics=False`
       during training to save the time in computing the metrics.
      temperature: Temperature of the softmax.
      num_hard_negatives: If positive, the `num_hard_negatives` negative
        examples with largest logits are kept when computing cross-entropy loss.
        If larger than batch size or non-positive, all the negative examples are
        kept.
      name: Optional task name.
    """
  ...
  def call(self,
           query_embeddings: tf.Tensor,
           candidate_embeddings: tf.Tensor,
           sample_weight: Optional[tf.Tensor] = None,
           candidate_sampling_probability: Optional[tf.Tensor] = None,
           candidate_ids: Optional[tf.Tensor] = None,
           compute_metrics: bool = True) -> tf.Tensor:
    """Computes the task loss and metrics.
    The main arguments are pairs of query and candidate embeddings: the first row
    of query_embeddings denotes a query for which the candidate from the first
    row of candidate embeddings was selected by the user.
    The task will try to maximize the affinity of these query, candidate pairs
    while minimizing the affinity between the query and candidates belonging
    to other queries in the batch.
    Args:
      query_embeddings: [num_queries, embedding_dim] tensor of query
        representations.
      candidate_embeddings: [num_queries, embedding_dim] tensor of candidate
        representations.
      sample_weight: [num_queries] tensor of sample weights.
      candidate_sampling_probability: Optional tensor of candidate sampling
        probabilities. When given, it will be used to correct the logits to
        reflect the sampling probability of negative candidates.
      candidate_ids: Optional tensor containing candidate ids. When given
        enables removing accidental hits of examples used as negatives. An
        accidental hit is defined as a candidate that is used as an in-batch
        negative but has the same id as the positive candidate.
      compute_metrics: Whether to compute metrics. Set this to False
        during training for faster training.
    Returns:
      loss: Tensor of loss values.
    """

But here also come my questions:

  1. When training on an implicit-feedback dataset, we want to turn implicit confidence into sample_weight. Say the number of clicks is 5; how shall we pass it in? Is there any preprocessing transformation?
  2. Should candidate_sampling_probability be calculated in advance via some offline method, where each entry corresponds to the probability that the candidate gets selected in a random batch? Is there any code example, like the streaming algorithm described in the paper?
  3. What does candidate_ids look like?
  4. Just a suggestion: all these sampling-bias-correction designs matter so much in real large-scale retrieval tasks that I cannot wait to see a tutorial released on this topic! :)

biteorange commented 3 years ago

When training on an implicit-feedback dataset, we want to turn implicit confidence into sample_weight. Say the number of clicks is 5; how shall we pass it in? Is there any preprocessing transformation?

You can pass the features directly as sample_weight when writing your customized compute_loss method in the tfrs.Model class. Take the basic MovielensModel from the tutorial as an example:

from typing import Dict, Text

import tensorflow as tf
import tensorflow_recommenders as tfrs


class MovielensModel(tfrs.Model):

  def __init__(self, user_model, movie_model, task):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["user_id"])
    # And pick out the movie features and pass them into the movie model,
    # getting embeddings back.
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    # implicit weights using number of clicks.
    implicit_weights = features["click"]

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings, sample_weight=implicit_weights)

For preprocessing, if using clicks, it depends on the application domain. In many applications, clicks follow a long-tailed power-law distribution. Typical preprocessing choices are 1) capping the number of clicks and 2) transforming into the log domain. Again, this is application- and data-dependent.
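
For instance, a possible (purely illustrative) transformation inside compute_loss:

# Cap very large click counts, then compress them with a log transform.
# The cap of 100 is an arbitrary choice for this example.
clicks = tf.cast(features["click"], tf.float32)
implicit_weights = tf.math.log1p(tf.minimum(clicks, 100.0))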

Should candidate_sampling_probability be calculated in advance via some offline method, where each entry corresponds to the probability that the candidate gets selected in a random batch? Is there any code example, like the streaming algorithm described in the paper?

Correct, this candidate_sampling_probability can be computed offline and supplied as part of the features in the training data. As for the streaming algorithm described in the paper, we plan to release a module for it in a future release.
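
One (assumed, not official) way to precompute it offline is to use each candidate's empirical frequency in the training data, since under uniform shuffling that approximates the probability of it appearing at a random position in a batch:

import pandas as pd

# interactions is your training table (the path is a placeholder);
# "movie_title" identifies the candidate.
interactions = pd.read_csv("train.csv")
frequencies = interactions["movie_title"].value_counts(normalize=True)

# Attach each example's positive-candidate probability as a feature column,
# to be fed in later as candidate_sampling_probability.
interactions["candidate_sampling_probability"] = (
    interactions["movie_title"].map(frequencies))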

What does candidate_ids look like?

Candidate ids will be a tensor of shape [batch_size, 1] (or [tf.shape(candidate_embeddings)[0], 1] if we have additional candidates) of type tf.string. Each element corresponds to a unique identifier of the candidate, which is used for removing negative candidates that have the same id as the positive.
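
For example (a toy illustration, with task being a Retrieval instance):

candidate_ids = tf.constant([["movie_42"], ["movie_48"], ["movie_42"]])  # [batch_size, 1], tf.string
loss = task(query_embeddings, candidate_embeddings, candidate_ids=candidate_ids)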

Just a suggestion: all these sampling-bias-correction designs matter so much in real large-scale retrieval tasks that I cannot wait to see a tutorial released on this topic! :)

Indeed! As you mentioned, negative sampling (as Maciej described) and bias correction are critical components for making a model work in practice. We will share more on this, stay tuned!

tangzhy commented 3 years ago

@maciejkula @biteorange Hi, I'm a bit lost on how to pass the data into the Retrieval task.

Say, my data is set up as follows:

import numpy as np

user_embeddings = np.array([
    [0.1, 0.2], 
    [0.3, 0.4], 
    [0.5, 0.7]
]) # (num_queries, embedding_dim)

positive_movie_embeddings = np.array([
    [0.2, 0.1], 
    [0.5, 0.2], 
    [0.2, 0.1]
]) # (num_queries, embedding_dim). Rows 0 and 2 are set up to be the same candidate.

sample_weight = np.array([5, 1, 9]) # (num_queries,)

candidate_sampling_probability = np.array([0.67, 0.52, 0.67]) # (num_queries,). Note entry 0 and entry 2 are the same.

candidate_ids = np.array([42, 48, 42]) # Entries 0 and 2 are the same candidate id.

  1. Does this toy example align with what you mean?
  2. The candidate_sampling_probability is generated by a moving-average algorithm. Say we initialize the arrays A and B proposed in the paper with all zero values. There may be extreme cases where a candidate never occurs in the stream of batches, or occurs only a few times, so its moving-average value ends up in [0, 1]. When we use the reciprocal of that value as the candidate_sampling_probability, it would be inf or greater than 1. Any advice on how to deal with these cases?

kapilduhoon commented 1 year ago

In-batch softmax is definitely a very successful strategy; you can have a look at this paper for details and extensions.

There is actually a simpler way of adding global negative sampling: simply append additional rows to the end of the candidate embeddings matrix you pass to the existing Retrieval task. For example, right now you have 10 rows of user embeddings and 10 rows of candidate embeddings; if you append 10 additional rows to the candidate embeddings, those 10 rows will act as global negatives. You can sample those any way you want.

We plan on documenting this functionality/making it easier, but for now it should have the effect you want!

I tried passing the additional rows, but the code fails at https://github.com/tensorflow/recommenders/blob/28a28f02e524f14f3e6facd9e276cc82bbb719df/tensorflow_recommenders/tasks/retrieval.py#L156. How do I add global negative examples?

patrickorlando commented 1 year ago

@kapilduhoon you must concatenate them to your candidate embeddings tensor on the first axis. If you have B batch size, D dimensional output embeddings and K extra negatives, then you should have query_embeddings with shape (B, D) and candidate_embeddings with shape (B + K, D). The matrix multiplication will then return a scores tensor of shape (B, B + K).
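
A minimal shape sketch (B, K, and D here are placeholders for your setup):

import tensorflow as tf

B, K, D = 4096, 4096, 256                  # batch size, extra negatives, embedding dim
query_embeddings = tf.random.normal([B, D])
positive_embeddings = tf.random.normal([B, D])
extra_negatives = tf.random.normal([K, D])

candidate_embeddings = tf.concat([positive_embeddings, extra_negatives], axis=0)  # [B + K, D]
scores = tf.matmul(query_embeddings, candidate_embeddings, transpose_b=True)      # [B, B + K]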

kapilduhoon commented 1 year ago

@patrickorlando

Thanks for your response. I'm facing this error now:

InvalidArgumentError: Incompatible shapes: [8192,256] vs. [4096,256] [[node retrieval_22/mul (defined at /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow_recommenders/metrics/factorized_top_k.py:81) ]] [Op:__inference_train_function_84266]

Errors may have originated from an input operation. Input Source operations connected to node retrieval_22/mul: query_model_22/sequential_180/dense_155/BiasAdd (defined at :147)
concat_1 (defined at :106)

Function call stack: train_function

patrickorlando commented 1 year ago

In this case I would advise you to manually trace the data through your model components. This should help you clearly identify where the tensors don't match what you expect.
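
For example, something along these lines (train_ds and the model attribute names are assumptions based on the earlier tutorial-style code):

# Pull one batch and run each component by hand, checking the shapes.
features = next(iter(train_ds.batch(4096)))

query_embeddings = model.user_model(features["user_id"])
print(query_embeddings.shape)          # expect (4096, D)

positive_embeddings = model.movie_model(features["movie_title"])
print(positive_embeddings.shape)       # expect (4096, D)

# After concatenating K global negatives, the candidate matrix should be
# (4096 + K, D), and the resulting score matrix (4096, 4096 + K).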

Akshaysharma29 commented 10 months ago

Hi team (@patrickorlando), is there any update regarding the implementation of non-in-batch negative examples (like in this paper)? Can you share a way to implement this?