tangzhy opened 4 years ago
The Retrieval task uses a form of sampled softmax ("in-batch softmax") that uses the other elements of the batch as negatives.
The interesting questions here revolve around what sampling strategy for negatives we should adopt: sampled softmax by default samples uniformly from all candidates; in-batch softmax is biased towards the positives distribution.
You can certainly use either with TFRS: the built-in Retrieval task currently only offers in-batch softmax, but you should be able to use sampled softmax and re-use much of the TFRS machinery.
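For a rough idea of the sampled-softmax route, here is a minimal sketch of wiring tf.nn.sampled_softmax_loss into a two-tower setup. It is only an assumption about how one might do it: it treats the candidate side as a plain embedding table rather than a full candidate tower, and candidate_table, candidate_bias and label_ids are hypothetical names, not TFRS API.

import tensorflow as tf

num_candidates = 10_000   # size of the full candidate vocabulary (assumed)
embedding_dim = 32

# Sampled softmax needs the full candidate embedding table and a bias vector,
# rather than the per-batch candidate embeddings used by in-batch softmax.
candidate_table = tf.Variable(tf.random.normal([num_candidates, embedding_dim]))
candidate_bias = tf.Variable(tf.zeros([num_candidates]))

def sampled_softmax_loss(query_embeddings, label_ids, num_sampled=100):
  """query_embeddings: [batch_size, dim]; label_ids: [batch_size, 1] int64 candidate indices."""
  return tf.reduce_mean(
      tf.nn.sampled_softmax_loss(
          weights=candidate_table,
          biases=candidate_bias,
          labels=label_ids,
          inputs=query_embeddings,
          num_sampled=num_sampled,      # negatives drawn by a log-uniform sampler by default
          num_classes=num_candidates,
          remove_accidental_hits=True))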
@maciejkula Thanks for your patient answer.
I've seen many papers and experiments report that in-batch softmax often achieves better performance than other strategies. Some combinations, like in-batch softmax plus one global random negative example, go even further, e.g., Dense Passage Retrieval for Open-Domain Question Answering.
I wonder whether your choice of in-batch softmax as the default implementation reflects these observations, and whether it has perhaps already been verified at large scale at Google.
If I want the flexibility to combine in-batch softmax with global negative sampling, all I need to do is modify the built-in Retrieval task, is that right?
In-batch softmax is definitely a very successful strategy; you can have a look at this paper for details and extensions.
There is actually a simpler way of adding global negative sampling: simply add additional rows to the end of the candidate embeddings matrix you pass to the existing Retrieval task. For example, right now you have 10 rows for user embeddings and 10 rows for candidate embeddings; if you append 10 additional rows to the candidate embeddings, those 10 rows will act as global negatives. You could sample those any way you want.
We plan on documenting this functionality/making it easier, but for now it should have the effect you want!
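For a rough idea of what that could look like in code, here is a sketch with random placeholder embeddings standing in for the outputs of your towers and for negatives sampled from the full candidate corpus:

import tensorflow as tf
import tensorflow_recommenders as tfrs

batch_size, num_negatives, dim = 4, 8, 16
task = tfrs.tasks.Retrieval()

# Placeholder embeddings: in a real model these come from the query tower,
# the candidate tower, and whatever global negative sampling scheme you choose.
query_embeddings = tf.random.normal([batch_size, dim])
candidate_embeddings = tf.random.normal([batch_size, dim])
global_negative_embeddings = tf.random.normal([num_negatives, dim])

# Append the sampled global negatives after the in-batch candidates; the extra
# rows act as additional negatives for every query in the batch.
extended_candidates = tf.concat(
    [candidate_embeddings, global_negative_embeddings], axis=0)

loss = task(query_embeddings, extended_candidates)  # (B, D) scored against (B + K, D)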
@maciejkula I find your answer really interesting after taking a close look at the Retrieval source code.
I notice that you've already taken the sampling-bias-corrected paper into account, especially in the code I paste below:
class Retrieval(tf.keras.layers.Layer, base.Task):

  def __init__(self,
               loss: Optional[tf.keras.losses.Loss] = None,
               metrics: Optional[tfrs_metrics.FactorizedTopK] = None,
               temperature: Optional[float] = None,
               num_hard_negatives: Optional[int] = None,
               name: Optional[Text] = None) -> None:
    """Initializes the task.

    Args:
      loss: Loss function. Defaults to
        `tf.keras.losses.CategoricalCrossentropy`.
      metrics: Object for evaluating top-K metrics over a
        corpus of candidates. These metrics measure how good the model is at
        picking the true candidate out of all possible candidates in the system.
        Note, because the metrics range over the entire candidate set, they are
        usually much slower to compute. Consider setting `compute_metrics=False`
        during training to save the time in computing the metrics.
      temperature: Temperature of the softmax.
      num_hard_negatives: If positive, the `num_hard_negatives` negative
        examples with largest logits are kept when computing cross-entropy loss.
        If larger than batch size or non-positive, all the negative examples are
        kept.
      name: Optional task name.
    """
    ...

  def call(self,
           query_embeddings: tf.Tensor,
           candidate_embeddings: tf.Tensor,
           sample_weight: Optional[tf.Tensor] = None,
           candidate_sampling_probability: Optional[tf.Tensor] = None,
           candidate_ids: Optional[tf.Tensor] = None,
           compute_metrics: bool = True) -> tf.Tensor:
    """Computes the task loss and metrics.

    The main arguments are pairs of query and candidate embeddings: the first row
    of query_embeddings denotes a query for which the candidate from the first
    row of candidate embeddings was selected by the user.

    The task will try to maximize the affinity of these query, candidate pairs
    while minimizing the affinity between the query and candidates belonging
    to other queries in the batch.

    Args:
      query_embeddings: [num_queries, embedding_dim] tensor of query
        representations.
      candidate_embeddings: [num_queries, embedding_dim] tensor of candidate
        representations.
      sample_weight: [num_queries] tensor of sample weights.
      candidate_sampling_probability: Optional tensor of candidate sampling
        probabilities. When given, will be used to correct the logits to
        reflect the sampling probability of negative candidates.
      candidate_ids: Optional tensor containing candidate ids. When given,
        enables removing accidental hits of examples used as negatives. An
        accidental hit is defined as a candidate that is used as an in-batch
        negative but has the same id as the positive candidate.
      compute_metrics: Whether to compute metrics. Set this to False
        during training for faster training.

    Returns:
      loss: Tensor of loss values.
    """
From this code, temperature can be set when initializing a Retrieval task, and sample_weight, candidate_sampling_probability, and candidate_ids can be passed to the loss. These are well designed to match the sampling-bias-corrected paper. Is my understanding right? But here also come my questions:
1. When training on an implicit dataset, we want to turn implicit confidence into sample_weight. Say, the number of clicks = 5, how shall we pass it in? Is there any preprocessing transformation?
2. Shall candidate_sampling_probability be calculated via some offline method in advance, where each entry corresponds to the probability of that candidate being selected in a random batch? Is there any code example like the algorithm described in the paper?
3. What does candidate_ids look like?

When training on an implicit dataset, we want to turn implicit confidence into sample_weight. Say, the number of clicks = 5, how shall we pass it in? Is there any preprocessing transformation?
You can directly pass the features as sample_weight when writing your customized compute_loss method in your tfrs.Model subclass. Take the basic MovielensModel as an example (adapted from the tutorial):
from typing import Dict, Text

import tensorflow as tf
import tensorflow_recommenders as tfrs

class MovielensModel(tfrs.Model):

  def __init__(self, user_model, movie_model, task):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    # The retrieval task (a tfrs.tasks.Retrieval instance) is passed in here.
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["user_id"])
    # And pick out the movie features and pass them into the movie model,
    # getting embeddings back.
    positive_movie_embeddings = self.movie_model(features["movie_title"])
    # Implicit weights derived from the number of clicks.
    implicit_weights = features["click"]
    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings, sample_weight=implicit_weights)
For preprocessing, if using clicks, it is up to the application domain. In many applications, click counts follow a long-tailed power-law distribution. Some typical preprocessing choices are 1) capping the number of clicks and 2) transforming into the log domain. Again, this is application- and data-dependent.
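A minimal sketch of that kind of preprocessing, assuming a hypothetical "click" feature; the cap value is arbitrary and should be tuned for your data:

import tensorflow as tf

def clicks_to_weights(clicks, cap=100.0):
  clicks = tf.cast(clicks, tf.float32)
  capped = tf.minimum(clicks, cap)   # 1) cap the number of clicks
  return tf.math.log1p(capped)       # 2) transform into the log domain

# e.g. inside compute_loss from the snippet above:
# implicit_weights = clicks_to_weights(features["click"])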
Shall candidate_sampling_probability be calculated via some offline method in advance, where each entry corresponds to the probability of that candidate being selected in a random batch? Is there any code example like the algorithm described in the paper?
Correct, this candidate_sampling_probability can be computed offline and given as part of the features in the training data. As for the streaming algorithm described in the paper, we plan to release that module in a future release.
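For example, one simple offline estimate could look like the toy pandas sketch below; the "movie_title" column and the frequency-as-probability approximation are assumptions for illustration, not the paper's streaming estimator.

import pandas as pd

# Toy interaction log standing in for the real training data.
df = pd.DataFrame({"movie_title": ["A", "B", "A", "C", "A", "B"]})

# Empirical frequency of each candidate; under uniform random batching this
# approximates the probability of that candidate being drawn for any given
# slot of a batch.
frequency = df["movie_title"].value_counts() / len(df)
df["candidate_sampling_probability"] = df["movie_title"].map(frequency)
print(df)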
What does candidate_ids look like?
Candidate ids will be a tensor of shape [batch_size, 1] (or [tf.shape(candidate_embeddings)[0], 1] if we have additional candidates), of type tf.string. Each element corresponds to a unique identifier of the candidate, which is used for removing negative candidates that have the same id as the positive.
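For example, with purely illustrative ids for a batch of three examples:

import tensorflow as tf

# Shape [batch_size, 1], dtype tf.string, one id per positive candidate.
# Rows 0 and 2 share an id, so neither will be counted as a negative for the
# other query once accidental hits are removed.
candidate_ids = tf.constant([["movie_42"], ["movie_48"], ["movie_42"]])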
Just a piece of advice: all these sampling-bias-correction designs matter so much in real large-scale retrieval tasks that I cannot wait to see a tutorial released on this topic! :)
Indeed! Negative sampling (as Maciej mentioned) and bias correction are critical components to make a model work in practice. We will share more on this, stay tuned!
@maciejkula @biteorange Hi, I'm a bit lost on how to pass the data into the Retrieval task.
Say, my data is set up as follows:
import numpy as np

user_embeddings = np.array([
[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.7]
]) # (num_queries, embedding_dim)
positive_movie_embeddings = np.array([
[0.2, 0.1],
[0.5, 0.2],
[0.2, 0.1]
]) # (num_queries, embedding_dim). Row 0 and row 2 are set up to be the same candidate.
sample_weight = np.array([5, 1, 9]) # (num_queries,)
candidate_sampling_probability = np.array([0.67, 0.52, 0.67]) # (num_queries,). Note entry 0 and entry 2 are the same.
candidate_ids = np.array([42, 48, 42]) # We set entry 0 and 2 to be the same candidate id.
[...] inf or greater than 1. Any advice on how to deal with them?
I tried passing the additional rows as suggested above, but the code fails at https://github.com/tensorflow/recommenders/blob/28a28f02e524f14f3e6facd9e276cc82bbb719df/tensorflow_recommenders/tasks/retrieval.py#L156. How do I add global negative examples?
@kapilduhoon You must concatenate them to your candidate embeddings tensor on the first axis.
If your batch size is B, your output embeddings are D-dimensional, and you have K extra negatives, then you should have query_embeddings with shape (B, D) and candidate_embeddings with shape (B + K, D).
The matrix multiplication will then return a scores tensor of shape (B, B + K).
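A quick way to sanity-check those shapes with random placeholder tensors (B=4, D=8, K=2 here are arbitrary):

import tensorflow as tf

B, D, K = 4, 8, 2
query_embeddings = tf.random.normal([B, D])
candidate_embeddings = tf.concat(
    [tf.random.normal([B, D]), tf.random.normal([K, D])], axis=0)  # (B + K, D)

# The retrieval task scores queries against candidates with a matmul like this.
scores = tf.matmul(query_embeddings, candidate_embeddings, transpose_b=True)
print(scores.shape)  # (4, 6), i.e. (B, B + K)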
@patrickorlando Thanks for your response. I'm now facing this error:
InvalidArgumentError: Incompatible shapes: [8192,256] vs. [4096,256] [[node retrieval_22/mul (defined at /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow_recommenders/metrics/factorized_top_k.py:81) ]] [Op:__inference_train_function_84266]
Errors may have originated from an input operation.
Input Source operations connected to node retrieval_22/mul:
query_model_22/sequential_180/dense_155/BiasAdd (defined at
concat_1 (defined at
Function call stack: train_function
In this case I would advise you to manually trace the data through your model components. This should help you identify clearly where the tensors don't match what you should expect.
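For instance, something along these lines, assuming a tf.data.Dataset named train and tower/feature names following the MovieLens tutorial (all names are placeholders for your own model):

# Run one batch through each tower by hand and check the shapes before they
# reach the Retrieval task and its metrics.
for features in train.batch(4096).take(1):
  q = model.user_model(features["user_id"])
  c = model.movie_model(features["movie_title"])
  print("query embeddings:", q.shape)      # expected (4096, D)
  print("candidate embeddings:", c.shape)  # expected (4096, D) before appending extra negatives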
Hi team (@patrickorlando), is there any update regarding the implementation of non-in-batch negative examples (like in this paper)? Can you share how to implement this?
Hi, as shown in the basic_retrieval tutorial, we seem to use the tf.keras.losses.CategoricalCrossentropy loss by default. I wonder if there is any difference between that and tf.nn.sampled_softmax_loss? In my view, which is also mentioned in the YouTube DNN paper (Google, 2016), it might be better to use sampled softmax (corresponding to multi-class classification) since we are at the retrieval stage. If so, how can we incorporate sampled softmax into the model, given that we are using Keras as the high-level API? Any example code?