Optimizing Contrastive/Rank/Triplet Loss in Tensorflow for Neural Information Retrieval · Han Xiao Tech Blog #4

hanxiao opened this issue 6 years ago

hanxiao commented 6 years ago

https://hanxiao.github.io/2017/11/08/Optimizing-Contrastive-Rank-Triplet-Loss-in-Tensorflow-for-Neural/

Jack-Paz commented 6 years ago

Great article, very helpful.

Could you give an example of a reasonable value for weight?

hanxiao commented 6 years ago

@Jack-Paz In my previous work, I mainly used the weight to balance the few-shot/long-tail queries. You may also use something like log1p(num_clicks_queryi_on_productj) to put more weight on popular (q,d) pairs. But to me this is really task-specific and no rule of thumb exists.
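
For example, a rough sketch of the click-based weighting idea (`num_clicks` here is a hypothetical query-by-product click-count matrix, not something defined in the post):

```python
import numpy as np

# hypothetical click counts: num_clicks[i, j] = clicks of query i on product j
num_clicks = np.array([[120., 3., 0.],
                       [  7., 0., 1.]], dtype=np.float32)

# log1p favours popular pairs without letting huge counts dominate;
# these values would be fed in as the per-pair weight term of the loss
weight = np.log1p(num_clicks)
```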

Ekkalak-T commented 6 years ago

Very nice and useful article.

Could you please provide an example of how to run inference with the model after training is done?

My idea is:

  1. Feed an input query to the query encoder to get a query vector.
  2. Compare the query vector with all document vectors by calculating metric_p.
  3. Use metric_p as the relevance score.

hanxiao commented 6 years ago

@Ekkalak-T Yes, you are on the right track. To improve inference-time performance, I would do this:

  0. Precompute all document vectors using the trained doc-encoder and store them, say in memory.
  1. Every time you receive a new query, feed it to the query-encoder to get the query vector.
  2. Compute the similarity between the query vector and all stored doc vectors.

Notice that step 0 is a one-time thing.

To improve the efficiency of step 2, you may index your doc-vectors using specific data structures, such as a KD-tree. This is another topic called *Approximate Nearest Neighbours*. Please check Facebook's faiss.
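
A minimal sketch of steps 0-2 (plain NumPy, with random vectors standing in for the trained encoders, and brute-force cosine similarity instead of an ANN index like faiss):

```python
import numpy as np

# step 0 (one-time): precompute all doc vectors with the trained doc-encoder;
# random vectors stand in for the encoder outputs here
num_docs, dim = 10000, 128
doc_vecs = np.random.randn(num_docs, dim).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)   # keep them L2-normalized

def rank(q_vec):
    # step 1: q_vec comes from the query-encoder for the incoming query
    q_vec = q_vec / np.linalg.norm(q_vec)
    # step 2: cosine similarity against every stored doc vector
    scores = doc_vecs @ q_vec
    return np.argsort(-scores)          # doc indices, most relevant first

ranking = rank(np.random.randn(dim).astype(np.float32))
```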

Ekkalak-T commented 6 years ago

@hanxiao Thank you so much for the clarification. Did you mean that metric_n is not involved in inference, and that we can rank the documents by metric_p only?

Regarding inference-time performance: how can we improve step 2 if we use 'MLP' in the metric layer?

hanxiao commented 6 years ago

@Ekkalak-T At inference time there is no need to compute metric_n. You just compute metric_p and sort the documents by it in descending order as the final result, done! The reason is that during training the relevant documents are already encouraged to get a higher metric value than the irrelevant ones.

When MLP is used, there is no easy way to improve the efficiency of step 2. A special case would be an MLP without a nonlinear activation function; then the MLP collapses to a single-layer perceptron, say with weight W. As a consequence, tf.matmul(W, tf.concat([q_vec, d_vec])) can be rewritten as

tf.matmul(W, tf.concat([q_vec, tf.zeros_like(d_vec)])) + tf.matmul(W, tf.concat([tf.zeros_like(q_vec), d_vec]))

The second term is independent of q_vec and thus can be computed in advance in step 0.

The improvement is minor I guess, but still better than nothing.
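
A quick NumPy sketch of that decomposition (shapes are arbitrary; it just checks that the doc-dependent part can be precomputed separately):

```python
import numpy as np

dim = 4
W = np.random.randn(1, 2 * dim)                        # single-layer metric weight, no bias/nonlinearity
q_vec = np.random.randn(dim)
d_vec = np.random.randn(dim)

full = W @ np.concatenate([q_vec, d_vec])

q_part = W @ np.concatenate([q_vec, np.zeros(dim)])    # depends only on the query
d_part = W @ np.concatenate([np.zeros(dim), d_vec])    # independent of the query: precompute in step 0

assert np.allclose(full, q_part + d_part)
```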

Ekkalak-T commented 6 years ago

@hanxiao Thanks for the brilliant idea. Could you please share how you choose the model for the metric layer?

I found that when using cosine or l2, the calculated model loss is not bounded between 0 and 1, so it is hard to monitor whether the model is learning.

I ended up using MLP in the metric layer and changed reduce_sum to reduce_mean in the model_loss to average the losses over the batch. As a result, the model loss is now between 0 and 1.

model_loss = tf.reduce_mean(loss_q_pos)

(Let's say all queries have the same weight, so I removed the weight term.)
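
Concretely, something like this (the loss_q_pos values are just dummies to show the reduction; in practice they come from the metric layer):

```python
import tensorflow as tf

# dummy per-query losses, shape [batch_size]
loss_q_pos = tf.constant([0.3, 0.1, 0.7, 0.0])

summed = tf.reduce_sum(loss_q_pos)        # grows with the batch size, harder to compare across runs
model_loss = tf.reduce_mean(loss_q_pos)   # averaged over the batch, stays on a fixed scale
```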

Is this the correct way to monitor the loss?