tensorflow/recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Retrieval: shared features between query and candidate tower #699

Closed · JV-Nunes closed this issue 11 months ago

JV-Nunes commented 12 months ago

I'm developing a system based on the two-tower architecture as defined in the Retrieval tutorial. I tested two approaches during development:

  1. Passing all features except 'item_id' to the query_model, and passing all item features to the candidate_model

  2. Passing all features except the item features to the query_model, and passing all item features to the candidate_model

Approach 1 yields much better training and validation metrics than approach 2. I am using several contextual features in both approaches. Are there any problems with approach 1? None of the tutorials I've seen or research I've done shares features between the towers.

Metrics:

**Approach 1 (shared features), by epoch:**

| Metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| factorized_top_k/top_1_categorical_accuracy | 0.1499 | 0.2493 | 0.3120 | 0.3024 | 0.3027 | 0.2852 | 0.2891 | 0.2891 | 0.2774 | 0.2825 |
| loss | 77.9557 | 8.9393 | 3.1107 | 2.1203 | 1.8437 | 1.6977 | 1.6517 | 1.5932 | 1.5578 | 1.5319 |
| regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| total_loss | 77.9557 | 8.9393 | 3.1107 | 2.1203 | 1.8437 | 1.6977 | 1.6517 | 1.5932 | 1.5578 | 1.5319 |
| val_factorized_top_k/top_1_categorical_accuracy | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_loss | 537.8380 | 257.5818 | 183.3954 | 155.2823 | 142.6167 | 134.3614 | 129.6871 | 126.0445 | 124.0921 | 122.6218 |
| val_regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_total_loss | 537.8380 | 257.5818 | 183.3954 | 155.2823 | 142.6167 | 134.3614 | 129.6871 | 126.0445 | 124.0921 | 122.6218 |

**Approach 2 (no shared features), by epoch:**

| Metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| factorized_top_k/top_1_categorical_accuracy | 0.0352 | 0.0338 | 0.0600 | 0.0786 | 0.0772 | 0.0939 | 0.1009 | 0.1163 | 0.1353 | 0.1555 |
| loss | 239.7879 | 75.0281 | 49.3811 | 29.5558 | 21.9984 | 18.2773 | 15.3387 | 13.0787 | 11.4170 | 9.9223 |
| regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| total_loss | 239.7879 | 75.0281 | 49.3811 | 29.5558 | 21.9984 | 18.2773 | 15.3387 | 13.0787 | 11.4170 | 9.9223 |
| val_factorized_top_k/top_1_categorical_accuracy | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_loss | 1446.6731 | 825.4975 | 761.7341 | 738.1477 | 721.4468 | 741.6263 | 769.6769 | 806.4929 | 844.6812 | 873.6269 |
| val_regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_total_loss | 1446.6731 | 825.4975 | 761.7341 | 738.1477 | 721.4468 | 741.6263 | 769.6769 | 806.4929 | 844.6812 | 873.6269 |
caesarjuly commented 12 months ago

Yes. The main problem comes from serving. At inference time, you have to generate the user embedding first and then use this embedding to search for the relevant item embeddings in an ANN search engine like FAISS. This is a tradeoff between serving latency and model performance: if you mix item features into the user side, then you have to re-generate the user embedding for each item.
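
For concreteness, here is a minimal sketch of that serving flow (the embeddings are random stand-ins for the outputs of trained towers, and FAISS is just one example of an ANN engine):

```python
import numpy as np
import faiss

embedding_dim, num_items = 32, 10_000

# Offline: embed every candidate once with the candidate tower and index it.
candidate_embeddings = np.random.rand(num_items, embedding_dim).astype("float32")
index = faiss.IndexFlatIP(embedding_dim)  # exact inner-product search
index.add(candidate_embeddings)

# Online: one forward pass through the query tower per request...
query_embedding = np.random.rand(1, embedding_dim).astype("float32")
# ...then a single lookup, independent of any item features.
scores, item_indices = index.search(query_embedding, 10)
```

If item features leak into the query tower, that single `index.search` call has to be replaced by one query-tower forward pass per candidate, which defeats the purpose of the index.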

patrickorlando commented 12 months ago

Hi @JV-Nunes, no feature of the positive item should be passed to the query tower. There are two problems with mixing them:

  1. You are leaking information about the target item. This will of course improve the metrics.

  2. During inference, you do not know the target item. What values for the item features would you pass to the query tower then? You could score each item in your corpus individually, but this is inefficient for large corpora, and at that point the problem should be modelled as a ranking problem with a more expressive model architecture.

The goal of the two-tower architecture is efficient retrieval, and it relies on two facts (see the sketch after this list):

  1. The candidate embeddings can be pre-calculated and don't depend on the query.
  2. The query doesn't depend on the item, so it can be embedded once and this embedding can be used to score all possible items (in practice, approximate nearest-neighbour search is used).
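
Since the two towers only meet at the dot product, the candidate side can be embedded and indexed ahead of time. A minimal TFRS sketch of that separation, assuming plain string-id features (feature names, vocabularies, and dimensions here are illustrative, not from the original post):

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

class TwoTowerModel(tfrs.Model):
    def __init__(self, user_ids, item_ids, items_ds):
        super().__init__()
        # Query tower: user/context features only.
        self.query_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=user_ids),
            tf.keras.layers.Embedding(len(user_ids) + 1, 32),
        ])
        # Candidate tower: item features only.
        self.candidate_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=item_ids),
            tf.keras.layers.Embedding(len(item_ids) + 1, 32),
        ])
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=items_ds.batch(128).map(self.candidate_model)
            )
        )

    def compute_loss(self, features, training=False):
        # No item feature ever reaches the query tower, and vice versa.
        query_embeddings = self.query_model(features["user_id"])
        candidate_embeddings = self.candidate_model(features["item_id"])
        return self.task(query_embeddings, candidate_embeddings)
```

After training, the candidate embeddings can be computed once and served from an index such as `tfrs.layers.factorized_top_k.BruteForce` (or ScaNN for approximate search), precisely because they do not depend on the query.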
JV-Nunes commented 12 months ago

@caesarjuly @patrickorlando Thank you for the valuable guidance. I have one more question:

When prototyping my model, I performed prediction via a brute-force index, using each user's last interaction to generate recommendations. That is, all features from the last interaction, including item features like 'item_name' or 'item_category'. Does this imply that the retrieval I'm performing is just fetching the very item from the last interaction? In other words, would the system always recommend only the last item that was purchased?

The correct approach would be to use only user + interaction features in the query model, excluding all item features at inference, right?

JV-Nunes commented 12 months ago

Regarding the serving part, do you recommend any guide or material on the best way to carry out the deployment? In my case there is no need for real-time inference, so it can be done in batch without any problems.

caesarjuly commented 12 months ago

> @caesarjuly @patrickorlando Thank you for the valuable guidance. I have one more question:
>
> When prototyping my model, I performed prediction via a brute-force index, using each user's last interaction to generate recommendations. That is, all features from the last interaction, including item features like 'item_name' or 'item_category'. Does this imply that the retrieval I'm performing is just fetching the very item from the last interaction? In other words, would the system always recommend only the last item that was purchased?
>
> The correct approach would be to use only user + interaction features in the query model, excluding all item features at inference, right?

That depends on how you construct your training dataset. It's actually a common approach to put the user's behavior sequence in the user tower, but you have to ensure that the target item in the item tower is the next behavior. For example, if the user tower uses items up to the T-th interaction as features, the item tower must use the (T+1)-th item as the target. The principle is that there must be no information leakage, namely no target-item information in the user tower.
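
For illustration, here is one leakage-free way to construct such examples from an interaction log (the column names are assumptions about your data, not part of the library):

```python
import pandas as pd

interactions = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "item_id":   ["i1", "i2", "i3", "i4", "i5"],
    "timestamp": [1, 2, 3, 1, 2],
})

examples = []
for user_id, group in interactions.sort_values("timestamp").groupby("user_id"):
    items = group["item_id"].tolist()
    for t in range(1, len(items)):
        examples.append({
            "user_id": user_id,
            "history": items[:t],      # query-tower feature: items up to T
            "target_item": items[t],   # candidate-tower label: the (T+1)-th item
        })
```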

caesarjuly commented 12 months ago

> Regarding the serving part, do you recommend any guide or material on the best way to carry out the deployment? In my case there is no need for real-time inference, so it can be done in batch without any problems.

Here is the official document from TensorFlow: https://www.tensorflow.org/recommenders/examples/efficient_serving#approximate_prediction. If your candidate set is small and there is no requirement for real-time inference, you can try any kind of ranking model that mixes the user features and target-item features together. The two-tower model is designed for online candidate retrieval and may not be suitable for your case.
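
If you go the ranking route, a minimal sketch in the spirit of the TFRS ranking tutorial might look like this (the feature names, the implicit `purchased` label, and the layer sizes are illustrative assumptions):

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

class PurchaseRankingModel(tfrs.Model):
    def __init__(self, user_ids, item_ids):
        super().__init__()
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=user_ids),
            tf.keras.layers.Embedding(len(user_ids) + 1, 32),
        ])
        self.item_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=item_ids),
            tf.keras.layers.Embedding(len(item_ids) + 1, 32),
        ])
        # Unlike the two-tower model, user and item features are scored jointly,
        # which is fine when every candidate can be scored at inference time.
        self.score_model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        self.task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.BinaryCrossentropy(),
            metrics=[tf.keras.metrics.AUC()],
        )

    def compute_loss(self, features, training=False):
        user_emb = self.user_embedding(features["user_id"])
        item_emb = self.item_embedding(features["item_id"])
        scores = self.score_model(tf.concat([user_emb, item_emb], axis=1))
        return self.task(labels=features["purchased"], predictions=scores)
```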

JV-Nunes commented 12 months ago

@caesarjuly Thank you very much. My candidate set is small indeed, but I do not have explicit feedback in my data. Is ranking the better option even with implicit feedback?

caesarjuly commented 12 months ago

> @caesarjuly Thank you very much. My candidate set is small indeed, but I do not have explicit feedback in my data. Is ranking the better option even with implicit feedback?

Yes. Implicit labels like clicks are very widely used in industry. If your requirement is simply to predict the likelihood of clicks from a small candidate set, a ranking model is naturally a reasonable choice.

JV-Nunes commented 11 months ago

@caesarjuly My goal is to predict purchase likelihood. The problem is that my dataset contains only purchase interactions, which are very sparse. How can I build a ranking model in this context? Which features would I use to provide the context needed to predict purchases without other implicit feedback?

JV-Nunes commented 11 months ago

For some context, I am using an offline retail purchase dataset. We do not have any interaction data besides purchase interactions.

caesarjuly commented 11 months ago

> @caesarjuly My goal is to predict purchase likelihood. The problem is that my dataset contains only purchase interactions, which are very sparse. How can I build a ranking model in this context? Which features would I use to provide the context needed to predict purchases without other implicit feedback?

Got you. If you had dense labels like clicks, that would be much better: you could consider models like ESMM to do multi-task transfer learning. If there are no other labels, you may want to check the solution from the Airbnb paper. The key idea for mitigating the sparsity issue is to group users (and items) into buckets.

> To address these very common marketplace problems in practice, we propose to learn embeddings at a level of listing_type instead of listing_id. Given meta-data available for a certain listing_id such as location, price, listing type, capacity, number of beds, etc., we use a rule-based mapping defined in Table 3 to determine its listing_type.

[image: the rule-based listing_type mapping, Table 3 of the Airbnb paper]
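
To make the bucketing idea concrete, a hypothetical rule-based mapping for a retail setting might look like the sketch below (the attributes and thresholds are made up for illustration; the Airbnb paper defines its own mapping in Table 3):

```python
def item_type(item):
    """Map a sparse item_id's metadata to a coarse item_type bucket."""
    price = "low" if item["price"] < 20 else "mid" if item["price"] < 100 else "high"
    return f"{item['category']}|{price}"

def user_type(user):
    """Map a sparse user_id's metadata to a coarse user_type bucket."""
    freq = "rare" if user["purchases_per_year"] < 3 else "regular"
    return f"{user['region']}|{freq}"

# Embeddings are then learned per type instead of per raw id:
print(item_type({"category": "grocery", "price": 12.5}))        # grocery|low
print(user_type({"region": "BR-SP", "purchases_per_year": 7}))  # BR-SP|regular
```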

Ullar-Kask commented 11 months ago

> For some context, I am using an offline retail purchase dataset. We do not have any interaction data besides purchase interactions.

No, you do have. You have the list of products that each customer has not purchased, and this can also be regarded as "interaction" data: we generate negative samples randomly from this list. You can go even further: if you store the list of recommended products and determine that the customer has not purchased some of them (e.g. within a window of 30 to 90 days), you can treat those non-purchased products as negative samples.
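
As a concrete sketch, random negative sampling from the non-purchased set can be as simple as this (the data structures are illustrative):

```python
import random

all_items = {"i1", "i2", "i3", "i4", "i5"}
purchases = {"u1": {"i1", "i3"}, "u2": {"i2"}}

def sample_negatives(user_id, num_negatives=2, seed=None):
    """Draw items the user never purchased and treat them as label-0 examples."""
    rng = random.Random(seed)
    candidates = sorted(all_items - purchases[user_id])
    return rng.sample(candidates, min(num_negatives, len(candidates)))

training_rows = []
for user_id, bought in purchases.items():
    training_rows += [(user_id, item, 1) for item in sorted(bought)]
    training_rows += [(user_id, item, 0) for item in sample_negatives(user_id, seed=42)]
```

Tracked recommendations that did not convert give harder, more reliable negatives than pure random sampling, since you know the user actually saw them.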

JV-Nunes commented 11 months ago

@Ullar-Kask For sure, good reminder on your part: we can use the results of past campaigns as additional feedback. I even think this negative feedback is more reliable, because it guarantees that the user actually received an impression of the item.