tensorflow/recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Retrieval: shared features between query and candidate tower #699

Closed · JV-Nunes closed this issue 11 months ago

JV-Nunes commented 12 months ago

I'm developing a system based on the two-tower architecture as defined in the Retrieval tutorial. I tested two approaches during development:

  1. Passing all features except 'item_id' to the query_model, and passing all item features to the candidate_model

  2. Passing all features except the item features to the query_model, and passing all item features to the candidate_model

Approach 1 yields much better training and validation metrics than approach 2. I am using several contextual features in both approaches. Are there any problems with approach 1? None of the tutorials I've seen or research I've done shares features between the towers.

Metrics:

**Approach 1 (shared features), by epoch:**

| Metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| factorized_top_k/top_1_categorical_accuracy | 0.1499 | 0.2493 | 0.3120 | 0.3024 | 0.3027 | 0.2852 | 0.2891 | 0.2891 | 0.2774 | 0.2825 |
| loss | 77.9557 | 8.9393 | 3.1107 | 2.1203 | 1.8437 | 1.6977 | 1.6517 | 1.5932 | 1.5578 | 1.5319 |
| regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| total_loss | 77.9557 | 8.9393 | 3.1107 | 2.1203 | 1.8437 | 1.6977 | 1.6517 | 1.5932 | 1.5578 | 1.5319 |
| val_factorized_top_k/top_1_categorical_accuracy | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_loss | 537.8380 | 257.5818 | 183.3954 | 155.2823 | 142.6167 | 134.3614 | 129.6871 | 126.0445 | 124.0921 | 122.6218 |
| val_regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_total_loss | 537.8380 | 257.5818 | 183.3954 | 155.2823 | 142.6167 | 134.3614 | 129.6871 | 126.0445 | 124.0921 | 122.6218 |

**Approach 2 (no shared features), by epoch:**

| Metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| factorized_top_k/top_1_categorical_accuracy | 0.0352 | 0.0338 | 0.0600 | 0.0786 | 0.0772 | 0.0939 | 0.1009 | 0.1163 | 0.1353 | 0.1555 |
| loss | 239.7879 | 75.0281 | 49.3811 | 29.5558 | 21.9984 | 18.2773 | 15.3387 | 13.0787 | 11.4170 | 9.9223 |
| regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| total_loss | 239.7879 | 75.0281 | 49.3811 | 29.5558 | 21.9984 | 18.2773 | 15.3387 | 13.0787 | 11.4170 | 9.9223 |
| val_factorized_top_k/top_1_categorical_accuracy | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_loss | 1446.6731 | 825.4975 | 761.7341 | 738.1477 | 721.4468 | 741.6263 | 769.6769 | 806.4929 | 844.6812 | 873.6269 |
| val_regularization_loss | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| val_total_loss | 1446.6731 | 825.4975 | 761.7341 | 738.1477 | 721.4468 | 741.6263 | 769.6769 | 806.4929 | 844.6812 | 873.6269 |
caesarjuly commented 12 months ago

Yes. The main problem comes from serving. At inference time, you have to generate the user embedding first and then use this embedding to search for the relevant item embeddings in an ANN search engine like FAISS. This is a tradeoff between serving latency and model performance: if you mix item features into the user side, then you have to re-generate the user embedding for each item.
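
For concreteness, here is a minimal sketch of that serving flow (the embeddings are random stand-ins for the outputs of trained towers, and FAISS is just one example of an ANN engine):

```python
import numpy as np
import faiss

embedding_dim, num_items = 32, 10_000

# Offline: embed every candidate once with the candidate tower and index it.
candidate_embeddings = np.random.rand(num_items, embedding_dim).astype("float32")
index = faiss.IndexFlatIP(embedding_dim)  # exact inner-product search
index.add(candidate_embeddings)

# Online: one forward pass through the query tower per request...
query_embedding = np.random.rand(1, embedding_dim).astype("float32")
# ...then a single lookup, independent of any item features.
scores, item_indices = index.search(query_embedding, 10)
```

If item features leak into the query tower, that single `index.search` call has to be replaced by one query-tower forward pass per candidate, which defeats the purpose of the index.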

patrickorlando commented 12 months ago

Hi @JV-Nunes, no feature of the positive item should be passed to the query tower. There are two problems with mixing them:

  1. You are leaking information about the target item. This will of course improve the metrics.

  2. During inference, you do not know the target item. What values for the item features would you pass to the query tower then? You could score each item in your corpus individually, but this is inefficient for large corpora, and at that point the problem should be modelled as a ranking problem with a more expressive model architecture.

The goal of the two-tower architecture is efficient retrieval, and it relies on two facts (see the sketch after this list):

  1. The candidate embeddings can be pre-calculated and don't depend on the query.
  2. The query doesn't depend on the item, so it can be embedded once and this embedding can be used to score all possible items (in practice, approximate nearest-neighbour search is used).
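
Since the two towers only meet at the dot product, the candidate side can be embedded and indexed ahead of time. A minimal TFRS sketch of that separation, assuming plain string-id features (feature names, vocabularies, and dimensions here are illustrative, not from the original post):

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

class TwoTowerModel(tfrs.Model):
    def __init__(self, user_ids, item_ids, items_ds):
        super().__init__()
        # Query tower: user/context features only.
        self.query_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=user_ids),
            tf.keras.layers.Embedding(len(user_ids) + 1, 32),
        ])
        # Candidate tower: item features only.
        self.candidate_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=item_ids),
            tf.keras.layers.Embedding(len(item_ids) + 1, 32),
        ])
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=items_ds.batch(128).map(self.candidate_model)
            )
        )

    def compute_loss(self, features, training=False):
        # No item feature ever reaches the query tower, and vice versa.
        query_embeddings = self.query_model(features["user_id"])
        candidate_embeddings = self.candidate_model(features["item_id"])
        return self.task(query_embeddings, candidate_embeddings)
```

After training, the candidate embeddings can be computed once and served from an index such as `tfrs.layers.factorized_top_k.BruteForce` (or ScaNN for approximate search), precisely because they do not depend on the query.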
JV-Nunes commented 12 months ago

@caesarjuly @patrickorlando Thank you for the valuable guidance. I have one more question:

When prototyping my model, I performed prediction via a brute-force index, using each user's last interaction to generate recommendations. That is, all features from the last interaction, including item features like 'item_name' or 'item_category'. Does this imply that the retrieval I'm performing is just fetching the very item from the last interaction? In other words, would the system always recommend only the last item that was purchased?

The correct approach would be to use only user + interaction features in the query model, excluding all item features at inference, right?

JV-Nunes commented 12 months ago

Regarding the serving part, do you recommend any guide or material on the best way to carry out the deployment? In my case there is no need for real-time inference, so it can be done in batch without any problems.

caesarjuly commented 12 months ago

> @caesarjuly @patrickorlando Thank you for the valuable guidance. I have one more question:
>
> When prototyping my model, I performed prediction via a brute-force index, using each user's last interaction to generate recommendations. That is, all features from the last interaction, including item features like 'item_name' or 'item_category'. Does this imply that the retrieval I'm performing is just fetching the very item from the last interaction? In other words, would the system always recommend only the last item that was purchased?
>
> The correct approach would be to use only user + interaction features in the query model, excluding all item features at inference, right?

That depends on how you construct your training dataset. It's actually a common approach to put the user's behavior sequence in the user tower, but you have to ensure that the target item in the item tower is the next behavior. For example, if the user tower uses items up to the T-th interaction as features, the item tower must use the (T+1)-th item as the target. The principle is that there must be no information leakage, namely no target-item information in the user tower.
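
For illustration, here is one leakage-free way to construct such examples from an interaction log (the column names are assumptions about your data, not part of the library):

```python
import pandas as pd

interactions = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "item_id":   ["i1", "i2", "i3", "i4", "i5"],
    "timestamp": [1, 2, 3, 1, 2],
})

examples = []
for user_id, group in interactions.sort_values("timestamp").groupby("user_id"):
    items = group["item_id"].tolist()
    for t in range(1, len(items)):
        examples.append({
            "user_id": user_id,
            "history": items[:t],      # query-tower feature: items up to T
            "target_item": items[t],   # candidate-tower label: the (T+1)-th item
        })
```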

caesarjuly commented 12 months ago

> Regarding the serving part, do you recommend any guide or material on the best way to carry out the deployment? In my case there is no need for real-time inference, so it can be done in batch without any problems.

Here is the official document from TensorFlow: https://www.tensorflow.org/recommenders/examples/efficient_serving#approximate_prediction. If your candidate set is small and there is no requirement for real-time inference, you can try any kind of ranking model that mixes the user features and target-item features together. The two-tower model is designed for online candidate retrieval and may not be suitable for your case.
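
If you go the ranking route, a minimal sketch in the spirit of the TFRS ranking tutorial might look like this (the feature names, the implicit `purchased` label, and the layer sizes are illustrative assumptions):

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

class PurchaseRankingModel(tfrs.Model):
    def __init__(self, user_ids, item_ids):
        super().__init__()
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=user_ids),
            tf.keras.layers.Embedding(len(user_ids) + 1, 32),
        ])
        self.item_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=item_ids),
            tf.keras.layers.Embedding(len(item_ids) + 1, 32),
        ])
        # Unlike the two-tower model, user and item features are scored jointly,
        # which is fine when every candidate can be scored at inference time.
        self.score_model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        self.task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.BinaryCrossentropy(),
            metrics=[tf.keras.metrics.AUC()],
        )

    def compute_loss(self, features, training=False):
        user_emb = self.user_embedding(features["user_id"])
        item_emb = self.item_embedding(features["item_id"])
        scores = self.score_model(tf.concat([user_emb, item_emb], axis=1))
        return self.task(labels=features["purchased"], predictions=scores)
```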

JV-Nunes commented 12 months ago

@caesarjuly Thank you very much. My candidate set is small indeed, but I do not have explicit feedback in my data. Is ranking the better option even with implicit feedback?

caesarjuly commented 12 months ago

> @caesarjuly Thank you very much. My candidate set is small indeed, but I do not have explicit feedback in my data. Is ranking the better option even with implicit feedback?

Yes. Implicit labels like clicks are very widely used in industry. If your requirement is simply to predict the likelihood of clicks from a small candidate set, a ranking model is naturally a reasonable choice.

JV-Nunes commented 11 months ago

@caesarjuly My goal is to predict purchase likelihood. The problem is that my dataset contains only purchase interactions, which are very sparse. How can I build a ranking model in this context? Which features would I use to provide the context needed to predict purchases without other implicit feedback?

JV-Nunes commented 11 months ago

For some context, I am using an offline retail purchase dataset. We do not have any interaction data besides purchase interactions.

caesarjuly commented 11 months ago

> @caesarjuly My goal is to predict purchase likelihood. The problem is that my dataset contains only purchase interactions, which are very sparse. How can I build a ranking model in this context? Which features would I use to provide the context needed to predict purchases without other implicit feedback?

Got you. If you had dense labels like clicks, that would be much better: you could consider models like ESMM to do multi-task transfer learning. If there are no other labels, you may want to check the solution from the Airbnb paper. The key idea for mitigating the sparsity issue is to group users (and items) into buckets.

> To address these very common marketplace problems in practice, we propose to learn embeddings at a level of listing_type instead of listing_id. Given meta-data available for a certain listing_id such as location, price, listing type, capacity, number of beds, etc., we use a rule-based mapping defined in Table 3 to determine its listing_type.

[image: the rule-based listing_type mapping, Table 3 of the Airbnb paper]
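
To make the bucketing idea concrete, a hypothetical rule-based mapping for a retail setting might look like the sketch below (the attributes and thresholds are made up for illustration; the Airbnb paper defines its own mapping in Table 3):

```python
def item_type(item):
    """Map a sparse item_id's metadata to a coarse item_type bucket."""
    price = "low" if item["price"] < 20 else "mid" if item["price"] < 100 else "high"
    return f"{item['category']}|{price}"

def user_type(user):
    """Map a sparse user_id's metadata to a coarse user_type bucket."""
    freq = "rare" if user["purchases_per_year"] < 3 else "regular"
    return f"{user['region']}|{freq}"

# Embeddings are then learned per type instead of per raw id:
print(item_type({"category": "grocery", "price": 12.5}))        # grocery|low
print(user_type({"region": "BR-SP", "purchases_per_year": 7}))  # BR-SP|regular
```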

Ullar-Kask commented 11 months ago

> For some context, I am using an offline retail purchase dataset. We do not have any interaction data besides purchase interactions.

No, you do have. You have the list of products that each customer has not purchased, and this can also be regarded as "interaction" data: we generate negative samples randomly from this list. You can go even further: if you store the list of recommended products and determine that the customer has not purchased some of them (e.g. within a window of 30 to 90 days), you can treat those non-purchased products as negative samples.
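
As a concrete sketch, random negative sampling from the non-purchased set can be as simple as this (the data structures are illustrative):

```python
import random

all_items = {"i1", "i2", "i3", "i4", "i5"}
purchases = {"u1": {"i1", "i3"}, "u2": {"i2"}}

def sample_negatives(user_id, num_negatives=2, seed=None):
    """Draw items the user never purchased and treat them as label-0 examples."""
    rng = random.Random(seed)
    candidates = sorted(all_items - purchases[user_id])
    return rng.sample(candidates, min(num_negatives, len(candidates)))

training_rows = []
for user_id, bought in purchases.items():
    training_rows += [(user_id, item, 1) for item in sorted(bought)]
    training_rows += [(user_id, item, 0) for item in sample_negatives(user_id, seed=42)]
```

Tracked recommendations that did not convert give harder, more reliable negatives than pure random sampling, since you know the user actually saw them.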

JV-Nunes commented 11 months ago

@Ullar-Kask For sure, good reminder on your part: we can use the results of past campaigns as additional feedback. I even think this negative feedback is more reliable, because it guarantees that the user actually received an impression of the item.