tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

[Question] Difference between ncf and two-tower-model #628

Open jillwalker99 opened 1 year ago

jillwalker99 commented 1 year ago

Hi all, as I delve further into recommender systems, a question has come up: what is the difference between the two-tower model used in the basic retrieval tutorial (https://www.tensorflow.org/recommenders/examples/basic_retrieval) and neural collaborative filtering (https://arxiv.org/abs/1708.05031)?

Many thanks in advance.

patrickorlando commented 1 year ago

Neural Collaborative Filtering is a class of embedding factorization models in which the similarity function between the user and item embeddings is learned (usually by an MLP) rather than being a dot product. The model you linked can be thought of as a two-tower model (not all NCF architectures can), but it would be considered a ranking model rather than a retrieval model.

In the NCF case, if you have K items, you need to run inference with the model K times to get predictions for a single user. This is expensive and often becomes infeasible with as few as 10K candidates.
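To make the cost difference concrete, here is a minimal sketch. The embedding sizes and the mlp_scorer network are illustrative placeholders, not taken from the linked tutorial or paper:

import tensorflow as tf

num_items, dim = 10_000, 32
user_emb = tf.random.normal([1, dim])           # one user's embedding
item_embs = tf.random.normal([num_items, dim])  # all K candidate embeddings

# Dot-product model: a single matmul scores every candidate at once,
# and the item embeddings can be pre-indexed for nearest-neighbour search.
dot_scores = tf.squeeze(tf.matmul(user_emb, item_embs, transpose_b=True))

# NCF-style learned similarity: the MLP has to see every (user, item)
# pair, so the compute grows linearly with the number of candidates.
mlp_scorer = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
pairs = tf.concat([tf.tile(user_emb, [num_items, 1]), item_embs], axis=1)
mlp_scores = tf.squeeze(mlp_scorer(pairs))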

The paper "Neural Collaborative Filtering vs. Matrix Factorization Revisited" suggests that the benefit of an MLP scoring function is marginal and requires careful tuning, whilst a dot-product interaction is a robust choice that offers efficient serving and scales to millions of candidate items.
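For example, in TFRS the retrieval index is built directly from the candidate embeddings after training. A rough sketch, assuming a trained two-tower model with user_model/item_model towers and an items dataset of identifiers, as in the basic retrieval tutorial (model, items, and "user_42" are placeholders):

import tensorflow as tf
import tensorflow_recommenders as tfrs

# Brute-force top-k over the candidate embeddings; swap in
# tfrs.layers.factorized_top_k.ScaNN for approximate search at scale.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    items.batch(128).map(lambda item: (item, model.item_model(item)))
)

# Top-10 candidates for one user, scored by a dot product.
scores, ids = index(tf.constant(["user_42"]), k=10)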

That said, these architectures along with others might be useful in the subsequent ranking stage.

jillwalker99 commented 1 year ago

Thank you for your feedback @patrickorlando. Would that mean the linked model belongs to the class of Neural Collaborative Filtering models? And is the linked model a ranking model because its output is the dot product, i.e. the similarity?

patrickorlando commented 1 year ago

If the model uses a learnable layer to calculate the similarity/relevance score, then it may be considered an NCF model, but I wouldn't focus too much on this terminology. The key thing to remember is that if the similarity function is a dot product (first-order), it can be efficiently computed at inference time using Approximate Nearest Neighbour search. It may, however, not be as powerful as a more expressive model that allows non-linear interactions between the user and item features (Deep Cross Networks, DLRM, gradient-boosted tree ranking, ...). The modern approach is to break the problem up into stages (see the figure below).

[Figure: multi-stage candidate generation and ranking funnel. Source: Deep Neural Networks for YouTube Recommendations]

This article from NVIDIA is also a helpful introduction to the concept.

The TFRS library is aligned with the concept of multi-stage recommender systems.
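As a side note on the more expressive rankers mentioned above, TFRS includes a cross layer from the DCN work. A minimal sketch, where the feature width and layer sizes are placeholders:

import tensorflow as tf
import tensorflow_recommenders as tfrs

# Concatenated user and item features feeding a ranking model.
features = tf.keras.Input(shape=(96,))

# Cross layers model explicit feature interactions; each call takes
# the original input x0 and the output of the previous layer.
x = tfrs.layers.dcn.Cross()(features, features)
x = tfrs.layers.dcn.Cross()(features, x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
score = tf.keras.layers.Dense(1)(x)

ranking_model = tf.keras.Model(features, score)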

OmarMAmin commented 1 year ago

@patrickorlando, thanks for the info. If the item space is around 4K, is it better to just do the ranking stage directly? Do you know of any papers discussing how candidate-space size affects the choice between ranking alone and retrieval plus ranking?

jillwalker99 commented 1 year ago

@OmarMAmin I think this depends primarily on serving time: if it stays acceptable at that number of candidates, a ranking model alone should be sufficient.

jillwalker99 commented 1 year ago

@patrickorlando thanks again :) Do you know of any other papers or explanations of the two-tower model developed in the TensorFlow guides, to understand the architecture and system in detail (apart from the YouTube recommendations paper)?

patrickorlando commented 1 year ago

@OmarMAmin, @jillwalker99 is correct, you may choose to implement a ranking-only model provided serving time and cost are within budget. There is one other benefit of a dot-product scoring function: the user and item embeddings live in a vector space with a meaningful distance metric. Items and users that are similar will be close together under cosine/Euclidean distance. You can cluster items or users, ensure that the items returned are not too similar, etc.
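For instance, once the towers are trained, similarity between items is just a matrix product. A sketch where item_embeddings stands in for your trained candidate embeddings:

import tensorflow as tf

# Placeholder for a (num_items, dim) matrix of trained item embeddings.
item_embeddings = tf.random.normal([4000, 64])

# After L2 normalization, dot products are cosine similarities.
normed = tf.linalg.l2_normalize(item_embeddings, axis=-1)
cosine_sim = tf.matmul(normed, normed, transpose_b=True)

# The six nearest neighbours of item 0 (the first hit is item 0 itself).
top_scores, top_ids = tf.math.top_k(cosine_sim[0], k=6)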

patrickorlando commented 1 year ago

@jillwalker99 The concept of the two-tower recommender model is closely related to the Dual Encoder in the Information Retrieval literature. Here are some papers that might be of interest:

jillwalker99 commented 1 year ago

Hi @patrickorlando, I have two questions again:

1. Where is the dot product calculated within the two-tower model in TensorFlow (https://www.tensorflow.org/recommenders/examples/basic_retrieval)?
2. From the perspective of the machine learning task, the two-tower model performs a classification. Is it a multi-class classification where every possible interaction represents a class, or a binary classification between positive interactions and all others? Or how should this be understood?

patrickorlando commented 1 year ago
  1. It is calculated in the retrieval task: https://github.com/tensorflow/recommenders/blob/7caed557b9d5194202d8323f2d4795231a5d0b1d/tensorflow_recommenders/tasks/retrieval.py#L160-L161
  2. It is modelled as a massive multi-class classification problem in which every candidate is a class. However, candidates are sampled in each batch rather than scoring all possible classes for every batch. This is called a sampled softmax loss; see the discussion in #334 for further details, and the simplified sketch below.
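Roughly, the in-batch version of that loss can be sketched as follows (a simplification of what the retrieval task computes, without sample weights or the optional corrections):

import tensorflow as tf

def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    # (batch, batch) score matrix; row i's positive candidate sits on
    # the diagonal, while the other columns act as sampled negatives.
    scores = tf.matmul(query_embeddings, candidate_embeddings, transpose_b=True)
    labels = tf.range(tf.shape(scores)[0])
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, scores, from_logits=True
        )
    )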
jillwalker99 commented 1 year ago

Thank you very much for your help @patrickorlando. So for each user it tries to predict the class (candidate/item), and the correct class is the user's positive interaction with an item, right? One more question about development: what if the dataset comes from a popularity-based recommender (one that recommends the X most popular products based on sales)? Then offline evaluation with e.g. top-X accuracy is distorted and live A/B tests are necessary, or am I seeing something wrong here (since niche products sell even less, as the previous recommender never recommends them)?

patrickorlando commented 1 year ago

Yes to both questions.

Evaluating recommender systems is hard, and online A/B tests are useful. A model that performs well on an offline dataset doesn't guarantee a great recommender system. In general, it's important that a model is not the only way users discover items; otherwise you create a feedback loop and the overall system can get stuck in a local minimum. In your case, the model will be biased towards popular items. You can down-weight or subsample these interactions, and you should think about how to build exploration into your system for future data collection.
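One concrete knob in TFRS is the candidate_sampling_probability argument of the retrieval task, which corrects the logits so that items appearing often as in-batch negatives are not unfairly penalized. A sketch, assuming you can look up each batch candidate's empirical frequency:

import tensorflow_recommenders as tfrs

task = tfrs.tasks.Retrieval()

def compute_loss(query_emb, candidate_emb, candidate_frequency):
    # candidate_frequency: empirical probability of sampling each
    # candidate (e.g. item interaction count / total interactions).
    return task(
        query_emb,
        candidate_emb,
        candidate_sampling_probability=candidate_frequency,
    )

The same call also accepts a sample_weight argument if you would rather down-weight popular interactions directly.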

jillwalker99 commented 1 year ago

Hi @patrickorlando, is it correct to add a dropout layer to the query_model and candidate_model, or what would be the right approach here?

self.candidate_model = tf.keras.Sequential([
    Item_Model(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
])

self.query_model = tf.keras.Sequential([
    User_Model(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
])

patrickorlando commented 1 year ago

Hi @jillwalker99, sure, dropout can be added to your query and item towers, but you will probably need to tune the rate. You might also want to add L2 normalization as the last layer of each tower and tune the temperature parameter of your retrieval task. See the discussion in #633.

jillwalker99 commented 1 year ago

Thank you @patrickorlando. Do you mean kernel_regularizer=tf.keras.regularizers.L2(0.001), or how else can L2 normalization be implemented? And what exactly does L2 normalization do? I thought it allowed, among other things, using cosine similarity instead of the dot product. What is the primary reason for using it?

patrickorlando commented 1 year ago

L2 normalization scales the vector by its Euclidean length, so the outputs of your query and candidate towers are constrained to the unit sphere. As the paper referenced in #633 states, this improves model training behaviour, but requires that the softmax temperature (which then scales the dot-product scores) be tuned carefully.

class L2Normalization(tf.keras.layers.Layer):
    """Constrains outputs to the unit sphere by L2-normalizing along `axis`."""

    def __init__(self, axis=-1, **kwargs):
        super().__init__(**kwargs)
        self._axis = axis

    def call(self, inputs):
        return tf.linalg.l2_normalize(inputs, axis=self._axis)

    def get_config(self):
        # Include the base config so the layer deserializes correctly.
        return {**super().get_config(), "axis": self._axis}

or simply

l2_norm = tf.keras.layers.Lambda(lambda x: tf.linalg.l2_normalize(x, axis=-1))
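Putting the two pieces together, the normalization layer goes at the end of each tower and the temperature is set on the retrieval task. A sketch reusing the User_Model/Item_Model towers from earlier in the thread; the 0.05 temperature is just a hypothetical starting value to tune:

import tensorflow as tf
import tensorflow_recommenders as tfrs

query_model = tf.keras.Sequential([
    User_Model(),  # assumed user tower from earlier in the thread
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
    L2Normalization(),
])
candidate_model = tf.keras.Sequential([
    Item_Model(),  # assumed item tower from earlier in the thread
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
    L2Normalization(),
])

# With unit-length embeddings the raw scores are cosine similarities in
# [-1, 1]; the temperature rescales them before the softmax.
task = tfrs.tasks.Retrieval(temperature=0.05)  # tune this value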
jillwalker99 commented 1 year ago

Thanks again :) Would it also make sense to use kernel_regularizer on the other hidden layers?

patrickorlando commented 1 year ago

@jillwalker99, perhaps. There is no one-size-fits-all approach.