tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

How to create a feature in the candidate model that depends on (varies with) each row in the dataset? #610

Open msoutojr opened 1 year ago

msoutojr commented 1 year ago

Hi everyone.

I'm implementing a multitask model to predict the next item in the basket.

In my candidate model, I would like to insert a feature that represents the score (likelihood) between the current item in the basket and each candidate.

This means that this feature must be recalculated for each row in the dataset (because the current item will be different).

This score would be calculated by another model (an item-to-item retrieval model). In this case, I would pass the current item in the basket into this item-to-item model and use the resulting score to populate the feature I want on the candidate side.

The issue is that I need the "current item" in the basket for each row of the dataset, and I don't have access to it at model initialization.

I've tried creating the retrieval task in __init__() without the metrics (and candidates), and then setting the metrics (and candidates) inside call() (where I do have access to the batch of current basket items). But that fails because metrics is a read-only @property:

AttributeError: Can't set the attribute "metrics", likely because it conflicts with an existing read-only @property of the object. Please choose a different name.

Is there another way to map this feature into the candidate model that depends on the current item in the basket (each row in the dataset)?

from typing import Dict, Text, Tuple

import tensorflow as tf
import tensorflow_recommenders as tfrs


class Score_Multitask(tfrs.models.Model):

      def __init__(self, n_past = 1, rnn='GRU', dense_units= [128,64], 
                   dense_activation="relu", embedding_dimension=64, 
                   l2 = 1e-4, pool='average', rating_weight=0.5, 
                   retrieval_weight=0.5, retrieval_weight_2=0.1, submodel=None 
                   ) -> None:

            super().__init__()

            # Retrieval Item Model
            self.retrieval_item_model = submodel

            # Query and Candidate models.       
            self.query_model = QueryModel(rnn=rnn, n_past=n_past, 
                                    embedding_dimension=embedding_dimension, 
                                    pool=pool, dense_units=dense_units,  l2=l2)

            self.candidate_model = CandidateModel(
                embedding_dimension=embedding_dimension, pool=pool, 
                dense_units=dense_units,  l2=l2)

            # Ranking model
            self.rating_model = tf.keras.Sequential([
                tf.keras.layers.Dense(256, activation="relu"),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(1),
            ])

            # Retrieval task (user-item): created in __init__() without
            # metrics (and candidates), hoping to set them later in call()
            self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval()

            # Ranking tasks. 
            self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
                loss=tf.keras.losses.MeanSquaredError(),
                metrics=[tf.keras.metrics.RootMeanSquaredError()],
            )

            # The loss weights.
            self.rating_weight = rating_weight
            self.retrieval_weight = retrieval_weight

      def call(self, features: Dict[Text, tf.Tensor]) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:

            # query embedding
            query_embeddings = self.query_model({
                "user_id": features["user_id"],
                "weekend": features["weekend"],
                "order_hour_of_day": features["order_hour_of_day"],
                "add_to_cart_order": features["add_to_cart_order"],
                "days_since_prior_order" : features["days_since_prior_order"],
                "past_product_id": features["past_product_id"],
                "past_department_id": features["past_department_id"],
                "past_product_name": features["past_product_name"],
            })

            # candidate embedding
            candidate_embeddings = self.candidate_model({
                "product_id": features["product_id"],
                "department_id": features["department_id"],
                "product_name": features["product_name"],
                "score": self.retrieval_item_model.compute_score(x,
                                                    features["past_product_id"])
            })

            # create the candidate corpus with the score feature
            # (`products` is an external tf.data.Dataset of candidate features)
            candidates = products.batch(1024).map(lambda x: {
                "product_id": x["product_id"],
                "product_name": x["product_name"],
                "department_id": x["department_id"],
                # again scored against the current item in the basket
                "score": self.retrieval_item_model.compute_score(
                    x["product_id"], features["past_product_id"])
            })

            # here is the problem: I can't set the metrics (read-only)
            self.retrieval_task.metrics = tfrs.metrics.FactorizedTopK(
                    candidates=candidates,
                    ks= (5, 10),
                    name='user_item'
                )

            return (
                      query_embeddings,
                      candidate_embeddings,
                      self.rating_model(
                            tf.concat([query_embeddings, 
                                       candidate_embeddings], axis=1)
                            ),
                  )

      def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:

            ratings = features['rating']

            query_embeddings, candidate_embeddings, rating_predictions = self(features)

            # We compute the loss for each task.
            rating_loss = self.rating_task(
                labels=ratings,
                predictions=rating_predictions,
            )
            retrieval_loss = self.retrieval_task(query_embeddings, candidate_embeddings)

            # And combine them using the loss weights.
            return (self.rating_weight * rating_loss
                    + self.retrieval_weight * retrieval_loss
                    )

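
(Side note on the AttributeError above: Keras reserves metrics as a read-only property on every layer, but the TFRS Retrieval task exposes its FactorizedTopK metric through a separate, settable factorized_metrics property. Assuming a reasonably recent TFRS version, the assignment attempted in call() could be sketched like this; it fixes the error, though not the cost of rebuilding the candidate corpus on every batch.)

# Sketch: assign retrieval metrics through the settable `factorized_metrics`
# property rather than the read-only Keras `metrics` property.
self.retrieval_task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=candidates.map(self.candidate_model),  # embeddings, not raw dicts
    ks=(5, 10),
    name="user_item",
)
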
maciejkula commented 1 year ago

I think what you're trying to do is fundamentally incompatible with retrieval models.

These are "factorized" models: that is, the candidate features cannot depend on the query features (in your case, the current item in the basket). This is what enables efficient retrieval, as candidate representations do not need to be recomputed for each query.

The setup you're thinking of is normally accomplished via a two-stage pipeline, with a retrieval model first and a ranking model (that has fewer limitations) later.
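
A rough sketch of the two stages (rerank_model is a hypothetical ranking model; compute_score is the helper from the code above):

# Stage 1: retrieval. The factorized index shrinks the corpus to a shortlist
# using only query-independent candidate representations.
_, shortlist_ids = index(query_features, k=100)

# Stage 2: ranking. With ~100 candidates instead of the full corpus,
# query-dependent features become affordable, e.g. the item-to-item score
# between the current basket item and each shortlisted candidate.
item_scores = model.retrieval_item_model.compute_score(
    shortlist_ids, query_features["past_product_id"])
final_order = rerank_model({"candidate_id": shortlist_ids,
                            "score": item_scores})
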

hkristof03 commented 1 year ago

@maciejkula I think my question is related to this topic, and maybe you have already answered it. I was struggling to include item prices. I am working on an item-item recommender now, where the price could appear on both sides (though the problem applies to user-item recommendations as well). The problem is that even if I bucket the prices from 1 to 5 (for example, based on similar items in the given category), the prices change over time, so the number of distinct candidate representations grows, which affects the retrieval index as well.

I guess the correct solution is to include the price information only in the Ranking model, but not in the Retrieval model, right?
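
For concreteness, the split I have in mind is something like this (a sketch; the volatile price feature feeds only the ranking stage, so the retrieval index never needs rebuilding when prices drift):

# Retrieval towers: no price anywhere, so the candidate index stays valid.
query_embedding = query_model(query_features)
candidate_embedding = candidate_model(candidate_features)

# Ranking stage: volatile features such as price enter only here.
rating = rating_model(tf.concat([
    query_embedding,
    candidate_embedding,
    tf.expand_dims(tf.cast(candidate_features["price"], tf.float32), axis=-1),
], axis=1))
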

Thanks in advance!