tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

[Question] Unclear incorporation of user or item metadata #565

Closed EdwardALockhart closed 2 years ago

EdwardALockhart commented 2 years ago

Hi,

I have been going through the tutorials, and one question remains: how to incorporate user and item metadata.

I have seen examples where both user and item metadata are incorporated into the User Model (Query Model - the interchangeable names can be confusing). For example here https://github.com/drtinumohan/tfrs_amazon_dataset/blob/main/tfrs_amazon.ipynb, there are user attributes (residence) and item attributes (cabin type) being incorporated into the User Model. Why not incorporate them both or one of them in the Item Model (Candidate Model)?

Is there any standard approach detailing where such features should be incorporated and how they might fit into say the basic quick start example here? https://www.tensorflow.org/recommenders/examples/quickstart

I know that both layers will have to have vocabs produced to convert to integers, but beyond that, what rules should be followed in terms of slotting these metadata into a simple retrieval model like in the quick start example?

Thanks!

patrickorlando commented 2 years ago

I would suggest you think in terms of queries and candidates. A query is the context for which you want to retrieve a set of candidates. The query input features should not contain information about the target candidate that is not available at inference time.

In the example above, the user is performing a search. The query is the USER_ID, the user's country USER_RESIDENCE and the desired travel class CABIN_TYPE. All of this can be provided at inference time.

So the distinction is pretty clear. You can add information about the user or the recommendation context to the query tower and you can add item features to the candidate tower.
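To make the split concrete, here is a minimal plain-Python sketch (not TFRS code) that routes features into the two towers. The query feature names come from the example above; the `route_features` helper, the `ITEM_ID` feature name and the sample values are hypothetical and only there for illustration.

```python
# Plain-Python illustration only; no TensorFlow required.
# Features that are known when the request is made (names from the example above).
QUERY_FEATURES = {"USER_ID", "USER_RESIDENCE", "CABIN_TYPE"}
# Hypothetical candidate-side feature name, assumed for illustration.
CANDIDATE_FEATURES = {"ITEM_ID"}


def route_features(row):
    """Split one training row into query-tower and candidate-tower inputs."""
    unknown = set(row) - QUERY_FEATURES - CANDIDATE_FEATURES
    if unknown:
        # Force an explicit decision about where every feature belongs.
        raise ValueError(f"Unassigned features: {sorted(unknown)}")
    query = {k: v for k, v in row.items() if k in QUERY_FEATURES}
    candidate = {k: v for k, v in row.items() if k in CANDIDATE_FEATURES}
    return query, candidate


# Made-up sample row.
row = {
    "USER_ID": "u1",
    "USER_RESIDENCE": "United States",
    "CABIN_TYPE": "Economy",
    "ITEM_ID": "airline_a",
}
query_inputs, candidate_inputs = route_features(row)
print(query_inputs)
print(candidate_inputs)
```

At inference time only the query dictionary has to be supplied, which is a quick way to check that nothing about the target candidate has leaked into the query tower.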

EdwardALockhart commented 2 years ago

I see, this makes a lot more sense now when you try to write the code and get recommendations for a specific user - you also have to supply the other characteristics required by the query model, some of which you couldn't possibly know at inference time such as the bought item and its characteristics.

Thanks!

Though now that I know the user and their characteristics at inference time (inputs into the Query Model), I can supply the item and its characteristics as candidates (Candidate Model). How do I produce a recommendation in this instance?

I can't seem to find an example of running this prediction stage with data other than item IDs as a list of candidates and user ID as a query. My goal here is to learn that some items are related to one another by categories and the same for users but all that should be recommended is an item.

Below is the code that I am using with some lines omitted for clarity

ratings = tf.data.Dataset.from_tensor_slices(df[['user', 'item', 'item_type', 'user_type', 'strength']].to_dict(orient = 'list'))
items = tf.data.Dataset.from_tensor_slices(df[['item', 'item_type']].drop_duplicates().to_dict(orient = 'list'))

...

class QueryModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.user_layers = user_layers
        self.user_type_layers = user_type_layers

    def call(self, features):
        return tf.concat([self.user_layers(features["user"]), 
                          self.user_type_layers(features["user_type"])], axis = 1)

class CandidateModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.item_layers = item_layers
        self.item_type_layers = item_type_layers

    def call(self, features):
        return tf.concat([self.item_layers(features["item"]), 
                          self.item_type_layers(features["item_type"])], axis = 1)

query_model = tf.keras.Sequential(QueryModel())
candidate_model = tf.keras.Sequential(CandidateModel())
retrieval_task = tfrs.tasks.Retrieval(metrics = tfrs.metrics.FactorizedTopK(items.batch(128).map(candidate_model), ks = [5, 10]))

...

class RetrievalModel(tfrs.models.Model):
    def __init__(self, query_model, candidate_model, retrieval_task):
        super().__init__()
        self.query_model: tf.keras.Model = query_model
        self.candidate_model: tf.keras.Model = candidate_model
        self.retrieval_task: tf.keras.layers.Layer = retrieval_task

    def compute_loss(self, features, training = False):
        query_embeddings = self.query_model(features)
        positive_candidate_embeddings = self.candidate_model(features)
        return self.retrieval_task(query_embeddings,
                                   positive_candidate_embeddings,
                                   compute_metrics = not training)

# Train and test
retrieval_model = RetrievalModel(query_model,
                                 candidate_model,
                                 retrieval_task)
retrieval_model.compile(optimizer = tf.keras.optimizers.Adagrad(learning_rate = 0.1))
retrieval_model.fit(train,
                    validation_data = test,
                    epochs = 10)
retrieval_results = retrieval_model.evaluate(test, return_dict = True)

# Get candidate recommendations
index = tfrs.layers.factorized_top_k.BruteForce(retrieval_model.query_model)
index.index_from_dataset(tf.data.Dataset.zip((items.batch(100),
                                              items.batch(100).map(retrieval_model.candidate_model))))

When I try to generate the recommendations, I get an error on the last line:

  File "/tmp/ipykernel_3209/2589562812.py", line 1, in <cell line: 1>
    index.index_from_dataset(tf.data.Dataset.zip((items.batch(100),

  File "/mnt/e0fdda2b-8695-46fc-b7ef-788e3852324c/DataG7/Computing/Python/VirtualEnvironments/tf/lib/python3.10/site-packages/tensorflow_recommenders/layers/factorized_top_k.py", line 197, in index_from_dataset
    _check_candidates_with_identifiers(candidates)

  File "/mnt/e0fdda2b-8695-46fc-b7ef-788e3852324c/DataG7/Computing/Python/VirtualEnvironments/tf/lib/python3.10/site-packages/tensorflow_recommenders/layers/factorized_top_k.py", line 127, in _check_candidates_with_identifiers
    if candidates_spec.shape[0] != identifiers_spec.shape[0]:

AttributeError: 'dict' object has no attribute 'shape'

I can bring across both attributes by simply concatenating them, as in the code below (from https://github.com/tensorflow/recommenders/issues/318#issuecomment-1102099467). Does this seem sensible? I fear this might affect the ranking stage later, when these recommendations are re-ranked, because of their format.

index = tfrs.layers.factorized_top_k.BruteForce(retrieval_model.query_model)
index.index_from_dataset(items.batch(100).map(lambda x: (x['item'] + x['item_type'],
                                                         retrieval_model.candidate_model(x))))

I can get out a recommendation as item + item_type for a single user with characteristics using

query = {"user": tf.constant(["user_MarcGimbel"]), "user_type": tf.constant(["United States"])}
affinity_scores, recommended_items = index(query)

Supplying user and user_type is no problem for inference, but I am unsure about the candidate model using item and item_type as I can't think of how else it can learn that they are related despite my desire for just recommending items without the item_type information.

I'm quite new to TensorFlow, so the fact that I'm at this stage is a testament to how good the tutorials are, just some help is required with niche areas that aren't covered.

Thanks

patrickorlando commented 2 years ago

Yep, exactly. Thinking about how you intend to use the model typically can clear up what features you can use and where.

You are misunderstanding the index_from_dataset method.

https://github.com/tensorflow/recommenders/blob/8b249f3fc0f8d3d907eecf010809a5df3759d65d/tensorflow_recommenders/layers/factorized_top_k.py#L175-L189

You need to map your dataset so that it consists of tuples (item_id, item_vector), as follows.

index = tfrs.layers.factorized_top_k.BruteForce(retrieval_model.query_model)
index.index_from_dataset(items.batch(100).map(
    lambda inputs: (inputs['item'], retrieval_model.candidate_model(inputs))
))
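
As a rough mental model of why the pairs matter, here is a toy brute-force index in plain Python. It is not how TFRS implements `BruteForce`; the `TinyBruteForceIndex` class, `toy_candidate_model` and the embedding values are all hypothetical. It only shows that each candidate contributes one identifier and one vector, and that a whole feature dict cannot serve as the identifier.

```python
class TinyBruteForceIndex:
    """Toy stand-in for a brute-force retrieval index (hypothetical, not TFRS)."""

    def __init__(self):
        self.identifiers = []
        self.embeddings = []

    def index(self, pairs):
        """Store (identifier, embedding) pairs, one per candidate."""
        for identifier, embedding in pairs:
            if isinstance(identifier, dict):
                # Loosely mirrors the failure in the thread: a feature dict
                # is not a usable identifier.
                raise TypeError("identifier must be a single value, not a dict")
            self.identifiers.append(identifier)
            self.embeddings.append(list(embedding))

    def query(self, query_embedding, k=2):
        """Return the k (score, identifier) pairs with the highest dot product."""
        scored = [
            (sum(q * c for q, c in zip(query_embedding, emb)), ident)
            for ident, emb in zip(self.identifiers, self.embeddings)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up embeddings standing in for a trained candidate model.
TOY_EMBEDDINGS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}


def toy_candidate_model(features):
    """Pretend candidate tower: looks up a fixed vector by item name."""
    return TOY_EMBEDDINGS[features["item"]]


candidates = [
    {"item": "a", "item_type": "x"},
    {"item": "b", "item_type": "y"},
    {"item": "c", "item_type": "x"},
]

toy_index = TinyBruteForceIndex()
# The model sees the full feature dict; only the item name is kept as the id.
toy_index.index((f["item"], toy_candidate_model(f)) for f in candidates)
top = toy_index.query([1.0, 0.2], k=2)
print([ident for _, ident in top])
```

The candidate model still receives `item_type` as an input feature here; the identifier is just the label that comes back with each score.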

EdwardALockhart commented 2 years ago

Thanks for this. Because I'm supplying metadata with my items in the form of item + item_type, when I map items as you suggest the index obviously contains duplicates. So for a given query the same item can appear multiple times, because the item_type that would have differentiated them is omitted.

Is there no way to supply items with metadata to the model while just getting unique items out? Or do you have to supply metadata (fudging the index_from_dataset bit as I did) or just supply items as you have suggested and de-duplicate the final list?

patrickorlando commented 2 years ago

The id is used for nothing more than identifying the recommended item. It should be unique for each candidate. I don't fully understand how you can have the same item with different type, but if you need to create a new composite id to make it work then that's fine.
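
A small plain-Python sketch of the two ways to end up with unique candidate ids, assuming rows shaped like the `item` / `item_type` data earlier in the thread. The `unique_candidates` helper, the `|` separator and the sample rows are hypothetical.

```python
def unique_candidates(rows, use_composite_id):
    """Return (identifier, features) pairs, one per unique identifier.

    First-seen order is preserved.
    """
    seen = {}
    for row in rows:
        if use_composite_id:
            # Hypothetical separator; any unambiguous scheme would do.
            identifier = f"{row['item']}|{row['item_type']}"
            features = dict(row)
        else:
            # Drop the sub-level feature so each item appears exactly once.
            identifier = row["item"]
            features = {"item": row["item"]}
        seen.setdefault(identifier, features)
    return list(seen.items())


# Made-up rows: the same item occurs with several item types.
rows = [
    {"item": "item_1", "item_type": "type_a"},
    {"item": "item_1", "item_type": "type_b"},
    {"item": "item_2", "item_type": "type_a"},
    {"item": "item_1", "item_type": "type_a"},
]

composite = unique_candidates(rows, use_composite_id=True)
plain = unique_candidates(rows, use_composite_id=False)
print([identifier for identifier, _ in composite])
print([identifier for identifier, _ in plain])
```

With the composite id, each item and type combination is treated as its own candidate; without it, the type is dropped from the candidate features so that one id maps to exactly one embedding.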

EdwardALockhart commented 2 years ago

I completely understand now. My items were a higher-level category (think airlines) and my item types were a level below (seat types under an airline). Since I was selecting the higher level as my item, I captured all of the different combinations of it with its subtypes, so when I removed the subtypes the higher-level categories were obviously duplicated. My problem was selecting that higher category as an item. I should either concatenate them and treat them as distinct items (if I wanted to predict airline and seat type), or just omit any information below my items (removing the subtypes and ending up with a unique airline list, if I wanted to predict that), as it wouldn't help learn any relationship between the higher categories anyway.

Thank you so much!