tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

Similar dataset but recommendations always the same for all users #322

Open tansaku opened 3 years ago

tansaku commented 3 years ago

Having successfully got my data into the correct format, I've been able to run the quickstart recommender on my own dataset. However, for some reason, on a given run every user gets identical recommendations, with the same item repeated over and over again.

Looking at the movielens data in the same way, I see that it has the following properties:

My own dataset has the following properties:

I think I've gotten the data into the correct format. For example, the movie data mappings are like this:

b"One Flew Over the Cuckoo's Nest (1975)" b'138'
b'Strictly Ballroom (1992)' b'92'
b'Very Brady Sequel, A (1996)' b'301'
b'Pulp Fiction (1994)' b'60'
b'Scream 2 (1997)' b'197'

while I've now got the interest data similarly structured, like so:

b'Books' b'1242047'
b'Dance' b'91242048'
b'Sustainability' b'2870269'
b'Books' b'3970361'
b'Photography' b'3970362'

however, the recommended interests with my dataset are locked into the same thing each time.


Running with the movielens data there's a nice spread of recommendations, and the training output is like this:

Epoch 1/3
25/25 [==============================] - 8s 260ms/step - factorized_top_k/top_1_categorical_accuracy: 7.0000e-05 - factorized_top_k/top_5_categorical_accuracy: 0.0015 - factorized_top_k/top_10_categorical_accuracy: 0.0047 - factorized_top_k/top_50_categorical_accuracy: 0.0445 - factorized_top_k/top_100_categorical_accuracy: 0.1001 - loss: 33082.5255 - regularization_loss: 0.0000e+00 - total_loss: 33082.5255
Epoch 2/3
25/25 [==============================] - 6s 246ms/step - factorized_top_k/top_1_categorical_accuracy: 1.9000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0047 - factorized_top_k/top_10_categorical_accuracy: 0.0136 - factorized_top_k/top_50_categorical_accuracy: 0.1065 - factorized_top_k/top_100_categorical_accuracy: 0.2112 - loss: 31007.2517 - regularization_loss: 0.0000e+00 - total_loss: 31007.2517
Epoch 3/3
25/25 [==============================] - 6s 242ms/step - factorized_top_k/top_1_categorical_accuracy: 2.3000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0078 - factorized_top_k/top_10_categorical_accuracy: 0.0212 - factorized_top_k/top_50_categorical_accuracy: 0.1432 - factorized_top_k/top_100_categorical_accuracy: 0.2669 - loss: 30418.3815 - regularization_loss: 0.0000e+00 - total_loss: 30418.3815
Top 3 recommendations for user 42: [b'Rent-a-Kid (1995)' b'Only You (1994)' b'Just Cause (1995)']

and we can see a nice pattern of recommendations for users:

thing, titles = index(np.array(["1"]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[1]}: {titles}")
print(f"thing: {thing}")
thing, titles = index(np.array(["2"]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[2]}: {titles}")
print(f"thing: {thing}")

Top 10 recommendations for user 405: [[b'Doom Generation, The (1995)'
  b'Brother Minister: The Assassination of Malcolm X (1994)'
  b'Theodore Rex (1995)' b'Nadja (1994)'
  b'Turbo: A Power Rangers Movie (1997)'
  b'All Dogs Go to Heaven 2 (1996)' b'Kansas City (1996)'
  b'Maya Lin: A Strong Clear Vision (1994)' b'White Balloon, The (1995)'
  b'Flipper (1996)']]
thing: [[4.8432746 4.619417  4.452347  4.371699  3.4825168 3.1067772 3.0736032
  3.0411866 3.0028658 2.8818445]]
Top 10 recommendations for user 655: [[b'3 Ninjas: High Noon At Mega Mountain (1998)' b'Promesse, La (1996)'
  b'For the Moment (1994)' b'City of Angels (1998)'
  b"Antonia's Line (1995)" b"Marvin's Room (1996)"
  b'Once Upon a Time... When We Were Colored (1995)'
  b'Unhook the Stars (1996)' b'Kolya (1996)' b'Secrets & Lies (1996)']]
thing: [[9.494508  6.5379915 5.68907   5.3878336 5.3482184 5.290077  5.2577763
  5.2102613 5.1238346 5.1227913]]

but for the interests data we get this:

Epoch 1/3
6/6 [==============================] - 10s 2s/step - factorized_top_k/top_1_categorical_accuracy: 0.1213 - factorized_top_k/top_5_categorical_accuracy: 0.1213 - factorized_top_k/top_10_categorical_accuracy: 0.1213 - factorized_top_k/top_50_categorical_accuracy: 0.1238 - factorized_top_k/top_100_categorical_accuracy: 0.1293 - loss: 31238.8574 - regularization_loss: 0.0000e+00 - total_loss: 31238.8574
Epoch 2/3
6/6 [==============================] - 10s 2s/step - factorized_top_k/top_1_categorical_accuracy: 0.2418 - factorized_top_k/top_5_categorical_accuracy: 0.2418 - factorized_top_k/top_10_categorical_accuracy: 0.2418 - factorized_top_k/top_50_categorical_accuracy: 0.2457 - factorized_top_k/top_100_categorical_accuracy: 0.2508 - loss: 37982.0569 - regularization_loss: 0.0000e+00 - total_loss: 37982.0569
Epoch 3/3
6/6 [==============================] - 9s 2s/step - factorized_top_k/top_1_categorical_accuracy: 0.1624 - factorized_top_k/top_5_categorical_accuracy: 0.1624 - factorized_top_k/top_10_categorical_accuracy: 0.1624 - factorized_top_k/top_50_categorical_accuracy: 0.1667 - factorized_top_k/top_100_categorical_accuracy: 0.1732 - loss: 31066.0243 - regularization_loss: 0.0000e+00 - total_loss: 31066.0243
Top 3 recommendations for user 42: [b'Spa' b'Spa' b'Spa']

and here are the details for the first two users:

Top 10 recommendations for user 990000155054: [[b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa']]
thing: [[0.194 0.194 0.194 0.194 0.194 0.194 0.194 0.194 0.194 0.194]]
Top 10 recommendations for user 990000154983: [[b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa' b'Spa']]
thing: [[0.194 0.194 0.194 0.194 0.194 0.194 0.194 0.194 0.194 0.194]]

On different runs the model might recommend a different interest, but it always gets stuck on the same one for all users. I've tried longer training runs, but I'm starting to wonder if there's a need for some minimum number of "ratings" per user? I think that's the main difference between the datasets, i.e. 100 ratings for each user in the movielens dataset, but only five in my interests dataset. Or could it be that the number of possible interests is just too small?

Or am I making some stupid mistake in the code (in the interest vocabulary lookup table, perhaps)? What are the requirements on dataset size for this model to work?

Is there anything in the output I show above to indicate what's going wrong? The net getting stuck in a local minimum, perhaps?

I've tried longer training runs but that doesn't seem to make any difference. Perhaps the settings for the retrieval model need to be different given the fewer "ratings" or some other difference in the dataset proportions?

tansaku commented 3 years ago

the strange thing is that if I reduce the amount of movielens data being used (taking just the first 100 mappings), I still get sensible output from the model. For example, running

ranking, titles = index(np.array(["1"]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[1]}: {titles}")
print(f"thing: {ranking}")
ranking, titles = index(np.array(["2"]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[2]}: {titles}")
print(f"thing: {ranking}")

gives the following:

Top 10 recommendations for user 699: [[b'Harold and Maude (1971)' b'Rock, The (1996)'
  b'Mulholland Falls (1996)' b'Four Weddings and a Funeral (1994)'
  b'Aladdin (1992)' b'Sense and Sensibility (1995)' b'Local Hero (1983)'
  b'Jungle2Jungle (1997)' b"Antonia's Line (1995)"
  b'Man Without a Face, The (1993)']]
thing: [[0.16602841 0.14439449 0.13890035 0.13128598 0.11802915 0.11523528
  0.11077308 0.10783543 0.09914517 0.09864655]]
Top 10 recommendations for user 663: [[b'Harold and Maude (1971)' b'Rock, The (1996)'
  b'Mulholland Falls (1996)' b'Four Weddings and a Funeral (1994)'
  b'Aladdin (1992)' b'Sense and Sensibility (1995)' b'Local Hero (1983)'
  b'Jungle2Jungle (1997)' b"Antonia's Line (1995)"
  b'Man Without a Face, The (1993)']]
thing: [[0.16602841 0.14439449 0.13890035 0.13128598 0.11802915 0.11523528
  0.11077308 0.10783543 0.09914517 0.09864655]]

implying that the number of mappings is not an issue. So could it just be the number of possible values of user ids and items, i.e. we've got over 1000 movies, but fewer than 100 interests in my data?

But even if I reduce the number of movies to 100 I still get plausible recommendations (although they are the same for each user) - so either I've got some silly broken thing in my approach, or my reduction of the movielens data is not actually coming through due to some caching mechanism ...

GaetanDu commented 3 years ago

I was facing a similar issue. My dataset:

Then after training I had:

Recommendations for user 42: [b'clara callan' b'clara callan' b'clara callan']

That was due to my books dataset, the equivalent of the movies dataset in the example:

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])

movies is supposed to store a unique representation of each movie, as it is used in FactorizedTopK.

In my case, with books = books.map(lambda x: x["book_title"]), the books were not unique.
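
For what it's worth, one way to deduplicate directly in tf.data is the pattern the TFRS tutorials use. This is a minimal sketch, assuming books starts out as a dataset of dicts with a book_title key (names taken from the snippets above):

import numpy as np
import tensorflow as tf

# Sketch: pull all titles out, deduplicate with numpy, rebuild the dataset.
titles = books.batch(1_000).map(lambda x: x["book_title"])
unique_titles = np.unique(np.concatenate(list(titles)))
books = tf.data.Dataset.from_tensor_slices(unique_titles)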

After preprocessing my model gives me:

[b'clara callan', b'flu: the story of the great influenza pandemic of 1918 and the search for the virus that caused it', b"the kitchen god's wife"]

Hope it can help!

tansaku commented 3 years ago

wow thanks @GaetanDu that makes a lot of sense - was there a simple map operation to ensure uniqueness - converting to a set and then back to a list again maybe?

tansaku commented 3 years ago

I did just have a look at my interests, and printing them out, they are all unique ... as are the user ids ... ah, wait, but that's after they have already been processed as part of the vocabulary adaptation step ...

GaetanDu commented 3 years ago

My data is a pandas DataFrame; I didn't use TensorFlow to make the unique representation. Here is how I preprocess:

unique_book_titles_df = pd.DataFrame(overall_data.book_title.unique(), columns=['book_title'])

books = {key: col.values for key, col in dict(unique_book_titles_df).items()}
books = tf.data.Dataset.from_tensor_slices(books)

You find books/movies in the FactorizedTopK metrics and in the index method of BruteForce.

tansaku commented 3 years ago

thanks @GaetanDu - really appreciate you sharing that

I just tried

unique_interests = set()

for i in interests.take(25000):
  unique_interests.add(i.numpy())
print(unique_interests)

unique_users = set()
for i in users.take(25000):
  unique_users.add(i.numpy())

users = tf.data.Dataset.from_tensor_slices(list(unique_users))
user_ids_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(users)

interests = tf.data.Dataset.from_tensor_slices(list(unique_interests))
interests_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
interests_vocabulary.adapt(interests)

and this runs, and I get a variety of output within an individual recommendation list, but the recommendations for each user are the same as for every other user ...

getting

[[b'Gaming' b'Animation' b'Tech' b'Design' b'Bridal' b'Couture' b'Dance' b'Personal Fitness' b'Musical' b'Gardening']]

is better than

[[b'Comedy' b'Comedy' b'Comedy' b'Comedy' b'Comedy' b'Comedy' b'Comedy' b'Comedy' b'Comedy' b'Comedy']]

but I would still expect each user to get some variation in their set of recommendations

do you get different recommendations for different users ...?

GaetanDu commented 3 years ago

Yes, I have different recommendations. What are you giving to:

tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        candidates=books.batch(128).map(self.book_model)
    )
)

and

# Create a model that takes in raw query features, and
# recommends books out of the entire books dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(books.batch(1000).map(model.book_model), books)

As you can see, books is unique because I created it from DataFrame.unique().

My entire model:

dataset = {key: col.values for key, col in dict(overall_data[['user_id', 'book_title', 'rating']]).items()}
dataset = tf.data.Dataset.from_tensor_slices(dataset).prefetch(tf.data.AUTOTUNE)

unique_book_titles_df = pd.DataFrame(overall_data.book_title.unique(), columns=['book_title'])

books = {key: col.values for key, col in dict(unique_book_titles_df).items()}
books = tf.data.Dataset.from_tensor_slices(books).prefetch(tf.data.AUTOTUNE)

ratings = dataset.map((lambda x: {
    "book_title": x["book_title"],
    "user_id": x["user_id"],
    "user_rating": x["rating"]
}), num_parallel_calls=tf.data.AUTOTUNE)

books = books.map(lambda x: x["book_title"], num_parallel_calls=tf.data.AUTOTUNE)

tf.random.set_seed(42)
shuffled = ratings.shuffle(26562, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(24000)
test = shuffled.skip(24000).take(2562)

unique_user_ids = overall_data.user_id.unique()
unique_book_titles = overall_data.book_title.unique()

class BookModel(tfrs.models.Model):

    def __init__(self, rating_weight: float, retrieval_weight: float) -> None:
        # We take the loss weights in the constructor: this allows us to instantiate
        # several model objects with different loss weights.

        super().__init__()

        embedding_dimension = 32

        # User and book models.
        self.book_model: tf.keras.layers.Layer = tf.keras.Sequential([
          tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_book_titles, mask_token=None),
          tf.keras.layers.Embedding(len(unique_book_titles) + 1, embedding_dimension)
        ])
        self.user_model: tf.keras.layers.Layer = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.StringLookup(
              vocabulary=unique_user_ids, mask_token=None),
          tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
        ])

        # A small model to take in user and book embeddings and predict ratings.
        # We can make this as complicated as we want as long as we output a scalar
        # as our prediction.
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(1),
        ])

        # The tasks.
        self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()],
        )
        self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=books.batch(128).map(self.book_model)
            )
        )

        # The loss weights.
        self.rating_weight = rating_weight
        self.retrieval_weight = retrieval_weight

    def call(self, features) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the book features and pass them into the book model.
        book_embeddings = self.book_model(features["book_title"])

        return (
            user_embeddings,
            book_embeddings,
            # We apply the multi-layered rating model to a concatenation of
            # user and book embeddings.
            self.rating_model(
                tf.concat([user_embeddings, book_embeddings], axis=1)
            ),
        )

    def compute_loss(self, features, training=False) -> tf.Tensor:

        ratings = features.pop("user_rating")

        user_embeddings, book_embeddings, rating_predictions = self(features)

        # We compute the loss for each task.
        rating_loss = self.rating_task(
            labels=ratings,
            predictions=rating_predictions,
        )
        retrieval_loss = self.retrieval_task(user_embeddings, book_embeddings)

        # And combine them using the loss weights.
        return (self.rating_weight * rating_loss
                + self.retrieval_weight * retrieval_loss)

tansaku commented 3 years ago

thanks for sharing all - much appreciated - I'll see if I can replicate

For my task I have:

task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        interests.batch(128).map(interest_model)
    )
)

and for the search I have:

index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(interests.batch(100).map(model.interest_model), interests)

I'll clean up my code and share the full set next week - but mine is basically a copy of the tensorflow/recommenders docs/examples/quickstart.ipynb

GaetanDu commented 3 years ago

Yes, it will be easier to debug if you share your code

tansaku commented 3 years ago

thanks @GaetanDu - so here's what I'm doing, which I just ran in a fresh notebook. I've added a step to ensure the vocabularies are being built from a unique set of elements:

from typing import Dict, Text

import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

import pandas as pd

# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

import matplotlib.pyplot as plt

interest_train = pd.read_csv("user-interests.csv",header=0)
interest_train = interest_train.assign(interest_list=interest_train['interest_list'].str.split(';')).explode('interest_list').reset_index(drop=True)

training_dataset = (
    tf.data.Dataset.from_tensor_slices(
        (
            tf.cast(interest_train['interest_list'].values, tf.string),
            tf.cast(tf.as_string(interest_train['user_ident'].values), tf.string)
        )
    )
)

adjusted_training_dataset = training_dataset.map(lambda x,y: {
    "interest": x,
    "user_id": y
})

interests = training_dataset.map(lambda x,y: x)
users = training_dataset.map(lambda x,y: y)

unique_interests = set()
for i in interests.take(25000):
  unique_interests.add(i.numpy())

unique_users = set()
for i in users.take(25000):
  unique_users.add(i.numpy())

users = tf.data.Dataset.from_tensor_slices(list(unique_users))
user_ids_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(users)

interests = tf.data.Dataset.from_tensor_slices(list(unique_interests))
interests_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
interests_vocabulary.adapt(interests)

class InterestModel(tfrs.Model):
  # We derive from a custom base class to help reduce boilerplate. Under the hood,
  # these are still plain Keras Models.

  def __init__(
      self,
      user_model: tf.keras.Model,
      interest_model: tf.keras.Model,
      task: tfrs.tasks.Retrieval):
    super().__init__()

    # Set up user and interest representations.
    self.user_model = user_model
    self.interest_model = interest_model

    # Set up a retrieval task.
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Define how the loss is computed.

    user_embeddings = self.user_model(features["user_id"])
    interest_embeddings = self.interest_model(features["interest"])

    return self.task(user_embeddings, interest_embeddings)

# Define user and interest models.
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocab_size(), 64)
])
interest_model = tf.keras.Sequential([
    interests_vocabulary,
    tf.keras.layers.Embedding(interests_vocabulary.vocab_size(), 64)
])

# Define your objectives.
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
       interests.batch(128).map(interest_model)
  )
)

# Create a retrieval model.
model = InterestModel(user_model, interest_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

# Train for 3 epochs.
model.fit(adjusted_training_dataset.batch(4096), epochs=3)

# Use brute-force search to set up retrieval using the trained representations.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(interests.batch(1000).map(model.interest_model), interests)

so running this:

Epoch 1/3
6/6 [==============================] - 1s 107ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.9972 - factorized_top_k/top_100_categorical_accuracy: 1.0000 - loss: 34003.7478 - regularization_loss: 0.0000e+00 - total_loss: 34003.7478
Epoch 2/3
6/6 [==============================] - 1s 106ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.1905 - factorized_top_k/top_10_categorical_accuracy: 0.3515 - factorized_top_k/top_50_categorical_accuracy: 0.9714 - factorized_top_k/top_100_categorical_accuracy: 1.0000 - loss: 32277.0315 - regularization_loss: 0.0000e+00 - total_loss: 32277.0315
Epoch 3/3
6/6 [==============================] - 1s 109ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0352 - factorized_top_k/top_5_categorical_accuracy: 0.6068 - factorized_top_k/top_10_categorical_accuracy: 0.7307 - factorized_top_k/top_50_categorical_accuracy: 0.9264 - factorized_top_k/top_100_categorical_accuracy: 1.0000 - loss: 31529.4378 - regularization_loss: 0.0000e+00 - total_loss: 31529.4378

I do now get a set of recommendations in a list of 10 where each recommendation is different from the others, e.g. ['food', 'swimming', 'flowers', ...] (which is an improvement over my original, where it was like ['food', 'food', 'food', ...]), but even with a variety of interests recommended, each individual user gets the same recommendations in the same order ...

tansaku commented 3 years ago

I wonder if my data just doesn't have the right distribution to work with this approach ...

GaetanDu commented 3 years ago

Hello, are you sure it is a bad recommendation? I mean, I also get the same recommendations when I try a new user_id; do you? I suppose this is due to new users all having the same embedding.

tansaku commented 3 years ago

it seems strange that they should all be identical. Each user has a different combination of interests, so I would have thought that on that basis there should be some variation between users.

My suspicion is that this approach (for this number of epochs, learning rate and network size) relies on there being a sufficient number of items "rated" by each user. When we reduce to a much smaller movielens dataset, where each user has only rated a single movie, we get the same behaviour, i.e. the same set of recommendations for all users.

williamberrios commented 3 years ago

Hi @tansaku, I am having a similar problem. It happens that even though I am overfitting to my validation and training data, my predictions are the same for all users. I see in your output that you have the same problem. Did you solve it? I will appreciate your opinion. Maybe @maciejkula can give us his thoughts on this :)

tansaku commented 3 years ago

hi @williamberrios thanks for reaching out. I have not solved it yet. At the moment I am trying to visualize the weights via tensorboard, and trying to work out if there is some setting of the hyperparameters or a smaller network size that can fix it.

I did get improvements from following @GaetanDu's suggestions, and it's interesting that it worked for him on his dataset. I suspect something about the nature of our data distributions is causing this. Can you share anything about the distribution of your data?

I did also reach out to @maciejkula to ask for his input - it would be great to better understand what about the data and network architecture allows this to work in some cases and not others.

maciejkula commented 3 years ago

Lots of things could be going wrong here.

  1. When constructing the candidates dataset you pass to the FactorizedTopK layers, you must include each unique candidate once.
  2. Make sure your vocabularies are working correctly. Are you sure that all of your users are not mapped to the same OOV bucket?
  3. Are there numerical problems? Is the model over-regularized, and all embeddings tend to zero?

williamberrios commented 3 years ago

hi @tansaku, in my case I'm also following the tutorials with my own dataset (basically pretty similar to movies). The pattern I have seen is that when my network doesn't overfit, it produces different scores and recommendations for all users. However, when it overfits on the training and validation sets, it produces equal scores and predictions for all users, which I think is counterintuitive, because if I'm overfitting it should give me the same predictions as my original dataset. I could be wrong though.

tansaku commented 2 years ago

hi @williamberrios I'm not sure what's intuitive or not at this point :-) I know that when I massively reduce the size of the MovieLens dataset I get increasingly repetitive output from the recommender. At very low levels every single recommendation is the same; as I increase the data amount the output becomes more varied. With other things held equal, that would correspond to overfitting.

I tried training my data on just one epoch, but still the same problem. How are you measuring degree of overfit?

I'm trying to work out if there is another way to reduce the number of params in the model ...

tansaku commented 2 years ago

@maciejkula - thanks so much for the advice - apologies, I had somehow missed your comment :-)

Taking each of your points in turn:

  1. When constructing the candidates dataset you pass to the FactorizedTopK layers, you must include each unique candidate once.

import random

print("interests are unique:")
print(len(unique_interests) == len(set(unique_interests)))
print(random.choice(tuple(unique_interests)))

interests are unique:
True
b'Beer'

print("users are unique:")
print(len(unique_users) == len(set(unique_users)))
print(random.choice(tuple(unique_users)))

users are unique:
True
b'200001234567'

My data has precisely 75 interests and 4866 users, each of which is unique. There are 24540 mappings, each in the form:

{'interest': <tf.Tensor: shape=(), dtype=string, numpy=b'Books'>, 'user_id': <tf.Tensor: shape=(), dtype=string, numpy=b'990000123456'>}

  2. Make sure your vocabularies are working correctly. Are you sure that all of your users are not mapped to the same OOV bucket?

I don't think so. If I print the vocabularies I see this:

print(user_ids_vocabulary.get_vocabulary())

['[UNK]', '99000012345', '99000012346', '99000012347', '99000012348' ...

print(interests_vocabulary.get_vocabulary())

['[UNK]', 'World traveller', 'Winter sports', 'Wine', ...

I assume this indicates that they are not all in some out-of-vocabulary bucket ... as does this below:

data = tf.constant([["990000123456"]])
print(user_ids_vocabulary(data))

tf.Tensor([[1]], shape=(1, 1), dtype=int64)
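
A fuller check, as a sketch (using the users dataset and user_ids_vocabulary defined earlier in the thread), would be to count how many ids map to the OOV index across the whole dataset:

ids = tf.concat(list(users.batch(1024)), axis=0)
indices = user_ids_vocabulary(ids)
# With mask_token=None, the single OOV bucket sits at index 0.
num_oov = int(tf.reduce_sum(tf.cast(indices == 0, tf.int32)))
print(f"{num_oov} of {len(ids)} user ids map to the OOV bucket")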

please do correct me if other aspects need to be checked

  3. Are there numerical problems? Is the model over-regularized, and all embeddings tend to zero?

I've viewed the embeddings using tensorboard and everything seems nicely spread out. I can't immediately work out a simpler way of viewing a summary of the embeddings. I don't think we have any regularization - you can see the full code earlier in this thread.

user_ids_embedding = tf.keras.layers.Embedding(user_ids_vocabulary.vocab_size(), 16)
tf.print(user_ids_embedding.input_dim)
tf.print(user_ids_embedding.output_dim)
tf.print(user_ids_embedding.embeddings_regularizer)
tf.print(user_ids_embedding.activity_regularizer)
tf.print(user_ids_embedding.embeddings_constraint)

4867
16
None
None
None

interests_embedding = tf.keras.layers.Embedding(interests_vocabulary.vocab_size(), 16)
tf.print(interests_embedding.input_dim)
tf.print(interests_embedding.output_dim)
tf.print(interests_embedding.embeddings_regularizer)
tf.print(interests_embedding.activity_regularizer)
tf.print(interests_embedding.embeddings_constraint)

76
16
None
None
None

I guess our main control on the number of parameters in the model is the dimension of the embedding? I've tried walking this down to as low as 2, but to no effect. I've tried training with a single epoch, and with 100 epochs, but the results are always the same - the same predicted interests for all users ...
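
For reference, a back-of-envelope count of the trainable parameters in the two embedding tables, a sketch using the vocab sizes printed above (which include the OOV token) and the dimension 64 from my earlier code:

dim = 64
user_params = 4867 * dim      # 311,488
interest_params = 76 * dim    # 4,864
print(user_params + interest_params)  # 316,352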

Maybe it's due to some aspect of the distribution of my data, or some other silly mistake in the code ... I feel like I need to run with a very very small sample and print out all the weights/params to better understand what's happening ...

tansaku commented 2 years ago

Some other relevant data: the model weights, which look okay, I think?

[<tensorflow.python.keras.engine.base_layer_utils.TrackableWeightHandler object at 0x7fba012ac6d0>, <tf.Variable 'embedding_9/embeddings:0' shape=(4867, 64) dtype=float32, numpy=
array([[ 0.049, -0.048,  0.036, ...,  0.013,  0.014,  0.009],
       [-1.676, -0.38 , -0.834, ...,  1.884,  1.316,  0.621],
       [ 0.683, -1.057,  2.111, ..., -0.691,  0.627,  0.105],
       ...,
       [-1.034, -2.202,  0.195, ..., -0.514, -0.506,  0.385],
       [ 0.515,  0.14 , -0.024, ..., -0.043, -0.638, -0.865],
       [ 0.303, -0.776, -1.835, ...,  0.462, -0.544, -1.108]],
      dtype=float32)>, <tensorflow.python.keras.engine.base_layer_utils.TrackableWeightHandler object at 0x7fb990ba5c40>, <tf.Variable 'embedding_10/embeddings:0' shape=(76, 64) dtype=float32, numpy=
array([[ 0.019, -0.027,  0.002, ..., -0.029, -0.031, -0.011],
       [ 0.104,  0.024,  0.357, ...,  0.041, -0.37 ,  0.307],
       [-0.247, -0.03 ,  0.221, ..., -0.008, -0.134, -0.028],
       ...,
       [-0.221,  0.01 ,  0.404, ...,  0.05 ,  0.016,  0.507],
       [-0.143, -0.034,  0.2  , ...,  0.196,  0.157,  0.346],
       [-0.187, -0.169,  0.347, ...,  0.557,  0.082, -0.141]],
      dtype=float32)>, <tf.Variable 'counter:0' shape=() dtype=int32, numpy=75>, <tf.Variable 'total:0' shape=() dtype=float32, numpy=2192.0>, <tf.Variable 'count:0' shape=() dtype=float32, numpy=24540.0>, <tf.Variable 'total:0' shape=() dtype=float32, numpy=10504.0>, <tf.Variable 'count:0' shape=() dtype=float32, numpy=24540.0>, <tf.Variable 'total:0' shape=() dtype=float32, numpy=15402.0>, <tf.Variable 'count:0' shape=() dtype=float32, numpy=24540.0>, <tf.Variable 'total:0' shape=() dtype=float32, numpy=23828.0>, <tf.Variable 'count:0' shape=() dtype=float32, numpy=24540.0>, <tf.Variable 'total:0' shape=() dtype=float32, numpy=24540.0>, <tf.Variable 'count:0' shape=() dtype=float32, numpy=24540.0>]

and then the index weights:

[<tensorflow.python.keras.engine.base_layer_utils.TrackableWeightHandler object at 0x7fba012ac6d0>, <tf.Variable 'embedding_9/embeddings:0' shape=(4867, 64) dtype=float32, numpy=
array([[ 0.049, -0.048,  0.036, ...,  0.013,  0.014,  0.009],
       [-1.676, -0.38 , -0.834, ...,  1.884,  1.316,  0.621],
       [ 0.683, -1.057,  2.111, ..., -0.691,  0.627,  0.105],
       ...,
       [-1.034, -2.202,  0.195, ..., -0.514, -0.506,  0.385],
       [ 0.515,  0.14 , -0.024, ..., -0.043, -0.638, -0.865],
       [ 0.303, -0.776, -1.835, ...,  0.462, -0.544, -1.108]],
      dtype=float32)>, <tf.Variable 'identifiers:0' shape=(75,) dtype=string, numpy=
array([b'Design', ... b'Gardening'], dtype=object)>, <tf.Variable 'candidates:0' shape=(75, 64) dtype=float32, numpy=
array([[ 0.046,  0.027,  0.257, ...,  0.331,  0.122,  0.241],
       [ 0.021, -0.103,  0.511, ...,  0.149, -0.179,  0.392],
       [ 0.141, -0.166,  0.565, ...,  0.123,  0.265,  0.401],
       ...,
       [-0.15 , -0.288,  0.189, ..., -0.085, -0.314,  0.24 ],
       [ 0.016,  0.164,  0.022, ..., -0.009, -0.047,  0.256],
       [-0.216,  0.102,  0.379, ...,  0.081,  0.281, -0.013]],
      dtype=float32)>]

it looks like there is sensible variation there - am I just querying the index incorrectly? I guess I should review the underlying code and perhaps run with a very small sample where I can see all the weights ... https://github.com/tensorflow/recommenders/blob/v0.5.2/tensorflow_recommenders/layers/factorized_top_k.py#L428-L553

maciejkula commented 2 years ago

What are your evaluation metrics? Are they good? Bad?

tansaku commented 2 years ago

you mean these?

Epoch 1/3
6/6 [==============================] - 1s 111ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0879 - factorized_top_k/top_5_categorical_accuracy: 0.2595 - factorized_top_k/top_10_categorical_accuracy: 0.3707 - factorized_top_k/top_50_categorical_accuracy: 0.8476 - factorized_top_k/top_100_categorical_accuracy: 1.0000 - loss: 40636.7500 - regularization_loss: 0.0000e+00 - total_loss: 40636.7500
Epoch 2/3
6/6 [==============================] - 1s 109ms/step - factorized_top_k/top_1_categorical_accuracy: 0.1163 - factorized_top_k/top_5_categorical_accuracy: 0.3784 - factorized_top_k/top_10_categorical_accuracy: 0.5273 - factorized_top_k/top_50_categorical_accuracy: 0.9253 - factorized_top_k/top_100_categorical_accuracy: 1.0000 - loss: 32419.3457 - regularization_loss: 0.0000e+00 - total_loss: 32419.3457
Epoch 3/3
6/6 [==============================] - 1s 111ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0893 - factorized_top_k/top_5_categorical_accuracy: 0.4280 - factorized_top_k/top_10_categorical_accuracy: 0.6276 - factorized_top_k/top_50_categorical_accuracy: 0.9710 - factorized_top_k/top_100_categorical_accuracy: 1.0000 - loss: 29677.4032 - regularization_loss: 0.0000e+00 - total_loss: 29677.4032

where we can see the loss dropping from epoch to epoch?

tansaku commented 2 years ago

I'm trying to better understand the precise operations. To that end I have made a version with 3 interests and 2 users from 3 mappings. I gave the embeddings an output dimension of 2, which allows me to print all the weights in both the model and the index.

These are the model weights:

<tensorflow.python.keras.engine.base_layer_utils.TrackableWeightHandler object at 0x7fe999901f70>
<tf.Variable 'embedding_4/embeddings:0' shape=(3, 2) dtype=float32, numpy=
array([[-0.001, -0.037],
       [-0.058,  0.031],
       [-0.019,  0.01 ]], dtype=float32)>, <tensorflow.python.keras.engine.base_layer_utils.TrackableWeightHandler object at 0x7fe9f9358a90>
<tf.Variable 'embedding_5/embeddings:0' shape=(4, 2) dtype=float32, numpy=
array([[-0.026, -0.028],
       [-0.004, -0.012],
       [-0.012,  0.029],
       [-0.03 , -0.02 ]], dtype=float32)>
<tf.Variable 'counter:0' shape=() dtype=int32, numpy=3>
<tf.Variable 'total:0' shape=() dtype=float32, numpy=0.0>
<tf.Variable 'count:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'total:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'count:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'total:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'count:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'total:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'count:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'total:0' shape=() dtype=float32, numpy=3.0>
<tf.Variable 'count:0' shape=() dtype=float32, numpy=3.0>

and these are the index weights:

<tensorflow.python.keras.engine.base_layer_utils.TrackableWeightHandler object at 0x7fe999901f70> <tf.Variable 'embedding_4/embeddings:0' shape=(3, 2) dtype=float32, numpy=
array([[-0.001, -0.037],
       [-0.058,  0.031],
       [-0.019,  0.01 ]], dtype=float32)>
<tf.Variable 'identifiers:0' shape=(3,) dtype=string, numpy=
array([b'Sustainability', b'Dance', b'Books'], dtype=object)>
<tf.Variable 'candidates:0' shape=(3, 2) dtype=float32, numpy=
array([[-0.004, -0.012],
       [-0.012,  0.029],
       [-0.03 , -0.02 ]], dtype=float32)>

As we have been commonly seeing, all recommendations are the same:

Top 3 recommendations for user 1: [[b'Books' b'Dance' b'Sustainability']]
thing: [[ 0.029  0.029 -0.054]]
Top 3 recommendations for user 2: [[b'Books' b'Dance' b'Sustainability']]
thing: [[ 0.029  0.029 -0.054]]

what I'd really like to understand is which matrix operation(s) give us this output [ 0.029 0.029 -0.054]

tansaku commented 2 years ago

I'm looking at the code in layers/factorized_top_k.py, which I think is generating the results:

scores = tf.linalg.matmul(queries, self._candidates, transpose_b=True)
values, indices = tf.math.top_k(scores, k=k)
return values, tf.gather(self._identifiers, indices)

I'm running the matmul in my own notebook like so:

scores = tf.linalg.matmul(model.user_model(np.array(["1"])), index._candidates, transpose_b=True)
print(scores)

which gives

tf.Tensor([[ 0.045 -0.019 -0.033]], shape=(1, 3), dtype=float32)

tansaku commented 2 years ago

okay, this looks very suspicious:

query1 = model.user_model(np.array(["1"]))
print(query1)
query2 = model.user_model(np.array(["2"]))
print(query2)

tf.Tensor([[-0.001 -0.037]], shape=(1, 2), dtype=float32)
tf.Tensor([[-0.001 -0.037]], shape=(1, 2), dtype=float32)

so something must be wrong with the user model?
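
(A quick sanity check at this point, as a sketch: print the vocabulary indices these ids map to. If both come out as 0, they are both hitting the OOV bucket, which would explain the identical embeddings.)

print(user_ids_vocabulary(np.array(["1"])))
print(user_ids_vocabulary(np.array(["2"])))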

maciejkula commented 2 years ago

Please make sure your user ids are looked up in the vocabulary correctly, instead of all mapping to OOV. This is by far the most plausible explanation.

tansaku commented 2 years ago

right, I will dig in further, but I am using identical code to the example (for managing and looking up user ids in the vocabulary), and also when one runs the movielens data on a smaller subset of the 10k we get the same behaviour with the quickstart code (all users get the same recommendations). I wonder if for some smaller datasets or particular distributions the user model fails to break symmetry ...?

When I say something is wrong with the user model I don't mean your underlying code is wrong - more that I'm just not training it correctly, e.g. with insufficient data. If there were a simple fix involving vocabulary lookup that would be great ... I will redouble my efforts to find one.

tansaku commented 2 years ago

okay fixed it - gosh so silly:

thing, titles = index(np.array([user_ids_vocabulary.get_vocabulary()[1]]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[1]}: {titles}")
print(f"thing: {thing}")
thing, titles = index(np.array([user_ids_vocabulary.get_vocabulary()[2]]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[2]}: {titles}")
print(f"thing: {thing}")
thing, titles = index(np.array([user_ids_vocabulary.get_vocabulary()[3]]))
print(f"Top 10 recommendations for user {user_ids_vocabulary.get_vocabulary()[3]}: {titles}")
print(f"thing: {thing}")
Top 10 recommendations for user 990000123458: [[b'Graphic design' b'Public relations' b'Sculpture' b'Live music'
  b'Sustainable' b'Religion' b'Menswear' b'TV' b'Musical' b'Dance']]
thing: [[2.602 2.331 2.126 1.528 1.526 1.361 1.287 1.059 0.997 0.82 ]]
Top 10 recommendations for user 990000123457: [[b'Personal Health' b'Beach holidays' b'Mindfulness' b'Gardening'
  b'Volunteering' b'Outdoor activities' b'Design' b'Mental Health'
  b'Adventure breaks' b'Film']]
thing: [[3.81  3.46  3.331 2.941 2.929 2.679 1.888 1.834 1.709 1.66 ]]
Top 10 recommendations for user 990000123456: [[b'Religion' b'Comedy' b'Musical' b'Live music' b'Investment' b'Couture'
  b'Comics' b'Education' b'Dance' b'Mindfulness']]
thing: [[4.629 4.528 3.428 2.972 2.891 2.792 2.374 2.37  2.327 2.056]]

I had assumed that in the example a query like:

np.array(["1"])

represented the 1st user, when it is actually representing the user with id "1". I just had to look up the correct user ids, e.g.

np.array(["990000123456"])

and we get the expected behaviour ... but this has been very educational in understanding a lot more about the system. I'm still not quite seeing which matrix multiplications are leading to particular outputs, but I'm well on the way.

@GaetanDu your input about the uniqueness in the vocabulary was critical @maciejkula thanks for your input on this and for making the whole framework available

I wonder if it would be worth adding a note to the quickstart docs about the user ids? Or perhaps having the first lookup be for a user id that couldn't be confused with an index?

tansaku commented 2 years ago

put in a tiny PR to highlight the nature of the user id lookup https://github.com/tensorflow/recommenders/pull/347

tansaku commented 2 years ago

and just to summarise my understanding, now that I've run a small, super simple example with embedding dimension k=2 after 3 epochs of training:

The user id embedding maps user ids onto 2 dims, e.g.

[ 0.016, -0.027] ==> UNK
[ 0.454, -0.095] ==> 123456
[-0.44 ,  0.132] ==> 456788

The interest embedding (in my dataset) maps interests onto 2 dims:

[ 0.258, -0.051] ==> Dance
[ 0.262, -0.019] ==> Books
[-0.519,  0.144] ==> Sustainability

Then calculating suggested interests involves selecting a user and doing a matrix multiplication to assess the level of interest in each subject, which can then be ranked ...
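
As a concrete sketch using the numbers quoted above: the score for each interest is just the dot product of the user embedding with that interest's embedding.

import numpy as np

user_123456 = np.array([0.454, -0.095])
candidates = np.array([
    [0.258, -0.051],   # Dance
    [0.262, -0.019],   # Books
    [-0.519, 0.144],   # Sustainability
])
scores = candidates @ user_123456
print(scores)  # approx [0.122, 0.121, -0.249] -> Dance, Books, Sustainability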

The weights get randomly initialized to small values, and after training the weights for the UNK items stay the same, but the other weights have increased by approximately an order of magnitude. In terms of explainability, each embedding dimension represents some feature of the data, so in the above example we might say that user 123456 is 0.454 in terms of feature A and -0.095 in terms of feature B, that Dance and Books are more feature A than B, and vice versa for Sustainability; so user 123456 matches Dance and Books more based on a feature A "connection" ...

if that's roughly correct I want to move on to better understanding the objective function tfrs.tasks.Retrieval ...

tansaku commented 2 years ago

@williamberrios was using indices instead of ids your problem too?

jasonlevine commented 1 year ago

Hi all, reopening this thread because I'm finding that using the basic_ranking tutorial with the movielens100k dataset produces identical movie rankings for every user. The predicted ratings vary from user to user, but the overall order remains the same. I tried with the movielens1m dataset and found that while the rankings now varied slightly, the same 5 movies were predicted to be in every user's top 5. I tried this model with my own dataset of 200k users and 5k items and found the exact same problem. It seems like the model is collapsing to the mean and becoming a variant of a "most popular recommender" that takes ratings into account.

I've tried varying the learning rate, layer regularization, dropout, reducing to one layer, and even removing the activation function to essentially make it a linear model. Regardless, it outputs the same ranking for every user.

However, I tried running all of these datasets through various matrix-factorization-based collaborative filtering algos like Alternating Least Squares and Bayesian Personalized Ranking and found that each user had unique and relevant recommendations, so I don't think it's a dataset problem.

Finally, I am referencing user_ids directly from unique_user_ids so it's not an indices problem.

Any insight is appreciated 🙏 Jason