tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Incremental retrieval model training with Hashing method #703

Open nicewenhui opened 11 months ago

nicewenhui commented 11 months ago

I have developed a retrieval model for personalized movie recommendations. In the real world, however, new users and new content keep appearing. To address this, I have been reading about the benefits of hashed embeddings.

In the tutorial, the hashing layer is part of the model architecture. Why does this avoid retraining the model every time? I also don't know how to handle hash collisions or how to choose an appropriate value for the num_bins parameter. In the example below, even with only 5 inputs and num_bins set to 6, two values (['b'] and ['c']) are hashed to the same bin.

import tensorflow as tf

layer = tf.keras.layers.Hashing(num_bins=6)
inp = [['a'], ['b'], ['c'], ['d'], ['e']]
layer(inp)

<tf.Tensor: shape=(5, 1), dtype=int64, numpy=
array([[3],
       [4],
       [4],
       [5],
       [1]])>

In my real code, for example, I already have 10,000 user_ids, and around 1,000 new users arrive each day. How should I set num_bins so that each user gets a unique hashed code? What about counting the total number of users each day and setting num_bins to that day's count? Would existing users keep the same hashed codes as before?

Thanks in advance.
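
On the question of whether existing users keep their codes: a hash-then-modulo scheme is deterministic for a fixed num_bins, but changing num_bins changes the modulus and therefore most bin assignments. The sketch below illustrates this with a stand-in hash (SHA-256 from the standard library). This is an assumption for illustration only; the Keras Hashing layer uses its own hash function, so the actual bin numbers will differ. The `bucket` helper and the user ID format are hypothetical.

```python
import hashlib


def bucket(key: str, num_bins: int) -> int:
    """Map a string ID to a bin with a deterministic hash.

    Stand-in for a hashing layer, not the Keras implementation.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


# Hypothetical existing user IDs.
old_users = [f"user_{i}" for i in range(10_000)]

# Same num_bins on two different days: every existing user keeps its bin.
day1 = {u: bucket(u, 100_000) for u in old_users}
day2 = {u: bucket(u, 100_000) for u in old_users}
unchanged_fixed = sum(day1[u] == day2[u] for u in old_users)

# num_bins resized to the day's user count: most existing users move,
# so the embeddings learned for the old bins no longer line up.
resized = {u: bucket(u, 11_000) for u in old_users}
moved = sum(day1[u] != resized[u] for u in old_users)

print(f"unchanged with fixed num_bins: {unchanged_fixed}")
print(f"moved after resizing num_bins: {moved}")
```

Under these assumptions, resizing num_bins each day would defeat the purpose of hashing. Choosing one num_bins up front, with headroom for growth, keeps existing assignments stable. Note also that hashing never guarantees a unique code per user, whatever num_bins is.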

OmarMAmin commented 8 months ago

I think hashing has two main benefits. First, new items get mapped to different hash buckets, so not all new items are treated the same. Second, collisions can have a regularizing effect, provided there are not too many of them: data for rare items is usually sparse, so dedicating a separate embedding to each of them may overfit.
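
To get a feel for how many collisions a given num_bins implies, a rough estimate can be made by assuming the hash spreads keys uniformly at random over the bins (the classic birthday-problem setting). This is an idealization, not a statement about any particular hash function, and the helper names below are hypothetical.

```python
def prob_no_collision(n_keys: int, num_bins: int) -> float:
    """Probability that n_keys uniformly hashed keys all land in distinct bins."""
    if n_keys > num_bins:
        return 0.0  # pigeonhole: a collision is certain
    p = 1.0
    for i in range(n_keys):
        p *= (num_bins - i) / num_bins
    return p


def expected_colliding_keys(n_keys: int, num_bins: int) -> float:
    """Expected number of keys that share a bin with at least one other key."""
    return n_keys * (1.0 - (1.0 - 1.0 / num_bins) ** (n_keys - 1))


# The example from the question: 5 inputs, 6 bins.
print(1.0 - prob_no_collision(5, 6))  # about 0.907, so a collision is the likely outcome

# 11,000 users hashed into 100,000 bins (illustrative numbers).
print(expected_colliding_keys(11_000, 100_000))
```

Under this model, the collision between ['b'] and ['c'] in the 5-input example is expected rather than surprising. The number of bins needed to make collisions rare grows much faster than the number of keys, so in practice num_bins is chosen to keep the collision rate tolerable rather than zero.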

OmarMAmin commented 8 months ago

For user_ids, you can represent a user by the item_ids the user has consumed, which avoids retraining the model for each new user_id. If the catalogue is more stable than the user base, representing the user by selected features of their previous behavior gives you a more stable model.
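
One simple way to read this suggestion is to build the user vector by averaging the embeddings of the items in the user's history. The following is a minimal plain-Python sketch of that idea, not the TFRS API: the `item_embeddings` table, its values, and the `user_vector` helper are all made up for illustration. In a real Keras model the same thing would more likely be an item-ID embedding followed by a pooling layer.

```python
from typing import Dict, List

EMBEDDING_DIM = 2

# Hypothetical trained item embeddings (values invented for this example).
item_embeddings: Dict[str, List[float]] = {
    "movie_1": [1.0, 0.0],
    "movie_2": [0.0, 1.0],
    "movie_3": [1.0, 1.0],
}


def user_vector(history: List[str]) -> List[float]:
    """Average the embeddings of the known items a user has consumed."""
    known = [item_embeddings[i] for i in history if i in item_embeddings]
    if not known:
        # No usable history: fall back to a zero vector.
        return [0.0] * EMBEDDING_DIM
    return [sum(v[d] for v in known) / len(known) for d in range(EMBEDDING_DIM)]


# A brand-new user gets a representation without any user_id embedding.
print(user_vector(["movie_1", "movie_3"]))  # [1.0, 0.5]
```

With this design, a new user needs no retraining as long as the items they interact with are already in the catalogue. Only new items require new embeddings.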