Open nicewenhui opened 11 months ago
I guess hashing has two main benefits. First, new items get mapped to different hash buckets, so not all new items are treated the same. Second, collisions can have a regularizing effect, provided there are not too many of them: rare items usually have sparse data, so dedicating a separate embedding to each of them may overfit.
For user IDs, you can represent a user by the item IDs that user has consumed, which avoids retraining the model for each new user_id. If the catalogue is more stable than the user base, representing the user by selected features of their previous behavior gives you a more stable model.
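To make the idea of representing a user by consumed items concrete, here is a minimal sketch in plain Python. It assumes mean pooling over item embeddings, which is only one possible choice; the table `item_embeddings`, its values, and the helper `user_vector` are hypothetical and not taken from any tutorial.

```python
# Hypothetical toy embedding table: item_id -> vector (values made up for illustration).
item_embeddings = {
    "movie_a": [1.0, 0.0],
    "movie_b": [0.0, 1.0],
    "movie_c": [1.0, 1.0],
}

def user_vector(history, table):
    """Average the embeddings of the items a user has consumed."""
    known = [table[item] for item in history if item in table]
    dim = len(next(iter(table.values())))
    if not known:
        # User with no usable history: fall back to a zero vector.
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

# A brand-new user needs no embedding row of their own, only a history.
new_user = user_vector(["movie_a", "movie_c"], item_embeddings)
print(new_user)  # [1.0, 0.5]
```

With this setup the model only has to know the items, so a user who signed up today can be scored as soon as they have watched something.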
I have developed a retrieval model for personalized movie recommendations. However, in the real world, new users and new content keep arriving. To address this, I have been reading about the benefits of hashed embeddings.
In the tutorial, I found that the hashing layer is included as part of the model architecture. Why does this avoid retraining the model every time? Besides, I don't know how to handle hash collisions or how to choose an appropriate value for the num_bins parameter. In the provided example, even with only 5 inputs and num_bins set to 6, 2 values (['b'], ['c']) were still hashed to the same bin.
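For reference, a rough sketch of the two points in question. The `bucket` function below is a stand-in built on `hashlib` and is not the hash function the Keras layer actually uses; it only illustrates that a stateless hash needs no vocabulary, so an unseen ID still lands in a valid bin. `prob_all_unique` computes the chance that `n_ids` IDs all get distinct bins, assuming the hash spreads IDs uniformly.

```python
import hashlib
from math import perm

def bucket(value: str, num_bins: int) -> int:
    """Stand-in stateless hash: same input and num_bins always give the same bin."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins

def prob_all_unique(n_ids: int, num_bins: int) -> float:
    """Probability that n_ids uniformly hashed IDs occupy n_ids distinct bins."""
    if n_ids > num_bins:
        return 0.0
    return perm(num_bins, n_ids) / num_bins ** n_ids

# An ID never seen during training still maps to a bin in [0, num_bins).
print(bucket("brand_new_user", 6))

# 5 inputs into 6 bins: a collision-free result is the exception, not the rule.
print(round(prob_all_unique(5, 6), 4))  # 0.0926
```

So under the uniformity assumption, seeing two of five values share a bin when num_bins is 6 is the expected outcome about 91% of the time.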
In my real code, for example, I currently have 10,000 user IDs, and each day brings around 1,000 new users. How should I set num_bins to ensure each user gets a unique hashed code? What about counting the total number of users each day and setting num_bins to that day's user count? Would the old users still get the same hashed codes as before?
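A small experiment can help reason about these questions. The sketch below again uses a hypothetical `hashlib`-based `bucket` function as a stand-in, assuming the bin is computed as hash modulo num_bins; the real layer's hash differs, but the qualitative behavior should be similar. The user ID strings are made up.

```python
import hashlib

def bucket(value: str, num_bins: int) -> int:
    """Stand-in stateless hash (not the Keras implementation)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins

def expected_occupied_bins(n_ids: int, num_bins: int) -> float:
    """Expected number of distinct bins hit by n_ids uniformly hashed IDs."""
    return num_bins * (1.0 - (1.0 - 1.0 / num_bins) ** n_ids)

user_ids = [f"user_{i}" for i in range(10_000)]   # hypothetical existing users
day1 = [bucket(u, 10_000) for u in user_ids]      # num_bins = today's user count
day2 = [bucket(u, 11_000) for u in user_ids]      # num_bins grown by 1,000

unchanged = sum(a == b for a, b in zip(day1, day2))
distinct_day1 = len(set(day1))

print(f"users keeping the same code after resizing: {unchanged} of {len(user_ids)}")
print(f"distinct bins used on day 1: {distinct_day1}")
print(f"expected distinct bins: {expected_occupied_bins(10_000, 10_000):.0f}")
```

Under these assumptions, setting num_bins equal to the number of users does not give unique codes (only about 63% of the bins are expected to be occupied), and changing num_bins from day to day moves most existing users to a different bin, which would invalidate their trained embeddings. Choosing one generous num_bins up front and keeping it fixed avoids the second problem; hashing alone cannot guarantee the first.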
Thanks in advance.