Open edwardbernays opened 3 years ago
A better split would be across time: train on the observations up to, say, three days ago, and test on the remaining three days.
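As a rough sketch of what a time-based split could look like (plain Python on toy data; the `timestamp` field, the `split_by_time` helper, and the dates are assumptions made up for illustration, not something taken from the guide):

```python
from datetime import datetime, timedelta


def split_by_time(interactions, cutoff):
    """Put events before `cutoff` in train and events at or after it in test."""
    train = [row for row in interactions if row["timestamp"] < cutoff]
    test = [row for row in interactions if row["timestamp"] >= cutoff]
    return train, test


# Toy data (hypothetical): one interaction per day for ten days.
start = datetime(2021, 1, 1)
end = start + timedelta(days=10)
interactions = [
    {
        "user_id": f"u{i % 4}",
        "movie_title": f"m{i % 5}",
        "timestamp": start + timedelta(days=i),
    }
    for i in range(10)
]

# Hold out the final three days for testing.
cutoff = end - timedelta(days=3)
train, test = split_by_time(interactions, cutoff)
```

With a split like this the model is always evaluated on interactions that happened after everything it was trained on, which is closer to how it would be used in production.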
The test you suggest would not work for models that rely only on user/item ID embeddings. If your test set contained only users never seen in the training data, a simple matrix factorization model would be no better than random: how could it learn anything about the test set users if it has never seen them?
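To make that concrete, here is a toy sketch (plain Python, not the guide's model). The embeddings are random stand-ins for learned ones, and the shared out-of-vocabulary vector `oov_user_embedding` is an assumption about how unseen IDs might be handled. Every user missing from the lookup table ends up with the same vector, so the ranking carries no user-specific signal:

```python
import random

rng = random.Random(0)
DIM = 4


def random_vector():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


# Stand-ins for embeddings learned during training (random here).
user_embeddings = {"u1": random_vector(), "u2": random_vector()}
item_embeddings = {name: random_vector() for name in ("m1", "m2", "m3")}

# Assumption: every user ID not seen in training shares this one vector.
oov_user_embedding = random_vector()


def rank_items(user_id):
    """Rank items by dot product between the user vector and each item vector."""
    user_vec = user_embeddings.get(user_id, oov_user_embedding)
    scores = {
        item: sum(u * v for u, v in zip(user_vec, item_vec))
        for item, item_vec in item_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Two different unseen users get exactly the same recommendations.
ranking_a = rank_items("new_user_a")
ranking_b = rank_items("new_user_b")
```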
This is where more generalizable models come in - models that use more general user and item features (rather than just ids) to generalize to unseen users and items.
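A minimal sketch of the idea, again with random stand-ins for learned embeddings. The feature names (`age_bucket`, `occupation`) and the averaging in `user_vector` are hypothetical choices for illustration, not the guide's architecture. The point is only that a user with no ID embedding still gets a distinct representation from their features:

```python
import random

rng = random.Random(0)
DIM = 4


def random_vector():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


# Stand-ins for feature embeddings learned during training (random here).
feature_embeddings = {
    ("age_bucket", "18-24"): random_vector(),
    ("age_bucket", "35-44"): random_vector(),
    ("occupation", "student"): random_vector(),
    ("occupation", "engineer"): random_vector(),
}


def user_vector(features):
    """Average the embeddings of the user's known (feature, value) pairs."""
    vectors = [
        feature_embeddings[pair]
        for pair in features.items()
        if pair in feature_embeddings
    ]
    if not vectors:
        return [0.0] * DIM  # nothing known about this user
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Neither user has an ID embedding, yet their vectors differ.
new_user_a = user_vector({"age_bucket": "18-24", "occupation": "student"})
new_user_b = user_vector({"age_bucket": "35-44", "occupation": "engineer"})
```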
The building deep retrieval models article splits the dataset like this:
The dataset consists of user_id and movie_title pairs. This means the current data splitting approach in the guide can put the same user_id into both the train and test datasets.
Hypothetical Example
Wouldn't the model generalize better if we split by unique user_id instead?
ex:
Train: 80% of user_ids and the corresponding movie_title pairs
Test: The other 20% of user_ids and the corresponding movie_title pairs
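A minimal sketch of this proposal in plain Python on toy data. The `split_by_user` helper, its parameters, and the toy records are hypothetical and not part of the guide:

```python
import random


def split_by_user(interactions, test_fraction=0.2, seed=42):
    """Assign whole users to either train or test, never both."""
    user_ids = sorted({row["user_id"] for row in interactions})
    rng = random.Random(seed)
    rng.shuffle(user_ids)
    n_test = max(1, round(len(user_ids) * test_fraction))
    test_users = set(user_ids[:n_test])
    train = [row for row in interactions if row["user_id"] not in test_users]
    test = [row for row in interactions if row["user_id"] in test_users]
    return train, test


# Toy data (hypothetical): 10 users with 3 movie_title pairs each.
interactions = [
    {"user_id": f"u{u}", "movie_title": f"m{(u + k) % 7}"}
    for u in range(10)
    for k in range(3)
]

train, test = split_by_user(interactions)
train_users = {row["user_id"] for row in train}
test_users = {row["user_id"] for row in test}
```

With this split, no user_id in the test set ever appears in the training set.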