tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Question: Is the train/test split in the "Building deep retrieval models" guide appropriate? #246

Open edwardbernays opened 3 years ago

edwardbernays commented 3 years ago

The "Building deep retrieval models" guide splits the dataset like this:

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()

The dataset consists of (user_id, movie_title) pairs, and the shuffle above is applied to individual pairs rather than to users. This means the guide's splitting approach can put the same user_id into both the train and test datasets.

Hypothetical Example

User: John Doe
Movies watched: "Amazing Movie", "Data is Cool"

After the data split, we could see something like this:
Train: John Doe, "Amazing Movie"
Test: John Doe, "Data is Cool"

Wouldn't the model generalize better if we split by unique user_id instead?

For example:
Train: 80% of the user_ids and their corresponding movie_title pairs
Test: the other 20% of the user_ids and their corresponding movie_title pairs
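To make the proposal concrete, here is a minimal sketch of a split by user_id. It works on a small in-memory list rather than the guide's tf.data pipeline, and the sample rows, the `split_by_user` helper, and its parameters are all hypothetical names introduced for illustration.

```python
import random

# Hypothetical (user_id, movie_title) interactions; not the MovieLens data.
ratings = [
    ("u1", "Amazing Movie"),
    ("u1", "Data is Cool"),
    ("u2", "Amazing Movie"),
    ("u3", "Data is Cool"),
    ("u4", "Another Film"),
    ("u5", "Amazing Movie"),
    ("u5", "Another Film"),
]


def split_by_user(interactions, test_fraction=0.2, seed=42):
    """Put every interaction of a given user on exactly one side of the split."""
    # Sort first so the shuffle is reproducible regardless of set ordering.
    user_ids = sorted({user_id for user_id, _ in interactions})
    random.Random(seed).shuffle(user_ids)

    # Hold out at least one user, even for very small datasets.
    n_test = max(1, round(len(user_ids) * test_fraction))
    held_out = set(user_ids[:n_test])

    train = [row for row in interactions if row[0] not in held_out]
    test = [row for row in interactions if row[0] in held_out]
    return train, test


train, test = split_by_user(ratings)
train_users = {user_id for user_id, _ in train}
test_users = {user_id for user_id, _ in test}
```

With this kind of split, `train_users` and `test_users` never overlap, so every test example comes from a user the model has not trained on.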

maciejkula commented 3 years ago

A better split would be across time: train on the observations up to, say, three days ago, and test on the remaining three days.
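A rough sketch of such a time-based split, assuming each interaction carries a timestamp. The sample rows, the `split_by_time` helper, and the choice to measure the three-day window back from the latest observation are assumptions made for illustration, not part of the guide.

```python
from datetime import datetime, timedelta

# Hypothetical (user_id, movie_title, timestamp) rows.
interactions = [
    ("u1", "Amazing Movie", datetime(2021, 3, 1)),
    ("u2", "Amazing Movie", datetime(2021, 3, 4)),
    ("u1", "Data is Cool", datetime(2021, 3, 8)),
    ("u3", "Data is Cool", datetime(2021, 3, 9)),
    ("u2", "Another Film", datetime(2021, 3, 10)),
]


def split_by_time(rows, holdout=timedelta(days=3)):
    """Train on everything up to the cutoff, test on what comes after it."""
    latest = max(timestamp for _, _, timestamp in rows)
    cutoff = latest - holdout
    train = [row for row in rows if row[2] <= cutoff]
    test = [row for row in rows if row[2] > cutoff]
    return train, test


train, test = split_by_time(interactions)
```

Here the same user can appear on both sides, but every test interaction happens strictly after every training interaction, which mirrors how a deployed model only ever sees the past.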

The test you suggest would not work for simple models that use only user and item id embeddings. If your test set contained only users you had never seen in your training data, a simple matrix factorization model would be no better than random: how could it learn anything about the test set users if it has never seen them?
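One way to check how much a given split exposes an id-only model to this problem is to measure the share of test users that never appear in training. The `unseen_user_fraction` helper and the sample rows below are hypothetical and only sketch the idea.

```python
def unseen_user_fraction(train, test):
    """Fraction of distinct test users that have no training interactions."""
    train_users = {user_id for user_id, _ in train}
    test_users = {user_id for user_id, _ in test}
    if not test_users:
        return 0.0
    return len(test_users - train_users) / len(test_users)


# Random split over pairs: John Doe lands on both sides.
random_train = [("John Doe", "Amazing Movie"), ("Jane Roe", "Data is Cool")]
random_test = [("John Doe", "Data is Cool")]

# Split by user_id: the test user is never seen during training.
by_user_train = [("John Doe", "Amazing Movie"), ("John Doe", "Data is Cool")]
by_user_test = [("Jane Roe", "Data is Cool")]

random_fraction = unseen_user_fraction(random_train, random_test)
by_user_fraction = unseen_user_fraction(by_user_train, by_user_test)
```

A fraction of 1.0, as the split by user_id produces by construction, means an id-only model has no learned embedding for any test user.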

This is where more generalizable models come in - models that use more general user and item features (rather than just ids) to generalize to unseen users and items.