Question: Is the Train/Test Split in the "Building deep retrieval models" guide appropriate?

tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

Apache License 2.0

The "Building deep retrieval models" article splits the dataset like this:

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()

The dataset consists of (user_id, movie_title) pairs. Because the guide shuffles and splits at the level of individual pairs, the same user_id can end up in both the train and test datasets.

Hypothetical Example

User: John Doe
Movies watched: "Amazing Movie", "Data is Cool"

After data splitting, we could see something like this:
Train: John Doe, "Amazing Movie"
Test: John Doe, "Data is Cool"
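
To make the concern concrete, here is a minimal sketch of how one might check for users shared between the two splits. It uses plain Python tuples rather than the guide's tf.data pipeline, and the helper name `user_overlap` and the toy data are hypothetical, chosen only to mirror the example above.

```python
def user_overlap(train_pairs, test_pairs):
    """Return the set of user_ids that appear in both splits."""
    train_users = {user_id for user_id, _ in train_pairs}
    test_users = {user_id for user_id, _ in test_pairs}
    return train_users & test_users


# Toy (user_id, movie_title) pairs mirroring the hypothetical example.
train_pairs = [("John Doe", "Amazing Movie")]
test_pairs = [("John Doe", "Data is Cool")]

shared = user_overlap(train_pairs, test_pairs)
print(shared)
```

A non-empty result means at least one user contributes interactions to both train and test.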

Wouldn't the model generalize better if we split by unique user_id instead?

For example:
Train: 80% of user_ids and their corresponding movie_title pairs
Test: the other 20% of user_ids and their corresponding movie_title pairs
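
A user-level split along these lines could be sketched as follows. This is only an illustration under stated assumptions: it works on an in-memory list of (user_id, movie_title) tuples instead of the guide's tf.data dataset, and the function name `split_by_user`, its parameters, and the toy data are hypothetical.

```python
import random


def split_by_user(pairs, train_fraction=0.8, seed=42):
    """Split (user_id, movie_title) pairs so each user lands in one split only."""
    # Sort the unique users first so the seeded shuffle is reproducible.
    users = sorted({user_id for user_id, _ in pairs})
    random.Random(seed).shuffle(users)

    # The first train_fraction of the shuffled users go to train.
    cutoff = int(len(users) * train_fraction)
    train_users = set(users[:cutoff])

    train = [pair for pair in pairs if pair[0] in train_users]
    test = [pair for pair in pairs if pair[0] not in train_users]
    return train, test


# Toy data: 10 hypothetical users with 5 ratings each.
pairs = [(f"user_{i % 10}", f"movie_{i}") for i in range(50)]
train, test = split_by_user(pairs)

train_user_ids = {user_id for user_id, _ in train}
test_user_ids = {user_id for user_id, _ in test}
print(len(train_user_ids), len(test_user_ids))
```

With this kind of split, no user_id appears on both sides, which is the property the question is asking about. Note that the split sizes are then 80/20 in users, not necessarily in rating pairs, since users can have different numbers of ratings.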

Question: Is the Train/Test Split in the "Building deep retrieval models" guide appropriate? #246