maciejkula / spotlight

Deep recommender models using PyTorch.
MIT License
2.99k stars 423 forks source link

user_based_train_test_split broken? #77

Closed ddofer closed 6 years ago

ddofer commented 6 years ago

Currently trying with latest master branch to downsample the goodreads dataset by user:

from spotlight.cross_validation import random_train_test_split, user_based_train_test_split
from spotlight.datasets.goodbooks import get_goodbooks_dataset
dataset = get_goodbooks_dataset()

dataset.ratings = dataset.ratings.astype(np.float32)
train, test = spotlight.cross_validation.user_based_train_test_split(dataset,test_percentage=0.7)
print(train)

Train is: <Interactions dataset (53425 users x 10001 items x 1793373 interactions)> The full dataset is: <Interactions dataset (53425 users x 10001 items x 5976479 interactions)>

i.e the sample only downsampled interactions, not users.

maciejkula commented 6 years ago

It is working correctly, although I can see what could be confusing.

user_based_test_split splits the dataset into two sets that are disjoint user-wise: that is, there are no users who have interactions in both sets.

Nevertheless, the dimensionality (that is, the number of users and items) remains the same. This is because factorization models require the number of users and items to be the same in both train and test sets.

This method of splitting the dataset is more appropriate for sequence-based models. For traditional factorization models, if you don't train on a given user, you won't be able to make predictions for them.

ddofer commented 6 years ago

Got it, thanks!

On Fri, Dec 8, 2017 at 12:43 AM, Maciej Kula notifications@github.com wrote:

It is working correctly, although I can see what could be confusing.

user_based_test_split splits the dataset into two sets that are disjoint user-wise: that is, there are no users who have interactions in both sets.

Nevertheless, the dimensionality (that is, the number of users and items) remains the same. This is because factorization models require the number of users and items to be the same in both train and test sets.

This method of splitting the dataset is more appropriate for sequence-based models. For traditional factorization models, if you don't train on a given user, you won't be able to make predictions for them.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/maciejkula/spotlight/issues/77#issuecomment-350117452, or mute the thread https://github.com/notifications/unsubscribe-auth/AE4hg_X9FJducvhG97ZaJebOWsq7LPrfks5s-GodgaJpZM4Q5PIu .

-- Dan Ofer - דן עופר Publications http://scholar.google.co.il/citations?hl=en&user=uDx2ItYAAAAJ

Photography http://picasaweb.google.com/ddofer http://500px.com/DanOfer