Closed by ddofer 6 years ago
It is working correctly, although I can see what could be confusing.
user_based_train_test_split splits the dataset into two sets that are disjoint user-wise: that is, there are no users who have interactions in both sets.
Nevertheless, the dimensionality (that is, the number of users and items) remains the same. This is because factorization models require the number of users and items to be the same in both train and test sets.
This method of splitting the dataset is more appropriate for sequence-based models. For traditional factorization models, if you don't train on a given user, you won't be able to make predictions for them.
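A minimal sketch of the behaviour described above, written in plain Python rather than taken from Spotlight's code. The `split_by_user` helper, the toy interactions, and the 50% test fraction are assumptions made for illustration; Spotlight's own implementation may assign users to the two sets differently.

```python
import random


def split_by_user(interactions, test_fraction=0.5, seed=0):
    """Split (user_id, item_id) pairs so that no user lands in both sets.

    Hypothetical helper for illustration only, not Spotlight's API.
    """
    rng = random.Random(seed)
    # Collect the distinct users and shuffle them reproducibly.
    users = sorted({user for user, _ in interactions})
    rng.shuffle(users)
    # The first chunk of shuffled users goes to the test set.
    n_test = int(len(users) * test_fraction)
    held_out = set(users[:n_test])
    train = [(u, i) for u, i in interactions if u not in held_out]
    test = [(u, i) for u, i in interactions if u in held_out]
    return train, test


# Toy data: 4 users and 3 items.
interactions = [(0, 0), (0, 1), (1, 2), (2, 0), (2, 2), (3, 1)]
# Declared dimensionality: it describes the id space, so it is shared by
# both sets and does not shrink when users are held out.
num_users, num_items = 4, 3

train, test = split_by_user(interactions)
train_users = {u for u, _ in train}
test_users = {u for u, _ in test}

print("users with interactions in both sets:", train_users & test_users)
print("dimensionality of both sets:", (num_users, num_items))
print("active users in train / test:", len(train_users), len(test_users))
```

The interactions are partitioned by user, but `num_users` and `num_items` stay fixed, which is why the printed shape of each split still shows the full user count.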
Got it, thanks!
In reply to Maciej Kula's comment above (Fri, Dec 8, 2017, 12:43 AM).
-- Dan Ofer - דן עופר
I am currently trying, with the latest master branch, to downsample the goodreads dataset by user:
Train is: <Interactions dataset (53425 users x 10001 items x 1793373 interactions)>
The full dataset is: <Interactions dataset (53425 users x 10001 items x 5976479 interactions)>
That is, the split only downsampled the interactions, not the users: both sets still report 53425 users.
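One way to check whether users were actually held out is to count the distinct user ids that have at least one interaction, since the printed shape reports the dimensionality rather than the number of active users. A minimal sketch, assuming the user ids are available as a plain sequence (in Spotlight, the `user_ids` attribute of an `Interactions` object holds them as an array); the `count_active_users` helper and the toy ids are hypothetical:

```python
def count_active_users(user_ids):
    """Number of distinct users that appear in at least one interaction."""
    return len(set(user_ids))


# Toy stand-ins for the user id column of the full and the train sets.
num_users = 5                           # declared dimensionality
full_user_ids = [0, 0, 1, 2, 3, 3, 4]   # one entry per interaction
train_user_ids = [0, 0, 3, 3]           # users 1, 2 and 4 were held out

print("declared users:", num_users)
print("active users in full set:", count_active_users(full_user_ids))
print("active users in train set:", count_active_users(train_user_ids))
```

If the active-user count of the train set is lower than that of the full set while the declared user count is unchanged, the split did remove users.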