Closed daviddavo closed 1 month ago
@cclinet @Qcactus Can you check this? If you cannot find the time to handle this, let me know.
I'm not certain under what circumstances 'user in test' should be used in 'interact_status.' Based on my understanding, 'interact_status' should only be applied to the training dataset. Could you provide me with more information on this?
Basically, to test how a system would work in real-life cold-start conditions.
I've become a bit unclear about the logic here, but I don't believe it's necessary for 'interact_status' to include users from the test set. @loomlike Do you have time to further investigate this issue?
The problem is that self.n_users does include the users in test. The sampling, therefore, also includes these "test users", which might not be available in training.
indices = range(self.n_users)
users = random.sample(indices, batch_size)
Another problem to remember is that the definition of epoch usually depends on the number of elements in the train set (users in this case), and should not change with the number of items in train. That's why my proposed solution in 2023 was wrong. With all of this in mind, we could:
Note: by "remaining users" I mean users in test that are not in training.
- Add a boolean mask of size n_users that tells whether each user is in train or not
- Add a second counter next to n_users, called n_train_users, and give all remaining users an index value greater than or equal to n_train_users, then sample only from range(n_train_users)
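For reference, the first option (the boolean mask) could look roughly like the sketch below. This is not code from the library: `in_train` and `sample_train_users` are hypothetical names, and a plain list stands in for whatever array the model would actually hold.

```python
import random


def sample_train_users(in_train, batch_size):
    """Sample user indices, restricted to users flagged as present in train.

    `in_train` is a hypothetical boolean mask of length n_users.
    """
    # Keep only the indices whose mask entry is True.
    candidates = [idx for idx, flag in enumerate(in_train) if flag]
    if batch_size > len(candidates):
        raise ValueError("batch_size exceeds the number of training users")
    # Sample without replacement, as random.sample(range(n_users), ...) would.
    return random.sample(candidates, batch_size)


# Users 0, 1 and 3 have training interactions; users 2 and 4 only appear in test.
in_train = [True, True, False, True, False]
batch = sample_train_users(in_train, batch_size=2)
```

The mask keeps the existing user indices untouched, at the cost of building the candidate list (or caching it) before sampling.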
About the second option: the code already concatenates the train set first and then the test set, and then drops duplicates keeping only the first occurrence, so every user in train ends up in the first part of user_idx.
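A minimal sketch of why that ordering makes the second option work, using plain lists instead of DataFrames. `build_user_index` and the sample user names are made up for illustration; `dict.fromkeys` plays the role of "concat, then drop duplicates keeping the first".

```python
import random


def build_user_index(train_users, test_users):
    """Map users to contiguous indices, with all training users first."""
    # Train first, then test; dict.fromkeys keeps the first occurrence in order.
    ordered = list(dict.fromkeys(list(train_users) + list(test_users)))
    user2id = {user: idx for idx, user in enumerate(ordered)}
    # Every distinct training user gets an index below n_train_users.
    n_train_users = len(set(train_users))
    n_users = len(ordered)
    return user2id, n_train_users, n_users


train_users = ["ann", "bob", "ann", "cid"]
test_users = ["bob", "dee"]  # "dee" never appears in train
user2id, n_train_users, n_users = build_user_index(train_users, test_users)

# Sample only from the training users: indices 0 .. n_train_users - 1.
users = random.sample(range(n_train_users), 2)
```

Here "dee" gets index 3, which is outside `range(n_train_users)`, so it can never be drawn during training while still having a valid index for scoring.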
Yeah, the second option seems to work fine and I just changed two lines of code
https://gist.github.com/daviddavo/92a79db3d94bc23e8cdb03279475a221
@daviddavo should this issue be closed now?
Yep, I thought it was automatically closed with #2117
Description
ImplicitCF raises an IndexError if a user appears in the test dataset but not in the training dataset.
How do we replicate the issue?
Split a dataset using a method like TimeSeriesSplit or python_chrono_split, so that some users end up only in the test set, i.e.:
len(ImplicitCF.interact_status) < len(ImplicitCF.user_idx)
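To make the condition concrete, here is a toy illustration with a global timestamp cutoff. This is a simplification and not the behavior of any specific splitter; `chrono_split` and the tuples below are invented for the example.

```python
def chrono_split(interactions, cutoff):
    """Split (user, item, timestamp) tuples at a global timestamp cutoff."""
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test


interactions = [
    ("ann", "i1", 1),
    ("bob", "i2", 2),
    ("ann", "i3", 3),
    ("dee", "i1", 4),  # first interaction happens after the cutoff
]
train, test = chrono_split(interactions, cutoff=4)

train_users = {user for user, _, _ in train}
all_users = train_users | {user for user, _, _ in test}
# Users that have a test row but no training row.
cold_start_users = all_users - train_users
```

In this toy data there are fewer users with training interactions than users overall, which is the same situation as the inequality above.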
Expected behavior (i.e. solution)
Raise a meaningful error if the dataset needs to be stratified, or assume that a user who is not in the ImplicitCF.interact_status table has an empty set of items.
Other Comments
Meanwhile, I solved it by using:
This will create the remaining "empty" users
Or just deleting items in test that don't appear in train
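The original snippets are not reproduced here, so the following is only a rough sketch of the two workarounds, not the code from the issue. It assumes rows are dicts with `userID` and `itemID` keys (hypothetical column names), and both helper functions are made up for illustration.

```python
def drop_cold_start_rows(train, test):
    """Remove test rows whose user has no training interactions."""
    train_users = {row["userID"] for row in train}
    return [row for row in test if row["userID"] in train_users]


def items_per_user(train, all_users):
    """Build user -> set of training items, with an empty set for unseen users."""
    interacted = {user: set() for user in all_users}
    for row in train:
        interacted[row["userID"]].add(row["itemID"])
    return interacted


train = [{"userID": "ann", "itemID": "i1"}, {"userID": "bob", "itemID": "i2"}]
test = [{"userID": "bob", "itemID": "i3"}, {"userID": "dee", "itemID": "i1"}]

filtered_test = drop_cold_start_rows(train, test)
interacted = items_per_user(train, all_users={"ann", "bob", "dee"})
```

Filtering changes what the evaluation measures (cold-start users are no longer scored), while the empty-set variant keeps them in the test set.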