[BUG] ImplicitCF failing if not using stratified split

recommenders-team / recommenders

Best Practices on Recommendation Systems

https://recommenders-team.github.io/recommenders/intro.html

MIT License

18.81k stars 3.07k forks source link

[BUG] ImplicitCF failing if not using stratified split #2024

Closed daviddavo closed 1 month ago

daviddavo commented 11 months ago

Description

ImplicitCF raises an IndexError if the user appears in the test dataset but not on the training dataset.

How do we replicate the issue?

Split a dataset using a method like TimeSeriesSplit or python_chrono_split. I.e: len(ImplicitCF.interact_status) < len(ImplicitCF.user_idx)

Expected behavior (i.e. solution)

Raisign a meaningful error if the dataset needs to be stratified, or assuming that if the user is not on the ImplicitCF.interact_status table, it should have the empty set of items.

Other Comments

Meanwhile, I solved it by using:

data.interact_status = data.interact_status.reindex(data.user_idx['userID_idx'])
data.interact_status['userID'] = data.interact_status.index
data.interact_status['itemID_interacted'] = data.interact_status['itemID_interacted'].fillna("").apply(set)

This will create a the remaining "empty" users

Or just deleting items in test that don't appear in train

loomlike commented 10 months ago

@cclinet @Qcactus Can you check this? If you cannot find a time to handle this, let me know.

cclinet commented 10 months ago

I'm not certain under what circumstances 'user in test' should be used in 'interact_status.' Based on my understanding, 'interact_status' should only be applied to the training dataset. Could you provide me with more information on this?

daviddavo commented 10 months ago

Basically, to test how a system would work in real-life cold-start conditions.

cclinet commented 10 months ago

I've become a bit unclear about the logic here, but I don't believe it's necessary for 'interact_status' to include users from the test set. @loomlike Do you have time to further investigate this issue?

daviddavo commented 2 months ago

The problem is that self.n_user does include the users in test. The sampling, therefore, also includes these "test users" that might not be available in training.

  indices = range(self.n_users)
  users = random.sample(indices, batch_size)

Another problem to remember is that the definition of epoch usually depends on the number of elements in the train set (users in this case), and should not change with the number of items in train. That's why my proposed solution in 2023 was wrong. With all of this in mind, we could:

Note: With remaining users I refer to users in test that are not in training

Add a boolean mask of size n_users that tells whether each user is in train or not

Add a second n_users called n_train_users, and make all remaining users have an index value greater than n_train_users, then sample only from 0 to n_train_users

About the second option, it already concats first the train and then the test, and then drops the duplicates keeping just the first one, so every user in train will be in the first part of the user_idx.

daviddavo commented 2 months ago

Yeah, the second option seems to work fine and I just changed two lines of code

https://gist.github.com/daviddavo/92a79db3d94bc23e8cdb03279475a221

anargyri commented 1 month ago

@daviddavo should this issue be closed now?

daviddavo commented 1 month ago

Yep, I thought it was automatically closed with #2117