NCF - How data generated for train and test?

hahmad2008 commented 3 years ago

Hi, For NCF, there are two points I want to ask:

Split train, test data, as I see in this repo: they do split as follows:
- For each user, we held out his/her latest interaction as the test set and utilized the remaining data for training. We use python_chrono_split to achieve this
- So the split is based on each user interaction, not random for all users.
In preparing the dataset, negative and positive samples are generated, so what happens in the NFC in libRecommender?

Thank you in advance

massquantity commented 3 years ago

There are multiple ways of splitting data in this library, and you can find some description in User Guide. The python_chrono_split you mentioned is equivalent to split_by_ratio_chrono in LibRecommender.

You can choose generating negative samples or not in LibRecommender by simply call data.build_negative_samples(data_info). Many examples can demonstrate how to use it.

hahmad2008 commented 3 years ago

@massquantity Thank you I will check and get back to you. I am just asking because when I run split function from both librecommender and the other repo. it took a long time on the other repo not like in the librecommender (it is very fast).

hahmad2008 commented 3 years ago

@massquantity I have the following notes:

1- There is a parameter for negative samples in NCF: num_neg, so this parameter used for generating negative samples for training, then we don't have to use data.build_negative_samples(data_info) for the training data, right?

I only need to generate negative samples for test data data.build_negative_samples(data_info) , right?

2- For Recommender, they used the following metrics which are best fit for any recommender,

Mean Average Precision (map)
precision
recall However, in rating algorithms in libRecommender, we only stick with these metrics for fit and evaluate the recommender (rmse, mae, r2).

massquantity commented 3 years ago

No, you typically need to generate negative samples for both training and test data. The reason of generating negative samples for test data is evaluating. One can't get a meaningful evaluation result without negative sampling.
Mean Average Precision, precision and recall are ranking metrics, so they are only suitable for ranking tasks. For rating tasks, the suitable metrics are rmse, mae and r2. If you want to do rating task, you shouldn't use the code in Recommender repo.
Since you mentioned that the data split function is slow in Recommender repo, I've looked into their code. It is a known issue that in most cases pandas is much slower than numpy, and they basically use pure pandas functions to split data. In contrast, LibRecommender always tries to use numpy for high speed.

hahmad2008 commented 3 years ago

@massquantity Thank you for your answer. For NCF in Recommender, I can't initiate NCF instance data = NCFDataset(train=train, test=test, seed=SEED), it is not only slow but it is also memory consuming for me for 1.6 millions user

Maybe they generate negative samples for train, test using these parameters, and use pandas instead of NumPy. n_neg=4, n_neg_test=100,

To get the same result as in Recommender for NCF, So I need to generate negative training and testing samples using data.build_negative_samples(data_info) before calling fit model, right? and using the same numbers they used.?

hahmad2008 commented 3 years ago

@massquantity For loss, we can also evaluate the recommendation using MAP as we do in ranking. so we don't only predict the same seen items for the user with different rates, but also new items which the user doesn't see.

massquantity commented 3 years ago

Your data has 1.6 million users or 1.6 million records? That makes quite a difference. If you have 1.6 million users, the whole data may be large, easily exceeding 100 million records, I suppose.

hahmad2008 commented 3 years ago

@massquantity sorry I meant 1.6. million records from data-frame, including:

Number of users: 82k
Number of items: 32k So it can't initiate the NCF dataset from this dataframe. it needs more than 100G memory. Is that normal? due to only using panadas instead of numpy?

massquantity commented 3 years ago

Did you get all the column names right? If the column names are correct, then I guess the problem lies in their way of processing data. I think the problem comes from line 146 - 155 in NCF dataset.

self.item_pool = set(self.train[self.col_item].unique())        
self.interact_status = (self.train.groupby(self.col_user)[self.col_item]
    .apply(set)           
    .reset_index()           
    .rename(columns={self.col_item: self.col_item + "_interacted"})
)
self.interact_status[self.col_item + "_negative"] = self.interact_status[
    self.col_item + "_interacted"
].apply(lambda x: self.item_pool - x)

item_pool contains all of the items, which in your case is 32k. So based on the code, they assign all the items for every user, and that's a matrix of 82k x 32k, not to mention other features. You can try yourself to see how much memory does it cost by calling np.zeros((82000, 32000)).

massquantity commented 3 years ago

Besides, they use set instead of numpy.array for all the items of every user, and set costs way more memory than the compact numpy.array. Also be aware that in the notebook they use ml-100k dataset, which has only 943 users and 1682 items.

hahmad2008 commented 3 years ago

@massquantity Thank you for this explanation. But is reindexing user/item and generate negative samples needs to generate the whole n x m matrix! Weird! this reminds me of traditional memory-based collaborative filtering where it needs to build the n x m matrix.

For NCF in LibRecommender, is it the same NCF in Recommender repo? in terms of algorithm and how to initiate dataset with negative sampling and python_chrono_split splitting?

Because define NCF dataset in LibRecommender is really more memory efficient than the Recommender repo.

hahmad2008 commented 3 years ago

@massquantity I think the difference in evaluation metric used, for Recommender repo consider the problem as ranking so it used MAP as a loss. but for LibRecommender used MSE as is a regular regression problem.

massquantity commented 3 years ago

Reindexing and negative samples are fine. The real problem is the last line apply(lambda x: self.item_pool - x) . item_pool contains all the items, and item_pool - x means remove items a user has previously interacted. Considering a user consumes only about 20 items in a typical data, then they store nearly all the item for every user.
In LibRecommender, all the items are stored only once instead of for every user, so their memory usage is 82k larger than that in LibRecommender.

massquantity commented 3 years ago

I don't have time to look into every single line of their code, but I think the implementations are roughly the same. You can also deal with ranking problem and set MAP as metrics in LibRecommender. Just set the task="ranking" and metrics=["map"].

hahmad2008 commented 3 years ago

Thank you @massquantity for your time. I will check that

hahmad2008 commented 3 years ago

@massquantity How can I check the split_by_ratio_chrono after splitting? for example how to get items and labels for a specific user. from the original data we can take it like this: data[data['user']==id] . we can't do that for train_data and eval_data

hahmad2008 commented 3 years ago

@massquantity If I convert the users/ items ids as following, can I guarantee that model.predict(user=id) id in data is the same as id for mode prediction and recommendation?

items = list(df.item.unique())
users = list(df.user.unique())

user_dict = {item:ind for ind,item in enumerate(users)}
items_dict = {item:ind for ind,item in enumerate(items)}

data=df.copy()
data['user'] = data['user'].map(user_dict)
data['item'] = data['item'].map(items_dict)

massquantity commented 3 years ago

After calling train_data, eval_data = split_by_ratio_chrono, train_data and eval_data are all DataFrames just like the original data. So I don't get it, why you can't get items and labels for a user.

massquantity commented 3 years ago

Well, I don't recommend you doing this id-mapping thing. To ensure smooth code running for users, LibRecommender applies a couple of special processings.

The source code of id-mapping is something like this:

unique_users = np.sort(df.user.unique())
unique_items = np.sort(df.item.unique())
df["user"] = np.searchsorted(unique_users, df.user)
df["item"] = np.searchsorted(unique_items, df.item)

np.searchsorted uses binary search to find indices, which has O(lgN) time complexity. You use a dict to do directing mapping, then the complexity is O(N). This doesn't make much difference for your 1.6 million data, but for bigger data it may slow down the whole process.

Another problem is the users and items that only appear in eval_data or test_data. For these users and items, you can't do direct prediction because they are not trained. To deal with this problem, they are all mapped into a same id and are treated as cold-start users/items in LibRecommender. So the way you do the id-mapping will not get what you want.

Finally, if you want to use the mapped id instead of the originl user/item id to do prediction, you should pass the argument inner_id=True.

>>> model.predict(..., inner_id=True)

From my experiences, this kind of id-mapping thing is tedious and error-prone. As long as your mapped ids are all in the allowed range, no error or exception will occur, and this is dangerous. Because even if you didn't do it right, it's difficult to identify the problem immediately. So I try my best to encapsulate this process into the pipeline in LibRecommender, then there is no need for users to worry about this kind of thing.

hahmad2008 commented 3 years ago

Thanks @massquantity So after converting the datafame user id into these mapped ones, I can guarantee that the same user id value is the same when I do prediction using the trained model.

unique_users = np.sort(df.user.unique()) unique_items = np.sort(df.item.unique()) df["user"] = np.searchsorted(unique_users, df.user) df["item"] = np.searchsorted(unique_items, df.item)

get prediction for user id =10 (as it is in df['user']): model.recommend_user(user=10)

Regarding this point:

Another problem is the users and items that only appear in eval_data or test_data If the user does not appear in trained data, it will be dealt with as a cold start situation?

massquantity commented 3 years ago

For example, suppose the train_data has original item ids: 1, 3, 5, 7, 9, and the test_data has original item ids: 2, 3. In LibRecommender, item 2 will be excluded at first, since it didn't appear in train data. The mapping will become: 1 -> 0, 3 -> 1, 5 -> 2, 7 -> 3, 9 -> 4. Finally the item 2 will be mapped into the last index + 1, i.e. 2 -> 5.

However, if you do the id-mapping using the whole data, the mapping will become 1 -> 0, 2 -> 1, 3 -> 2, 5 -> 3, 7 -> 4, 9 -> 5, which is totally different from the mapping in LibRecommender. Although for now I don't think this is wrong, it's just that different ways of id-mapping are not compatible. So you still can't get that guarantee.

hahmad2008 commented 3 years ago

Thanks, @massquantity So for mapping the original data ids with the LibRecommender model, I need to consider only the ids in the train_data, is that right?

If so how to remap these ids to use the original ids in the prediction.

As an example, you supposed the mapping in the train data, but I can't export / map these ids with the original ids.

we need to do the following code to only the train_data ids??

unique_users = np.sort(df.user.unique()) 
unique_items = np.sort(df.item.unique())
df["user"] = np.searchsorted(unique_users, df.user) 
df["item"] = np.searchsorted(unique_items, df.item)

massquantity commented 3 years ago

The mappings for train_data are all stored in data_info.

train_data, data_info = DatasetFeat.build_trainset(train_data, user_col, item_col, sparse_col, dense_col).

data_info.user2id is a dict, which maps original user ids to mapped ids, and data_info.item2id has similar meaning. data_info.id2user maps mapped ids to orginal user ids.

Remember that by default model.predict and model.recommender_user all use the original id. If you want to use mapped ids, set inner_id=True.

massquantity / LibRecommender

NCF - How data generated for train and test? #61