recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[ASK] In the NCF Deep Dive (ncf_movielens) notebook, I used my own dataset instead of MovieLens; it has userID, itemID and rating (I used counts as ratings, like implicit data). The notebook throws the following error; could someone help me out with this problem? #816

Closed karthikraja95 closed 5 years ago

karthikraja95 commented 5 years ago

Other Comments

Data set looks like this

   rating  userID  itemID
0      12    3468    3644
1       3    3816    3959
2       1    2758    2650
3       1    5056    1593
4      30    3029     192

When I run this cell in the notebook I got the following error

data = NCFDataset(train=train, test=test, seed=SEED)

Error:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 SEED = 10
----> 2 data = NCFDataset(train=train, test=test, seed=SEED)

~/Recommenders/reco_utils/recommender/ncf/dataset.py in __init__(self, train, test, n_neg, n_neg_test, col_user, col_item, col_rating, binary, seed)
     59 # initialize negative sampling for training and test data
     60 self._init_train_data()
---> 61 self._init_test_data()
     62 # set random seed
     63 random.seed(seed)

~/Recommenders/reco_utils/recommender/ncf/dataset.py in _init_test_data(self)
    183 test_interact_status = pd.merge(test_interact_status, self.interact_status, on=self.col_user, how="left")
    184
--> 185 test_interact_status[self.col_item + "_negative"] = test_interact_status.apply(lambda row: row[self.col_item + "_negative"] - row[self.col_item + "_interacted_test"], axis=1)
    186 test_ratings = pd.merge(self.test, test_interact_status[[self.col_user, self.col_item + "_negative"]], on=self.col_user, how="left")

... (pandas DataFrame.apply internals omitted) ...

TypeError: ("unsupported operand type(s) for -: 'float' and 'set'", 'occurred at index 854')
miguelgfierro commented 5 years ago

have you checked whether the types in your dataset and the types in movielens are the same?

FYI @AaronHeee
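A quick way to make that comparison is to put the dtypes of both frames side by side. This is a sketch with toy frames (hypothetical data; `my_df` and `ml_df` stand in for the custom dataset and MovieLens):

```python
import pandas as pd

# Toy frames standing in for the custom dataset and MovieLens
# (hypothetical data, for illustration only).
my_df = pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [3, 5]})
ml_df = pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [3.0, 5.0]})

# Put the dtypes side by side; any mismatch shows up immediately.
comparison = pd.concat([my_df.dtypes, ml_df.dtypes], axis=1,
                       keys=["mine", "movielens"])
print(comparison)

# Cast the custom frame to match before feeding it to NCFDataset.
my_df = my_df.astype({"userID": "int64", "itemID": "int64", "rating": "float64"})
print((my_df.dtypes == ml_df.dtypes).all())  # True
```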

karthikraja95 commented 5 years ago

Hey @miguelgfierro, @AaronHeee

Thanks for the suggestion.

I checked the types in my dataset and they are not the same as in the MovieLens dataset. I changed the types to match MovieLens (userID: numpy.int64, itemID: numpy.int64, rating: numpy.float64) and ran the same code again, but the error is still the same.

Also, I commented out the timestamp-related code in the repo files, since my dataset has no timestamp column.

Now the dataset looks like this:

   rating  userID  itemID
0    12.0    3468    3644
1     3.0    3816    3959
2     1.0    2758    2650
3     1.0    5056    1593
4    30.0    3029     192

And the Error:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 SEED = 10
----> 2 data = NCFDataset(train=train, test=test, seed=SEED)

~/Recommenders/reco_utils/recommender/ncf/dataset.py in __init__(self, train, test, n_neg, n_neg_test, col_user, col_item, col_rating, binary, seed)
     59 # initialize negative sampling for training and test data
     60 self._init_train_data()
---> 61 self._init_test_data()
     62 # set random seed
     63 random.seed(seed)

~/Recommenders/reco_utils/recommender/ncf/dataset.py in _init_test_data(self)
    183 test_interact_status = pd.merge(test_interact_status, self.interact_status, on=self.col_user, how="left")
    184 #print('After Merge test_interact_status',test_interact_status)
--> 185 test_interact_status[self.col_item + "_negative"] = test_interact_status.apply(lambda row: row[self.col_item + "_negative"] - row[self.col_item + "_interacted_test"], axis=1)
    186 test_ratings = pd.merge(self.test, test_interact_status[[self.col_user, self.col_item + "_negative"]], on=self.col_user, how="left")

... (pandas DataFrame.apply internals omitted) ...

TypeError: ("unsupported operand type(s) for -: 'float' and 'set'", 'occurred at index 854')
miguelgfierro commented 5 years ago

mmmm

can you try to add a fake timestamp to see if that's the reason why it breaks?
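Adding a placeholder timestamp is a one-liner in pandas. A minimal sketch with a hypothetical toy frame:

```python
import pandas as pd

# Toy interaction frame with no timestamp column (hypothetical data).
df = pd.DataFrame({"userID": [1, 2, 3],
                   "itemID": [10, 20, 30],
                   "rating": [12.0, 3.0, 1.0]})

# Add a constant placeholder timestamp so the schema matches MovieLens.
df["timestamp"] = 0

print(df.columns.tolist())  # ['userID', 'itemID', 'rating', 'timestamp']
```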

karthikraja95 commented 5 years ago

@miguelgfierro, @AaronHeee

I don't think that's the reason, although I tested it as you suggested. Instead of adding a timestamp to my dataset, I dropped the MovieLens timestamp column to make it similar to mine, and ran the code. It works properly for the MovieLens dataset.

Also, I re-cloned the repo and didn't modify any code.

MovieLens dataset from the notebook:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
userID    100000 non-null int64
itemID    100000 non-null int64
rating    100000 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB

train_ml and test_ml are the train and test splits of the MovieLens dataset, produced with the python_random_split function implemented in the repo.

When I run this

data = NCFDataset(train=train_ml, test=test_ml, seed=SEED) ----- NO ERROR

My Dataset:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 3 columns):
rating    100000 non-null float64
userID    100000 non-null int64
itemID    100000 non-null int64
dtypes: float64(1), int64(2)
memory usage: 3.1 MB

train_wol and test_wol are the train and test splits of my dataset, produced with the same python_random_split function.

When I run this

data = NCFDataset(train=train_wol, test=test_wol, seed=SEED) ----- ERROR

I really don't know what's wrong here.

Traceback:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 data = NCFDataset(train=train_wol, test=test_wol, seed=SEED)

~/Desktop/Recommenders/reco_utils/recommender/ncf/dataset.py in __init__(self, train, test, n_neg, n_neg_test, col_user, col_item, col_rating, col_timestamp, binary, seed)
     59 # initialize negative sampling for training and test data
     60 self._init_train_data()
---> 61 self._init_test_data()
     62 # set random seed
     63 random.seed(seed)

~/Desktop/Recommenders/reco_utils/recommender/ncf/dataset.py in _init_test_data(self)
    188 ] = test_interact_status.apply(
    189     lambda row: row[self.col_item + "_negative"]
--> 190     - row[self.col_item + "_interacted_test"],
    191     axis=1,
    192 )

... (pandas DataFrame.apply internals omitted) ...

TypeError: ("unsupported operand type(s) for -: 'float' and 'set'", 'occurred at index 10404')
gramhagen commented 5 years ago

I believe this happens if the data has duplicate rating entries for the same user-item pairing. You can check this with:

train.groupby(['userID', 'itemID'])['rating'].count().max() == 1

might be relevant for test as well?
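If that check comes back False, duplicates can be inspected and dropped before building the dataset. A minimal sketch with a hypothetical toy frame:

```python
import pandas as pd

# Toy frame with one duplicated user-item pair (hypothetical data).
train = pd.DataFrame({"userID": [1, 1, 2],
                      "itemID": [10, 10, 20],
                      "rating": [3.0, 4.0, 5.0]})

# True only when every user-item pair appears at most once.
no_dupes = train.groupby(["userID", "itemID"])["rating"].count().max() == 1
print(no_dupes)  # False

# Inspect the offending pairs, then keep one row per pair.
dupes = train[train.duplicated(subset=["userID", "itemID"], keep=False)]
deduped = train.drop_duplicates(subset=["userID", "itemID"], keep="last")
print(len(deduped))  # 2
```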

karthikraja95 commented 5 years ago

@gramhagen, @miguelgfierro, @AaronHeee

Thanks for the suggestion. I checked with the snippet you shared. It's True in all cases, for MovieLens (train_ml, test_ml) and for my dataset (train_wol, test_wol) as well:

train_ml.groupby(['userID', 'itemID'])['rating'].count().max() == 1   --- True
test_ml.groupby(['userID', 'itemID'])['rating'].count().max() == 1    --- True
train_wol.groupby(['userID', 'itemID'])['rating'].count().max() == 1  --- True
test_wol.groupby(['userID', 'itemID'])['rating'].count().max() == 1   --- True

How can it work for the MovieLens dataset and not mine? I'm confused.

gramhagen commented 5 years ago

I think the other thing that can cause problems is if the userIDs are not common across train and test. You might check what the following looks like, to make sure you have the same number of users:

print(len(train.userID.unique()))
print(len(test.userID.unique()))

Using python_stratified_split could be a better way to handle this; splitting randomly often leads to these kinds of discrepancies, which cause problems for the algorithms / evaluations.
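The reported "float and set" error is consistent with this: for a test user missing from train, the internal left merge fills the negative-item set with NaN (a float), and subtracting a set from NaN raises exactly that TypeError. A minimal sketch with toy frames mimicking the merge in ncf/dataset.py (hypothetical data and column names):

```python
import pandas as pd

# Users 1 and 2 appear in train; user 3 appears only in test.
train_status = pd.DataFrame({"userID": [1, 2],
                             "itemID_negative": [{5, 6}, {7}]})
test_status = pd.DataFrame({"userID": [1, 3],
                            "itemID_interacted_test": [{5}, {9}]})

# The left merge leaves NaN (a float) where user 3 has no training row.
merged = pd.merge(test_status, train_status, on="userID", how="left")
print(merged["itemID_negative"].isna().sum())  # 1

# Subtracting a set from that NaN reproduces the reported TypeError.
err = None
try:
    merged.apply(lambda row: row["itemID_negative"]
                 - row["itemID_interacted_test"], axis=1)
except TypeError as e:
    err = e
print(type(err).__name__)  # TypeError
```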

yueguoguo commented 5 years ago

Hey @karthikraja95, can you check the data? The error seems to happen during the negative sampling of the test dataset, in this code:

 # get negative pools for every user based on training and test interactions
            test_interact_status = pd.merge(
                test_interact_status, self.interact_status, on=self.col_user, how="left"
            )
            test_interact_status[
                self.col_item + "_negative"
            ] = test_interact_status.apply(
                lambda row: row[self.col_item + "_negative"]
                - row[self.col_item + "_interacted_test"],
                axis=1,
            )

where row[self.col_item + "_negative"] may end up holding a plain float rather than a set, so the subtraction fails because the - operand does not support float and set. One possible reason is that some users have seen all the items, so their "negative sample" pool contains 0 or 1 items and the set subtraction fails. Put another way, if a user has interacted with all the items, the negative sampling needs some exception handling.

Can you filter out the users that have interacted with all the items, try again, and let us know if it works?
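Filtering those "saturated" users is straightforward with a per-user distinct-item count. A minimal sketch with a hypothetical toy frame:

```python
import pandas as pd

# Toy data: user 1 has rated every item in the catalog (hypothetical).
df = pd.DataFrame({"userID": [1, 1, 1, 2],
                   "itemID": [10, 20, 30, 10],
                   "rating": [1.0, 2.0, 3.0, 4.0]})

n_items = df["itemID"].nunique()

# Count distinct items per user and drop users who have seen them all,
# so the negative-sample pool is never empty.
items_per_user = df.groupby("userID")["itemID"].nunique()
saturated = items_per_user[items_per_user == n_items].index
filtered = df[~df["userID"].isin(saturated)]
print(sorted(filtered["userID"].unique()))  # [2]
```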

karthikraja95 commented 5 years ago

print(len(train_wol.userID.unique()))  --- 5799
print(len(test_wol.userID.unique()))   --- 2260

@gramhagen I tried those out and, like you mentioned, they are not the same. Then I split the data with python_stratified_split() and printed the unique IDs again:

print(len(train_wol.userID.unique()))  --- 7205
print(len(test_wol.userID.unique()))   --- 614

The unique IDs are still not the same in the train and test sets, but now it WORKS... THANKS FOR HELPING OUT!

karthikraja95 commented 5 years ago

@yueguoguo Thanks for the suggestion. Your comment actually makes a lot of sense. Right now it's working with python_stratified_split(). Since I might still encounter this error, it's better to find those users and filter them out.

karthikraja95 commented 5 years ago

@gramhagen, @miguelgfierro, @yueguoguo, @AaronHeee Thanks ALL! I am working on an autoencoder-based collaborative filtering model for implicit data, and I want a working model to compare my results against. I thought NCF would be a good choice. I'm always open to suggestions and would really appreciate any ideas for other models to try. Thanks for helping me get it working.

miguelgfierro commented 5 years ago

hey @karthikraja95, I'm working on an autoencoder: https://github.com/microsoft/recommenders/issues/526. The algo is ready; I just need to find some time to clean the code and add the notebook. I'll push to staging soon.

karthikraja95 commented 5 years ago

@miguelgfierro Perfect. I will take a look at it. Thanks for sharing.

miguelgfierro commented 5 years ago

closing this, feel free to reopen in case there are more doubts

gramhagen commented 5 years ago

This can be caused by a discrepancy between the versions of your CUDA driver and the CUDA toolkit. I would suggest updating the NVIDIA CUDA toolkit, which should give you a compatible driver as well. FWIW, I believe we've been testing with version 9 (though 10.1 is the latest).

From: Karthik Raja, sent Tuesday, June 11, 2019 (quoted email notification):

Hey all! I got into another small issue.

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input> in <module>
      9     learning_rate=1e-3,
     10     verbose=10,
---> 11     seed=SEED
     12 )

~\Desktop\recommenders\recommenders-master\reco_utils\recommender\ncf\ncf_singlenode.py in __init__(self, n_users, n_items, model_type, n_factors, layer_sizes, n_epochs, batch_size, learning_rate, verbose, seed)
     87 gpu_options = tf.GPUOptions(allow_growth=True)
     88 # set TF Session
---> 89 self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
     90 # parameters initialization
     91 self.sess.run(tf.global_variables_initializer())

~\.conda\envs\reco_gpu\lib\site-packages\tensorflow\python\client\session.py in __init__(self, target, graph, config)
-> 1551         super(Session, self).__init__(target, graph, config=config)

~\.conda\envs\reco_gpu\lib\site-packages\tensorflow\python\client\session.py in __init__(self, target, graph, config)
--> 676             self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)

InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

When I checked nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan_9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60

Any help with this is appreciated. Do I need to update the driver, or something else?


karthikraja95 commented 5 years ago

@gramhagen Thanks for the feedback. I updated my CUDA toolkit to 10.1 and now the code is working fine. Thanks!

karthikraja95 commented 5 years ago

Hey all,

Is there a way in the codebase to recommend the top 10 items based on a particular user ID and a given item ID?

For example, given userID 1001 and some itemID (1234), is there a way to get the top 10 recommended items based on that userID and itemID?

gramhagen commented 5 years ago

the trained model has a predict method where you can provide a user and item, is that what you want?

It will only score that user-item pair though, so when you say you want top 10 recommended items for a user based on a single item, what does that mean? That the user has only rated one item?
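One common pattern for ranking is to score every candidate item for the user and keep the k best, excluding items the user has already interacted with. A minimal sketch with a stand-in scorer (the function names and IDs here are hypothetical, not the repo's API; a trained model's predict(user, item) would take the scorer's place):

```python
import heapq

def predict(user_id, item_id):
    # Stand-in scorer for illustration only; a trained NCF model's
    # predict(user, item) would return a learned relevance score instead.
    return 1.0 / (1 + (abs(user_id - item_id) % 97))

def top_k_items(user_id, candidate_items, seen_items=(), k=10):
    """Score every unseen candidate item for one user and keep the k best."""
    seen = set(seen_items)
    scored = ((predict(user_id, item), item)
              for item in candidate_items if item not in seen)
    return [item for _, item in heapq.nlargest(k, scored)]

# Hypothetical IDs from the question: user 1001, with item 1234 already seen.
recs = top_k_items(user_id=1001, candidate_items=range(1, 2000),
                   seen_items={1234}, k=10)
print(len(recs))  # 10
```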

gramhagen commented 5 years ago

Also, it's easier to track questions / problems if they are separate github issues. Since this one is already closed and related to a different concern it would be helpful to generate a new issue if there is still a question. -Thanks!