recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

RBM - how to get the affinity matrix from item_back_dict and user_back_dict #868

Open ghost opened 5 years ago

ghost commented 5 years ago

Description

I am trying to implement AzureML Hyperdrive-based hyperparameter tuning of the RBM algorithm using the example notebooks. I have a working RBM notebook with my dataset and I am using svd_training.py as a template for building my rbm_training.py file. As part of the RBM process, an affinity matrix is created and the training and test sets are built with the stratified splitter. I looked at the code and there is an optional parameter save_path that stores 4 numpy output files (item_back_dict.npy, item_dict.npy, user_back_dict.npy and user_dict.npy) after invoking it as follows:

am1m = AffinityMatrix(DF = data, **header, save_path = DATA_DIR)

I am uploading the train and validate .pkl data files to the default datastore from my local machine.
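For context, the upload step is just the standard AzureML SDK call, roughly as below (the file names and target path are placeholders for my pickled train/validate dataframes):

from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# push the locally pickled train/validate dataframes to the default blob datastore
datastore.upload_files(
    files=["./train.pkl", "./validate.pkl"],
    target_path="rbm_data",
    overwrite=True,
)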

During evaluation, the following code requires the affinity matrix object:

top_k_df_1m = am1m.map_back_sparse(top_k_1m, kind = 'prediction')
test_df_1m = am1m.map_back_sparse(Xtst_1m, kind = 'ratings')

How do I regenerate the affinity matrix object in the script that will be run remotely (rbm_training.py)? I was hoping to be able to use the four numpy files to enable map_back_sparse; I would rather not upload the entire dataset and then regenerate an AffinityMatrix object remotely.

The AffinityMatrix code in sparse.py mentions that the numpy files can be used with a trained model, but I am not sure how to load these 4 files to regenerate an AffinityMatrix object while the remote script executes.
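Concretely, what I was hoping to do inside rbm_training.py is something like the sketch below. The helper name map_back_from_files and the column names are just placeholders of mine, and I am assuming the back-dictionaries map matrix indices back to the original user/item IDs (np.save stores a dict as a 0-d object array, so .item() should recover it):

import numpy as np
import pandas as pd

def map_back_from_files(X, data_dir, col_user="userID", col_item="itemID", col_rating="rating"):
    # recover the saved index -> original-ID dictionaries written by AffinityMatrix(save_path=...)
    user_back = np.load(data_dir + "/user_back_dict.npy", allow_pickle=True).item()
    item_back = np.load(data_dir + "/item_back_dict.npy", allow_pickle=True).item()

    # non-zero entries of the dense UA-matrix are the rated (user, item) pairs
    users, items = np.nonzero(X)
    return pd.DataFrame({
        col_user: [user_back[u] for u in users],
        col_item: [item_back[i] for i in items],
        col_rating: X[users, items],
    })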

Other Comments

yueguoguo commented 5 years ago

Add @WessZumino for advice.

WessZumino commented 5 years ago

@atimesastudios let me see if I got your question right: you create the user/affinity (UA) matrix (for the train and test sets) locally on your machine and then upload these to your remote DSVM for training (or hyperparameter tuning). After training, you would like to get back the pandas df format from the UA-matrix in order to use the ranking metrics.

Yes, in the current version of sparse.py the load feature for the .npy files is missing, thanks for catching this. I will add the feature so that you can use your workflow.

In the meantime, you could generate the UA-matrix directly on the DSVM; the pandas df and the UA-matrix should be comparable in size if I am not mistaken, so there is no advantage in uploading the latter instead of the former.
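Roughly, regenerating everything on the DSVM would look like the following; this mirrors the RBM quickstart notebook, but double check the return values of gen_affinity_matrix() and the splitter arguments against the version you have installed:

from reco_utils.dataset.python_splitters import numpy_stratified_split
from reco_utils.dataset.sparse import AffinityMatrix

# rebuild the mapping object and the UA-matrix from the uploaded pandas df
am1m = AffinityMatrix(DF=data, **header)
X = am1m.gen_affinity_matrix()

# stratified split of the UA-matrix into train/test
Xtr_1m, Xtst_1m = numpy_stratified_split(X, ratio=0.75, seed=42)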

Also, note that top_k_1m is the UA-matrix restricted to only the top k items per user. If you want the full UA-matrix, you should use the predict() method of the rbm class instead of recommend_k_items().
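In code, the difference is roughly the following (model stands for your trained rbm instance, and the exact arguments and return values may differ a bit between versions):

# UA-matrix restricted to the top k items per user
top_k_1m = model.recommend_k_items(Xtst_1m)

# full UA-matrix, with a predicted affinity for every item
# (the current signature also takes a second `maps` argument)
vp_1m = model.predict(Xtst_1m, maps)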

ghost commented 5 years ago

Thanks @WessZumino, I appreciate the addition of the load feature in sparse.py. In the meantime I will follow your advice.

ghost commented 5 years ago

@WessZumino, predict takes two arguments, but the second one doesn't seem to get used. The documentation string only contains information about x.

def predict(self, x, maps):