rdevooght / sequence-based-recommendations

some questions about rnn #2

Open kwonmha opened 7 years ago

kwonmha commented 7 years ago

Hello. I read your paper and am looking into your code, and I have some questions about the RNN.

In rnn_one_hot.py: _prepare_networks()

  1. In your code: if not self.use_movies_features: l_recurrent = self.recurrent_layer(self.l_in, self.l_mask, true_input_size=self.n_items + self.n_optional_features(), only_return_final=True) there is a "not" in the if condition. Why did you set true_input_size like this when use_movies_features is false?

  2. self.recurrent_layer() calls __call__() of recurrent_layers.py. It seems to return only one layer (prev_layer), even though you wrote a for loop in that part. I would expect it to return multiple layers when RecurrentLayers.layers is set to something like 100-50-50, as in your example. Does it work as you intended?

  3. What does l_last_slice do? Should it exist?

Thanks for your help.

rdevooght commented 7 years ago

Hi,

Regarding your first question: first, I have to confess that all the parts about movies_features, optional_features, etc. are kind of legacy code that I used in my experiments to exploit extra information from the movielens dataset. I didn't want it in this public repo because it worked only on movielens, so I removed some of it from the code, but some pieces stayed in (partly due to laziness, and partly because I hope to be able to generalize it to any dataset at some point).

That being said, here is what's happening in the lines you pointed to: passing the argument "true_input_size" when constructing the recurrent layer causes it to accept a sparse encoding of the inputs. Unfortunately, in my implementation all the inputs must have the same number of non-zero elements (e.g. one for the movie id, one for the rating and one for the user age = 3 non-zero elements). This doesn't work when we use the "movies_features" of movielens, because those features are actually movie genres, and a movie can have any number of genres, which means that the number of non-zero elements in the input is not the same for every movie. Therefore, when you use those movie features, the recurrent layer uses a dense encoding of the input (you can see this difference reflected in the _get_features method of the RNNBase class). This could have been avoided with a more flexible type of sparse encoding, but that would have required more involved modifications of the Lasagne recurrent layers.
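
To make the two encodings concrete, here is a hypothetical sketch (the indices and feature layout are made up for illustration, not the repo's actual format):

```python
import numpy as np

n_items = 5  # made-up vocabulary size

# Sparse encoding (used when true_input_size is given): every input has
# the SAME number of non-zero elements, so it can be described by a
# fixed number of (index, value) pairs, e.g. movie id + rating + age.
sparse_input = [(2, 1.0),             # movie id 2 (one-hot part)
                (n_items, 4.0),       # rating
                (n_items + 1, 30.0)]  # user age

# Dense encoding (needed for the movielens genre features): a movie can
# have any number of genres, so the non-zero count varies and the input
# has to be passed as a full vector instead.
n_genres = 3
dense_input = np.zeros(n_items + 2 + n_genres, dtype='float32')
dense_input[2] = 1.0                   # movie id
dense_input[n_items] = 4.0             # rating
dense_input[n_items + 1] = 30.0        # age
dense_input[[n_items + 2, n_items + 4]] = 1.0  # this movie has genres 0 and 2
```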

Second question: self.recurrent_layer() does work as intended. It indeed returns only one layer (the last one), but this layer has a reference to its input layer, which has a reference to its own input layer, and so on. So by returning only the last layer I actually return the whole network.
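
For instance, here is a minimal Lasagne sketch (layer sizes made up, in the spirit of the 100-50-50 example) showing that the last layer is enough to recover the whole stack:

```python
import lasagne

# Each Lasagne layer keeps a reference to its incoming layer,
# so the chain can be walked back from the last layer alone.
l_in = lasagne.layers.InputLayer((None, None, 20))
l1 = lasagne.layers.GRULayer(l_in, num_units=100)
l2 = lasagne.layers.GRULayer(l1, num_units=50)
prev_layer = lasagne.layers.GRULayer(l2, num_units=50, only_return_final=True)

# Walking the incoming references recovers the whole network:
print(lasagne.layers.get_all_layers(prev_layer))
# -> [l_in, l1, l2, prev_layer]
```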

Third one: Indeed, l_last_slice is exactly the same as l_recurrent and I could have simply used l_recurrent. This is a relic of a time when they referred to different things, but I should clean that up now.

I hope this helps!

kwonmha commented 7 years ago

Thanks for the reply! Let me ask a few more questions.

  1. All the network parameters are shared across all users in each set, is that right?

  2. And are the training, testing, and prediction processes conducted by batch_size * users?

  3. The number of movies each user rated differs from user to user, so the network needs to process inputs of various lengths. If a user's rating record is shorter, does the network recur for fewer steps?

  4. Does the network predict only once (k movies) for each user? If not, does the movie with the largest softmax value become the next input?

I want to confirm some details. Thanks.

rdevooght commented 7 years ago

Hi,

  1. that's right
  2. I'm not sure what you mean by that. The training uses mini-batches whose size is set by batch_size. One epoch consists of one loop over all the users. Does that help?
  3. In Lasagne, RNNs use a mask, i.e. an additional input consisting of ones and zeros, with the zeros indicating that the corresponding step should be ignored. So the max length is the length of the mask, but any smaller length is possible (see the sketch after this list).
  4. It ranks all the items after the last step and recommends the top k. I tried feeding the highest-ranked item back as the next input instead, to observe the next k predicted items, but it gives worse results.
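
To illustrate point 3, a minimal Lasagne sketch of the mask mechanism (the shapes and sizes are made up):

```python
import numpy as np
import theano
import lasagne

# Two sequences padded to max_length = 4; the second is only 2 steps long.
l_in = lasagne.layers.InputLayer((None, 4, 3))
l_mask = lasagne.layers.InputLayer((None, 4))
l_rec = lasagne.layers.GRULayer(l_in, num_units=8, mask_input=l_mask,
                                only_return_final=True)

x = np.random.rand(2, 4, 3).astype('float32')
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]], dtype='float32')  # zeros = ignored steps

out = lasagne.layers.get_output(l_rec)
fn = theano.function([l_in.input_var, l_mask.input_var], out)
print(fn(x, mask).shape)  # (2, 8): one final hidden state per sequence
```
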
kwonmha commented 7 years ago

Thanks for your reply. I'm here to ask another question.

In rnn_base.py, your code builds the input not in one-hot form: it looks like [[23], [35], [57], ...], and the target looks like [[60]]. In this case, how does Theano's categorical_crossentropy calculate the error? Does it automatically behave the same as if the input and target were one-hot encoded? I can see how the error is calculated with CCE when the input and target are one-hot encoded, but I can't imagine how it works when they are not, and there's no parameter to select the type of encoding. Also, what if there are multiple movies in the target? Does CCE still work in that case, or should I use another kind of loss function?

Also, I couldn't reproduce the sps and other metrics from your paper. I got an sps of 30% on the ml1m dataset and 37% on the Netflix dataset, which are 3 percentage points lower than those in your paper. What exact RNN model was used?

Thanks.

kwonmha commented 7 years ago

Hi, it's sad that a month has gone by without any reply from you. I have another question. I tested with the hinge loss function and got an sps of 11.07% on the Netflix dataset, and the other metrics were also much worse than those I got with CCE. Is that the correct result, or is the multi-target loss function not finished yet? It would help me a lot if you could tell me the results you got with the hinge loss.

Thanks.

rdevooght commented 7 years ago

Hi,

Sorry for the late reply. The categorical cross-entropy function of Theano can indeed deal with both a one-hot encoding and a list of indices: http://deeplearning.net/software/theano/library/tensor/nnet/nnet.html#theano.tensor.nnet.nnet.categorical_crossentropy
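
A minimal sketch of the two accepted target formats (toy numbers, not the repo's actual tensors):

```python
import numpy as np
import theano
import theano.tensor as T

probs = T.matrix('probs')    # softmax output, shape (batch, n_items)
y_int = T.ivector('y_int')   # targets as item indices
y_1hot = T.matrix('y_1hot')  # targets as one-hot rows

loss_int = T.nnet.categorical_crossentropy(probs, y_int)
loss_1hot = T.nnet.categorical_crossentropy(probs, y_1hot)
f = theano.function([probs, y_int, y_1hot], [loss_int, loss_1hot])

p = np.array([[0.1, 0.7, 0.2]], dtype='float32')
print(f(p, np.array([1], dtype='int32'),
        np.array([[0., 1., 0.]], dtype='float32')))
# both losses equal -log(0.7)
```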

If you want to use multiple targets you have to use the hinge loss, but it is harder to train.
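For reference, a generic pairwise multi-target hinge loss might look like the sketch below. This is one common formulation, not necessarily the exact loss implemented in this repo, and with many items you would sample negatives rather than scoring all pairs as done here:

```python
import theano
import theano.tensor as T

scores = T.matrix('scores')    # (batch, n_items) network outputs
targets = T.matrix('targets')  # (batch, n_items), 1.0 for each target item

# diff[b, i, j] = score of item j minus score of target item i
diff = scores.dimshuffle(0, 'x', 1) - scores.dimshuffle(0, 1, 'x')
# hinge with margin 1 on pairs (target i, non-target j)
pairs = (T.maximum(0, 1 + diff)
         * targets.dimshuffle(0, 1, 'x')
         * (1 - targets).dimshuffle(0, 'x', 1))
loss = pairs.sum(axis=(1, 2)) / (targets.sum(axis=1) + 1e-8)

f = theano.function([scores, targets], loss)
```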

Concerning the results that you get: did you tune the learning rate properly and use early stopping to avoid overfitting? Try Adagrad with a learning rate of 0.1, or Adam with its default parameters. I'll see if I can upload a model that I trained.
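
As a sketch of that optimizer wiring in Lasagne (the tiny network and loss here are made-up stand-ins, not the repo's actual code):

```python
import theano.tensor as T
import lasagne

# Stand-in network just to show how the updates are constructed.
l_in = lasagne.layers.InputLayer((None, 10))
l_out = lasagne.layers.DenseLayer(l_in, num_units=3,
                                  nonlinearity=lasagne.nonlinearities.softmax)

y = T.ivector('y')
probs = lasagne.layers.get_output(l_out)
loss = T.nnet.categorical_crossentropy(probs, y).mean()
params = lasagne.layers.get_all_params(l_out, trainable=True)

# Adagrad with learning rate 0.1, as suggested above...
updates = lasagne.updates.adagrad(loss, params, learning_rate=0.1)
# ...or Adam with its default parameters:
# updates = lasagne.updates.adam(loss, params)
```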

Cheers, Robin

kwonmha commented 7 years ago

Hi,

It would help me a lot if you uploaded the models you trained. The more, the better. In particular, I want to check the CCE and hinge loss models for the Netflix dataset.

By the way, what do you mean by the hinge loss being harder to train? Does it take more time to train?

Thanks

rdevooght commented 7 years ago

Yes, it tends to take more time with the hinge loss.

kwonmha commented 7 years ago

Hi, would it be possible to upload the models you used in your experiments? The more kinds of models (RNN with CCE, hinge loss, BPR-MF, UKNN, etc.), the better.

And what were min_user_activity and the item popularity threshold set to for preprocessing? That could affect the performance. Thanks.

rdevooght commented 7 years ago

Hi,

I tried to add the data and the models to the repo using git lfs, but I couldn't make it work so far. In the meantime, you can get some models here: http://iridia.ulb.ac.be/~rdevooght/netflix.tar.gz

kwonmha commented 7 years ago

Hi, the data you uploaded was a great help. Moreover, I found that you trained your model with max_length 30 on the Netflix dataset. Did you also apply the same max_length parameter to the MovieLens 1M dataset? And I would like to know whether the model that produced the results in your paper was the one obtained when the validation performance was best, or was chosen from among multiple candidate models. Thanks for your help!

rdevooght commented 7 years ago

Hi,

Indeed, I used max_length = 30 on MovieLens as well. It is a rather arbitrary choice, but I observed that using longer sequences barely affects the results while slowing down the training.

The parameters of the models were tuned on a validation set using a random parameter search. We kept the parameters that optimised the sps@10. The test is made with models trained using those "optimal" parameters plus early stopping based on the validation set.
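
A hypothetical sketch of that procedure (train_and_eval and the searched parameters are made-up stand-ins, not this repo's actual tuning code):

```python
import random

def train_and_eval(params):
    """Stub: stands in for training a model with early stopping and
    returning its validation sps@10."""
    return random.random()

def random_search(n_trials=50):
    best_params, best_sps = None, -1.0
    for _ in range(n_trials):
        params = {
            'learning_rate': 10 ** random.uniform(-3, -0.5),
            'recurrent_layers': random.choice(['50', '100', '100-50']),
            'max_length': 30,
        }
        sps = train_and_eval(params)
        if sps > best_sps:
            best_params, best_sps = params, sps
    return best_params
```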

kwonmha commented 7 years ago

Oh, thanks for answering right away. Sorry to ask again, but did you remove rare items and users who saw few movies during preprocessing? I also think that selecting users randomly during the preprocessing step could affect the model's performance. I'm asking because, unfortunately, I keep failing to get a model with the same performance as the one in your paper. Thanks!

rdevooght commented 7 years ago

You can find the precise datasets and the training/validation/test splits that we used here: iridia.ulb.ac.be/~rdevooght/rnn_cf_data.zip

kwonmha commented 7 years ago

Hi, I've just read your recent paper and have a question. Have you tried Recall@20 and MRR@20 on the RSC15 data with your model? It would be great to see a comparison between your model and the one in "Improved RNN for Session-based Recommendations", and I thought you might have done it. If you did, please let me know the results your model obtains in the same setting (metrics) as "Improved RNN for Session-based Recommendations". Otherwise, I can do it.

kwonmha commented 7 years ago

Well, I'm training the RNN-CCE model on the RSC15 data. It's been 29 epochs and 5 days so far, but the highest sps reached is 0.32 and the highest recall is about 0.26. Could you give me any advice?