Hi,
First, what you did is correct in general, and thank you for sharing the optimal set you found. Many reasons can lead to different results on the same dataset, including different preprocessing, different approaches to tuning hyperparameters, different training-validation-test splits, and different evaluation methodologies.

For example, for the SR-GNN model: even though we used the same preprocessing as stated in the SR-GNN paper, the reported statistics are not the same (55K vs. 204K sessions and 32K vs. 43K items), because we used 5 overlapping splits instead of a single split. This alone can lead to different results on the same dataset. Moreover, since our prepared dataset differed from the one used in the SR-GNN paper, we tuned the hyperparameters on the validation set; the details of the optimization phase are reported in our paper.

Regarding using another set of hyperparameters as the optimal one: there should be a clear and reasonable selection procedure, and it must be carried out on the validation set, not on the test set, because in reality we do not know the test set. We only have the training set, part of which is held out as the validation set to tune the hyperparameters. The hyperparameters that are optimal on the validation set may not be the best ones on the test set, but we still must not tune on the test set. As mentioned in our paper, we ran a random search with 100 iterations over the parameter space defined in our conf files, so it is possible that we missed the actual best configuration. However, trying different hyperparameter sets on the test set (rather than the validation set) and choosing the one that gives the better/best results is not a valid procedure.

Furthermore, even with the same hyperparameters, different training and test sets or different evaluation methodologies can produce completely different results. That is why it is important for a paper to state all of these clearly, so readers can understand under which conditions the reported results were produced. The same reasons explain the different results reported for GRU4Rec across papers.
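To make the procedure concrete, here is a minimal sketch of the kind of validation-based random search we describe above. The parameter space, `train_fn`, and `eval_fn` below are placeholders for illustration, not our actual code:

```python
"""Minimal sketch of validation-based random search (placeholder names, not the session-rec code)."""
import random

# Hypothetical parameter space, analogous to the ranges defined in the conf files.
PARAM_SPACE = {
    "lr":    [0.01, 0.005, 0.001, 0.0005, 0.0001],
    "l2":    [1e-4, 1e-5, 5e-6, 1e-6],
    "lr_dc": [0.1, 0.3, 0.5, 0.75],
}

def random_search(train_fn, eval_fn, train_set, valid_set, space=PARAM_SPACE, n_iter=100):
    """Sample n_iter configurations and keep the one with the best validation score.

    The test set is never passed in here: model selection uses the validation split only.
    """
    best_score, best_config = float("-inf"), None
    for _ in range(n_iter):
        config = {name: random.choice(values) for name, values in space.items()}
        model = train_fn(train_set, config)      # placeholder training routine
        score = eval_fn(model, valid_set)        # e.g. MRR@20 on the validation split
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

The best configuration returned this way is then evaluated once on the test set.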
Best regards
Thanks!
Hi!
I still have some concerns about the parameter settings you chose. Would you please help me verify my workflow? First, I download the Diginetica dataset from the official link. Next, I use the config file "conf/preprocess/session-based/diginetica.yml" with preprocess.py to generate the processed datasets. Then, I use the config file "conf/save/diginetica/window/window_digi_sgnn.yml" with run_config.py to train the SR-GNN model. When I change the parameter settings (window_digi_sgnn.yml, line 35) from params: { lr: 0.0001, l2: 0.000007, lr_dc: 0.63, lr_dc_step: 3, nonhybrid: True, epoch_n: 10 } to params: { lr: 0.001, l2: 0.00001, lr_dc: 0.1, lr_dc_step: 3, nonhybrid: True, epoch_n: 10 }, I observe significant improvements in both the loss and the HR & MRR results (see the sketch at the end of this message). I checked this repository because SR-GNN has been shown to be more robust than previous methods on the Diginetica dataset throughout the recent deep-learning literature. But in your work
Empirical analysis of session-based recommendation algorithms (UMUAI-2020)
, the reported HR@20 of SR-GNN is only 0.3638, much lower than the official report of 0.5073. I understand that using a sub-dataset will degrade the performance of deep-learning-based methods, but the magnitude of the degradation seems too large. Besides, I have noticed that the best neural method you report is GRU4Rec (you use the official code of the improved version GRU4Rec-TopK (2018) rather than the original GRU4Rec (2016)). However, according to the results in
RepeatNet: A Repeat Aware Neural Recommendation Machine for Session-Based Recommendation (AAAI-2019)
, the HR@20 and MRR@20 are 0.4523 and 0.1416, but in your work Empirical analysis of session-based recommendation algorithms
, the HR@20 and MRR@20 are 0.4639 and 0.1644, even higher than the results reported on a larger dataset, which is strange. I am not sure whether I have made mistakes in my experiments, and I would appreciate a constructive response from you.
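For reference, here is how I read the two parameter sets above, written as a small sketch. I am assuming they are passed to Adam and a StepLR-style scheduler as in the official SR-GNN code; please correct me if the repository wires them differently, as the build_optimizer helper below is only illustrative:

```python
# My reading of the two parameter sets (illustrative only; assumes the repo maps them
# to Adam + StepLR as the official SR-GNN code does -- please correct me if not).
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

repo_params = {"lr": 0.0001, "l2": 0.000007, "lr_dc": 0.63, "lr_dc_step": 3}  # window_digi_sgnn.yml, line 35
my_params   = {"lr": 0.001,  "l2": 0.00001,  "lr_dc": 0.1,  "lr_dc_step": 3}  # the setting that works better for me
# nonhybrid and epoch_n are identical in both sets, so I omit them here.

def build_optimizer(model: nn.Module, p: dict):
    """Hypothetical helper: lr/l2 become Adam's learning rate and weight decay,
    and the learning rate is multiplied by lr_dc every lr_dc_step epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=p["lr"], weight_decay=p["l2"])
    scheduler = StepLR(optimizer, step_size=p["lr_dc_step"], gamma=p["lr_dc"])
    return optimizer, scheduler
```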
Best regards.