lpworld / DASL


[🐛BUG] Different results to those reported in the paper (+ 3 typos in the code) #2

Open alarca94 opened 2 years ago

alarca94 commented 2 years ago

Hello! I found your paper on cross-domain recommendation quite interesting and tried to reproduce its results with the code as is. However, I could not match the reported results in any of my experiments, especially in terms of AUC (see the table at the end). I would like to know if this is the definitive version that you used to report scores in the paper. If this was not the final version, would it be possible to obtain the final code or, at least, could you mention what changes to the uploaded code are necessary to reproduce the results? Thank you so much in advance!

Bug description:

I found 3 typos:

  1. (minor) train.py, line 29: data1 = data1[['user_id','video_id','click','hour']]. I guess you wanted to use data2 here as otherwise line 28 and line 29 are the same.
  2. (major) train.py, lines 82-83: Both trainset2 and testset2 are obtained from data belonging to domain 1, which means that the models are only trained and tested on domain 1 CTR prediction. The two lines are: trainset2 = trainset1[:len(trainset2)//batch_size*batch_size] and testset2 = testset1[:len(testset2)//batch_size*batch_size]
  3. (major) model.py, line 145: self.hist_cross_2: uij[1],. The cross historical sequence is at index 2. Otherwise, both self.hist_2 and self.hist_cross_2 receive the same sequence and there is no cross reference to learn from the other domain.
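To make typo 2 concrete, here is a minimal sketch using toy lists instead of the real datasets. The helper name `truncate_to_full_batches` and the tuple layout of the samples are assumptions made for illustration; they are not taken from the repository. It only shows that slicing `trainset1` with a length derived from `trainset2` yields domain 1 samples.

```python
def truncate_to_full_batches(samples, batch_size):
    """Drop trailing samples so the length is a multiple of batch_size."""
    usable = len(samples) // batch_size * batch_size
    return samples[:usable]


batch_size = 4
# Toy stand-ins: each sample is tagged with the domain it came from.
trainset1 = [("d1", i) for i in range(10)]
trainset2 = [("d2", i) for i in range(7)]

# Pattern reported in the issue: the length comes from trainset2,
# but the samples are sliced out of trainset1.
buggy_trainset2 = trainset1[:len(trainset2) // batch_size * batch_size]

# Presumably intended behaviour: truncate trainset2 itself.
fixed_trainset2 = truncate_to_full_batches(trainset2, batch_size)

print({domain for domain, _ in buggy_trainset2})
print({domain for domain, _ in fixed_trainset2})
```

With the buggy slice every sample carries the `d1` tag, so a model fed with it never sees domain 2 data.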

Experiment results:

The experiments I performed were using a) full Amazon dataset with the typos intact, b) full Amazon dataset without typos and c) sample dataset of the repository with typos. Next, I report the best result obtained for each of the metrics within the 1000 epochs of training (meaning that results reported for each experiment may belong to a different epoch number):

| Metric | Full with typos | Full without typos | Sample with typos | Original paper results |
|---|---|---|---|---|
| Train Toys AUC | 0.8937 | 0.8931 | 0.8947 | - |
| Test Toys AUC | 0.6429 | 0.5999 | 0.5302 | 0.8520 |
| Test Toys HR | 0.8174 | 0.8392 | 0.8098 | 0.7844 |
| Train Videogames AUC | 0.8935 | 0.8788 | 0.8946 | - |
| Test Videogames AUC | 0.6183 | 0.7103 | 0.5476 | 0.8511 |
| Test Videogames HR | 0.8174 | 0.7308 | 0.8098 | 0.7844 |
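The "best result per metric" reporting used for this table can be sketched as follows. The function name `best_per_metric`, the metric keys, and the numbers in `history` are invented for illustration and are not results from any experiment here. The point is that each metric's best value may come from a different epoch.

```python
def best_per_metric(history):
    """Return {metric: (best_value, epoch)} over a list of per-epoch metric dicts."""
    best = {}
    for epoch, metrics in enumerate(history, start=1):
        for name, value in metrics.items():
            # Keep the highest value seen so far and remember its epoch.
            if name not in best or value > best[name][0]:
                best[name] = (value, epoch)
    return best


# Made-up per-epoch evaluation results (illustrative only).
history = [
    {"test_auc": 0.60, "test_hr": 0.70},
    {"test_auc": 0.64, "test_hr": 0.68},
    {"test_auc": 0.62, "test_hr": 0.81},
]

best = best_per_metric(history)
print(best)
```

In this toy run the best AUC comes from epoch 2 and the best HR from epoch 3, so the reported pair does not correspond to a single checkpoint.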

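Note for the experiment:

  • I used tensorflow 1.4.0 as recommended in the code (with the tf_xla_cpu_global_jit flag to speed up computation)
  • I downloaded the version of the Amazon reviews dataset that is presented in here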

lpworld commented 2 years ago

Hello! Thank you so much for your interest in my paper and your reproduction efforts. That is really amazing work.

First of all, please accept my apology for the several typos in this repository (which have now been corrected). I didn't do a thorough check of the code when releasing it.

Second, regarding the inconsistencies in the numbers reported in the paper, it came to our notice before (as some readers have emailed us) that we used an incorrect AUC and HR@10 evaluation function in the original paper. Therefore, the actual numbers are somewhat different from what we reported in the paper (and very close to your numbers listed here). I planned to list the corrected numbers here but got distracted by other tasks. I will do so as soon as possible.
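For readers comparing numbers, the following is a minimal sketch of one common way to compute AUC and HR@K in plain Python. It is not the evaluation function used in the paper or in this repository. The function names, the input layout (one positive score and a list of negative scores per user), and the tie handling are all assumptions.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


def hit_rate_at_k(positive_scores, negative_scores, k=10):
    """HR@K: share of users whose positive item ranks in the top k candidates."""
    hits = 0
    for pos, negs in zip(positive_scores, negative_scores):
        # Number of negatives ranked at or above the positive (ties count against it).
        rank = sum(1 for n in negs if n >= pos)
        if rank < k:
            hits += 1
    return hits / len(positive_scores)


print(auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
print(hit_rate_at_k([0.8, 0.2], [[0.1, 0.5], [0.9, 0.3]], k=1))
```

The quadratic pair loop is only meant to make the definition explicit; a rank-based formula gives the same AUC more efficiently on large test sets.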

While I apologize for all the confusion, I do want to point out that the main conclusions of the paper do not change: our proposed model still outperforms the baseline models we listed in the paper, and it has achieved superior performance in our online experiments as well.

Thanks again for your comments, and wish you all the best in your research projects.

Best Regards, Pan

alarca94 commented 2 years ago

Hi Pan! Thank you so much for your quick reply.

I understand now why the results don't match. From my understanding of your reply, you re-ran the experiments with the correct metric implementation (which is the current version in the repository) and your model still came out on top (with respect to the other baselines). Is this correct?

Anyway, would it be possible for you to upload (maybe in another branch) the code with a running script to reproduce the exact experiment from the paper? It would just be for ease of comparison. Again, thank you so much in advance!

Kind regards,

Alejandro

lpworld commented 2 years ago

Hi Alejandro,

Thanks for your comment. Indeed, we have re-run the experiment with the correct AUC/HR metric, and our model still significantly outperforms the other baselines we listed in the paper. I will post the updated results in this repository later.

Sorry for the confusion. Originally we were using a piece of code from Stack Overflow which we thought was computing AUC/HR, but it turned out to be computing something totally different. Unfortunately, I do not have that code now, but we would like to thank you again for your careful observation and all the hard work in reproducing the results of this paper. Wish you all the best.

Best Regards, Pan

hulkima commented 2 years ago

Note for the experiment:

  • I used tensorflow 1.4.0 as recommended in the code (with the tf_xla_cpu_global_jit flag to speed up computation)
  • I downloaded the version of the Amazon reviews dataset that is presented in here

Hi! I am a doctoral student from China. I also focus on cross-domain recommendation and plan to reproduce this paper with the code in this repository. I just found out that you opened this issue. Could you please share the corrected reproduction code with me? I promise it will only be used for academic research. Thank you!