Some thing wrong with the evaluation process or the sampling method?

iYiYaHa / PyTorch-IRGAN

PyTorch-IRGAN is a PyTorch version implementation of the item recommendation part of IRGAN.

11 stars 7 forks source link

Some thing wrong with the evaluation process or the sampling method? #1

Open Namco0816 opened 4 years ago

Namco0816 commented 4 years ago

Hey I actually implemented the model by myself and running it using the metrics provided in your repo. The performance is not good and NDCG3 always higher than the NDCG10 with 10 or 20 percent. I thought that there may be some mistakes with my model. Then I simply run the code provided in your repo to check your model's result. However, the result is the same as my own. I think it seems like no error in the model implementation. So I am wondering that maybe there are some mistakes in the evaluation process or the sampling methods?

iYiYaHa commented 4 years ago

Sorry for the delay in getting back to you.

Actually, the metrics evaluation part used in this repo is copied from other repositories. From the formula and the implementation of NDCG, it looks fine. So if you think something is wrong with the evaluation process, could you please cite the line of the code? And we can check it again.

Is the sampling methods you referred is the importance sampling method used in their official implementation? My implementation of the importance sampling method tries to speed the sampling process up in a matrix-wise manner. I think the implementation is correctly following the formula.

BTW, from my own experiments on my own implementation and other implementations such as RecQ, the result is not as good as reported. So I'm not sure whether my implementation is correct after trying to tune the hyperparameters many times.

Namco0816 commented 4 years ago

Yes, it seems like there is something wrong with the evaluation process or the sampling methods. I simply translate the code from the official TF implementation to the PyTorch and run with both the model I re-implemented and you provided. The performance seems not bad and only lower than reported results with 2% or 3% NDCG points. I noticed that in the result generated by your scripts, the NDCG10 and P10, are always higher than the NDCG5 and P5 and followed by the NDCG3 and P3, this order seems not correct. However, I checked your codes again and again and could find nothing wrong. That is weird.

iYiYaHa commented 4 years ago

Sorry again for the delay in getting back to you.

Do you mean your implementation of IRGAN and results performs well while the results of my scripts are not correct?

No matter whether my implementation of the model is correct or not, the results should be with an order like NDCG10>=NDCG5>=NDCG3 which is not what truly happens with my scripts currently. This may indicate that there is something wrong with my evaluation process. However, I have checked the implementation of evaluation metrics computing process many times. And I also didn't find anything wrong.

This makes me feel bad...

Jeriousman commented 2 years ago

Was there any error? or have you fixed it? @iYiYaHa

iYiYaHa commented 2 years ago

Was there any error? or have you fixed it? @iYiYaHa

Hi, @Jeriousman. Sorry for the late reply. I'm not sure if there is any error about this implementation's evaluation process. I haven't run the model for about two years. So this implementation is just for your reference currently.

Jeriousman commented 2 years ago

Thank you for the heads-up! @iYiYaHa