Closed ArnoldGaius closed 3 years ago
I'm trying to reproduce your problem, but I got the error "one of the variables needed for gradient computation has been modified by an inplace operation" when I ran deepctr-torch 0.2.1 with torch 1.6.0. Which torch version did you use with deepctr-torch 0.2.1? (deepctr-torch 0.2.1 with torch>=1.5 triggers this error, which was fixed in v0.2.2, so I suspect you ran deepctr-torch 0.2.1 with an earlier torch version.)
In addition, could you provide a set of parameters? From your files, I notice that the parameters in 0.2.1 and 0.2.3 are different (batch_size, dnn_hidden_units, l2_reg_dnn and dnn_dropout differ).
0.2.1:
0.2.3:
> I'm trying to reproduce your problem. But I got the error "one of the variables needed for gradient computation has been modified by an inplace operation" when I use torch 1.6.0 to run deepctr-torch 0.2.1. What's the torch version when you run deepctr-torch 0.2.1?
Thank you for your reply. The torch version I used to run v0.2.1 is 1.4.0. I am sorry for the inconsistent parameters in the files I uploaded. In fact, batch_size, dnn_hidden_units, l2_reg_dnn and dnn_dropout should be 1024, (400,400,400), 0 and 0.3 respectively, i.e. adjusted to match the parameters of version 0.2.1.
:joy: This is because you set a smaller lr (1e-4) in your v0.2.3 folder: remove it and use the default lr (1e-3), and you will get normal performance.
Besides, it is not recommended to set the learning rate in basemodel.py; you can set it more easily in your main file:

```python
from torch.optim import Adam
...
model.compile(Adam(model.parameters(), 1e-4), "binary_crossentropy",
              metrics=["binary_crossentropy", "auc"])
```
This is listed in https://deepctr-torch.readthedocs.io/en/latest/FAQ.html#set-learning-rate-and-use-earlystopping
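The effect of the smaller learning rate can be seen even in a toy setting. The sketch below (plain Python, a hypothetical quadratic loss, not deepctr-torch code) shows why lr=1e-4 can look far worse than the default 1e-3 after the same training budget:

```python
# Toy illustration (hypothetical loss, not the actual model):
# plain gradient descent on f(w) = (w - 3)^2, optimum at w = 3.

def train(lr, steps=2000, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)  # df/dw
        w -= lr * grad
    return w

w_default = train(1e-3)  # ends close to the optimum w = 3
w_small = train(1e-4)    # still far away after the same number of steps
```

With the same step budget, the 1e-4 run covers only a fraction of the distance to the optimum, which is consistent with the poor first-epoch numbers reported for the small-lr configuration.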
Yes, the way you set the Adam learning rate is textbook-like, and I am sorry for my confusing coding style. However, I have already changed the learning rate to 1e-3 in the code (a learning rate of 1e-4 makes it easier to fall into a local minimum; in version 0.2.3, a learning rate of 1e-3 makes the model overfit quickly), and the gap in AUC still exists. I have tried each of them five times. As you can see from the picture below, the 0.2.3 model fits very quickly and converges within the first epoch, while the 0.2.1 model learns more slowly; the loss and AUC of the two versions differ hugely.
This is caused by the L2 regularization. Set all the l2_reg parameters (l2_reg_linear, l2_reg_embedding, l2_reg_dnn) to 0 in v0.2.3 and you will get the same performance as v0.2.1.
In fact, regularization only works from v0.2.2 onward, where we fixed the bugs around regularization. In previous versions, reg_loss was computed only once, from the initial parameters, which means that reg_loss was just a constant. This bug can be found here. In v0.2.2, we fixed it by storing the necessary parameters in self.regularization_weight and recalculating reg_loss in each iteration:
https://github.com/shenweichen/DeepCTR-Torch/blob/bc881dcd417fec64f840b0cacce124bc86b3687c/deepctr_torch/models/basemodel.py#L371-L386
https://github.com/shenweichen/DeepCTR-Torch/blob/bc881dcd417fec64f840b0cacce124bc86b3687c/deepctr_torch/models/basemodel.py#L228-L230
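A minimal sketch of the bug (plain Python with a hypothetical toy class, not the actual basemodel.py code): a penalty computed once at construction never reacts to weight updates, while recomputing it each call does.

```python
# Hypothetical toy model illustrating the pre-v0.2.2 behaviour:
# the L2 penalty was evaluated once from the initial parameters
# and then acted as a constant for the rest of training.

class ToyModel:
    def __init__(self, weights, l2=1e-4):
        self.weights = list(weights)
        self.l2 = l2
        # buggy behaviour: penalty frozen at construction time
        self.reg_loss_constant = self.l2 * sum(w * w for w in self.weights)

    def reg_loss(self):
        # fixed behaviour (v0.2.2+): recomputed from the current
        # parameters on every call, i.e. in every training iteration
        return self.l2 * sum(w * w for w in self.weights)

m = ToyModel([1.0, 2.0])
m.weights = [10.0, 20.0]      # parameters change during training
stale = m.reg_loss_constant   # still based on the initial [1.0, 2.0]
fresh = m.reg_loss()          # tracks the current [10.0, 20.0]
```

Since the constant penalty has zero gradient with respect to the weights, the pre-v0.2.2 versions effectively trained without any L2 regularization at all, which is why zeroing the l2_reg parameters in v0.2.3 reproduces the old results.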
However, it's strange that the model performance drops when we actually apply L2 regularization. Perhaps your dataset (122k samples) is too small, so I used a larger dataset to run an experiment: FiBiNET with the same parameters on a subset of the avazu dataset (data from the first 3 days, 13.3 million samples in total: 9.46M for training, 0.77M for validation, 3.1M for test). Using L2 reg yields a significant improvement:
no l2 reg : l2 reg = 1e-4 :
I suggest you use more samples of criteo. Besides, I can provide the necessary code if you're interested in my experiment.
Thank you for your help. I would not have known about this bug fixed in version 0.2.2 without your notification. I have adjusted the l2 parameters as you suggested, and the result is as you said. I appreciate your willingness to share the necessary code. My email is jiangcmd@foxmail.com. Looking forward to your reply. Sincerely, Bain
Experiment code for the avazu dataset can be found in the experiment branch:
https://github.com/shenweichen/DeepCTR-Torch/tree/experiment
Follow the steps in README.md.
Please feel free to contact me if you have any questions.
Describe the question I have tried versions 0.2.1 and 0.2.3 of DeepCTR-Torch, but the same dataset and parameters produce a large difference in AUC and loss between the two versions. In this experiment, I tried FiBiNET and NFM respectively; the problem I encountered is that the AUC under version 0.2.3 is lower than under version 0.2.1.
Additional context I have uploaded the two versions of the project to Baidu Netdisk. The test data is a randomly selected 25M Criteo dataset. After downloading, run run.py in each of the two projects to see the different AUC results. This is the download link:
Link: https://pan.baidu.com/s/1191pHvL3wMaCM5TsAo4jgA Extraction code: hexw
Please download it and troubleshoot this issue; thank you again for your contribution.
Operating environment