tengshaofeng opened this issue 6 years ago
@tengshaofeng I also ran into the training issue; have you fixed it yet?
@lipond, I tried modifying his code to test on the CIFAR test set of 10,000 samples.
@tengshaofeng what is the result? Does your training process seem normal? I ran the training code but got a fixed cost, so something is apparently wrong.
@tengshaofeng @lipond thank you for the comments. I tested the model on CIFAR-10 for debugging and it worked, but I didn't save the result... Could you tell me the problem so that I can fix the code?
@koichiro11 The training log below does not look normal, right? Thanks.
EPOCH: 0, Training cost: 2.36071372032, Validation cost: 2.36695027351, Validation Accuracy: 0.0942
EPOCH: 5, Training cost: 2.36067867279, Validation cost: 2.36695027351, Validation Accuracy: 0.0942
EPOCH: 10, Training cost: 2.36069869995, Validation cost: 2.36695027351, Validation Accuracy: 0.0942
@xiaoganghan hmm... As you said, the training log does not look good. I'm not sure, but maybe it is because the number of attention modules is too large (in my experience).
I had tested on CIFAR-10 and the training went well. I will check my code and train again when I have time.
@koichiro11 Thank you for the reply. It's weird: the above log is the output of your latest code without any changes, and it's on CIFAR-10 for sure.
I want to train it on CIFAR-10 and visualize the masks to see how well the residual attention model works. In this case, do you think it's a training-parameter issue or an attention-module issue? Is the latest commit the version you used to train successfully on CIFAR-10, or is there a previous commit I should try? I only want to train on CIFAR-10 anyway. Thank you again for your prompt reply.
@lipond, I have tested on the 10,000 test samples of CIFAR-10; the accuracy is only 87%.
@tengshaofeng what changes have you made to achieve 87% accuracy? Thank you.
@xiaoganghan, @tengshaofeng, can you provide your test script? I made one but got a very low accuracy. Thank you.
Sorry, the result is not from this code; it is from another PyTorch project. The following is the result:
Accuracy of the model on the test images: 86 %
Accuracy of plane : 88 %
Accuracy of car : 93 %
Accuracy of bird : 79 %
Accuracy of cat : 74 %
Accuracy of deer : 85 %
Accuracy of dog : 79 %
Accuracy of frog : 89 %
Accuracy of horse : 90 %
Accuracy of ship : 92 %
Accuracy of truck : 91 %
@xiaoganghan, I also met the problem you described: the loss does not decrease. @lipond, @josianerodrigues, you can refer to my PyTorch project: https://github.com/tengshaofeng/ResidualAttentionNetwork-pytorch
@tengshaofeng Thank you for the reply and for sharing your project with us.
@josianerodrigues, my pleasure.
Hi @tengshaofeng, how long does this code take to run on average? Sorry for taking up your time, and thank you for considering my question.
This code takes about 7 minutes every 5 epochs. The following is the training log:
start to train ResidualAttentionModel
load CIFAR-10 data...
load data from pickle
build graph...
check shape of data...
train_X: (45000, 32, 32, 3)
train_y: (45000, 10)
start to train...
EPOCH: 0, Training cost: 2.3612446785, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
EPOCH: 5, Training cost: 2.36119961739, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
EPOCH: 10, Training cost: 2.36124396324, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
EPOCH: 15, Training cost: 2.36119961739, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
EPOCH: 20, Training cost: 2.36122179031, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
EPOCH: 25, Training cost: 2.36119961739, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
EPOCH: 30, Training cost: 2.36122179031, Validation cost: 2.36055040359, Validation Accuracy: 0.1006
save model...
save model...
It seems it has not converged: the cost is stuck near ln(10) ≈ 2.30, which is what a 10-class cross entropy gives for uniform predictions, and the validation accuracy of 0.1006 is chance level.
I got the same result; it did not really converge. I was referring to the runtime of the PyTorch implementation you shared with us.
I just reran my code; the following is the training log of my PyTorch implementation:
for res_att_92 net:
Epoch [1/100], Iter [100/1429] Loss: 2.1525
Epoch [1/100], Iter [200/1429] Loss: 1.7000
Epoch [1/100], Iter [300/1429] Loss: 1.7273
Epoch [1/100], Iter [400/1429] Loss: 1.4131
Epoch [1/100], Iter [500/1429] Loss: 1.5592
Epoch [1/100], Iter [600/1429] Loss: 1.6161
Epoch [1/100], Iter [700/1429] Loss: 1.3315
Epoch [1/100], Iter [800/1429] Loss: 1.0377
Epoch [1/100], Iter [900/1429] Loss: 1.3492
Epoch [1/100], Iter [1000/1429] Loss: 1.3490
Epoch [1/100], Iter [1100/1429] Loss: 1.3188
Epoch [1/100], Iter [1200/1429] Loss: 1.3300
Epoch [1/100], Iter [1300/1429] Loss: 1.1882
Epoch [1/100], Iter [1400/1429] Loss: 0.9603
the epoch takes time: 1051.79760003
Epoch [2/100], Iter [100/1429] Loss: 0.9891
Epoch [2/100], Iter [200/1429] Loss: 1.2262
Epoch [2/100], Iter [300/1429] Loss: 0.9173
Epoch [2/100], Iter [400/1429] Loss: 1.1978
Epoch [2/100], Iter [500/1429] Loss: 0.9160
Epoch [2/100], Iter [600/1429] Loss: 0.8897
Epoch [2/100], Iter [700/1429] Loss: 0.7859
Epoch [2/100], Iter [800/1429] Loss: 0.8977
Epoch [2/100], Iter [900/1429] Loss: 0.6515
Epoch [2/100], Iter [1000/1429] Loss: 0.9553
Epoch [2/100], Iter [1100/1429] Loss: 0.9544
Epoch [2/100], Iter [1200/1429] Loss: 1.2661
Epoch [2/100], Iter [1300/1429] Loss: 0.9071
Epoch [2/100], Iter [1400/1429] Loss: 0.7281
the epoch takes time: 1053.76822901
for the res_att_56 net:
Epoch [1/100], Iter [100/1429] Loss: 1.9677
Epoch [1/100], Iter [200/1429] Loss: 1.7845
Epoch [1/100], Iter [300/1429] Loss: 1.7899
Epoch [1/100], Iter [400/1429] Loss: 1.7015
Epoch [1/100], Iter [500/1429] Loss: 1.4097
Epoch [1/100], Iter [600/1429] Loss: 1.4999
Epoch [1/100], Iter [700/1429] Loss: 1.2078
Epoch [1/100], Iter [800/1429] Loss: 1.4107
Epoch [1/100], Iter [900/1429] Loss: 1.6492
Epoch [1/100], Iter [1000/1429] Loss: 1.8750
Epoch [1/100], Iter [1100/1429] Loss: 1.7730
Epoch [1/100], Iter [1200/1429] Loss: 1.3797
Epoch [1/100], Iter [1300/1429] Loss: 1.2181
Epoch [1/100], Iter [1400/1429] Loss: 1.3505
the epoch takes time: 654.214586973
Epoch [2/100], Iter [100/1429] Loss: 1.1204
Epoch [2/100], Iter [200/1429] Loss: 1.7548
Epoch [2/100], Iter [300/1429] Loss: 1.3137
Epoch [2/100], Iter [400/1429] Loss: 1.0649
Epoch [2/100], Iter [500/1429] Loss: 0.9719
Epoch [2/100], Iter [600/1429] Loss: 1.2086
Epoch [2/100], Iter [700/1429] Loss: 0.9056
Epoch [2/100], Iter [800/1429] Loss: 0.8379
Epoch [2/100], Iter [900/1429] Loss: 0.6485
Epoch [2/100], Iter [1000/1429] Loss: 0.8086
Epoch [2/100], Iter [1100/1429] Loss: 0.9019
Epoch [2/100], Iter [1200/1429] Loss: 0.9073
Epoch [2/100], Iter [1300/1429] Loss: 0.9322
Epoch [2/100], Iter [1400/1429] Loss: 1.1767
the epoch takes time: 658.405325174
So one epoch takes about 1000 and 650 seconds for res_att_92 and res_att_56 respectively, with the training batch size set to 35.
Does the log come out empty? How long does one epoch take?
@josianerodrigues, I am running the code right now; the answer is above.
Thanks for your help, @tengshaofeng :)
@tengshaofeng @lipond @xiaoganghan @josianerodrigues
thank you for your comments and the discussion.
I have now fixed the code and created a pull request.
The reason the loss doesn't decrease is the softmax function: the output of the final FC layer is relatively large, so I introduced layer normalization before the FC layer.
Now the loss decreases well, but I don't know why the output of the final FC layer is relatively large (in fact, neither this nor the necessity of layer normalization is mentioned in the paper).
If you can find the reason or an error, please tell me.
Thanks again.
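To illustrate the point about the softmax, here is a minimal NumPy sketch (illustrative only, not code from either repo): when the FC output has large magnitude, the softmax saturates to a near one-hot vector, the gradients through it vanish, and the cross-entropy cost barely moves, which matches the flat logs above. Normalizing the activations keeps the logits in a range where the softmax is still trainable.

import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Large-magnitude logits: the output is essentially one-hot on the first
# class, so small weight updates barely change it and training stalls.
large = np.array([30.0, -25.0, 10.0, 5.0, -40.0,
                  0.0, 12.0, -8.0, 3.0, -15.0])
print(softmax(large))

# Normalizing the pre-FC activations (the layer-normalization idea) keeps
# the logits moderate, giving a soft distribution the optimizer can shape.
normed = (large - large.mean()) / (large.std() + 1e-5)
print(softmax(normed))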
@koichiro11 Thank you for the corrections and for sharing them with us.
@koichiro11, Could you please share the tensorflow, keras and python versions that you used to run this implementation?
Hi everybody, @lipond @koichiro11 @xiaoganghan @josianerodrigues. I modified my code at https://github.com/tengshaofeng/ResidualAttentionNetwork-pytorch: I changed the input to 32x32 instead of 224x224 and also built a new architecture for CIFAR-10 called ResidualAttentionModel_92_32input. The new result on the CIFAR-10 test set is: Accuracy of the model on the test images: 92.66%. I am sure you can do better based on my code, because there are some tricks to try.
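For context, the usual way an ImageNet-style network is adapted to 32x32 inputs is to replace the aggressive downsampling stem; this hypothetical PyTorch sketch shows the idea (the actual ResidualAttentionModel_92_32input in the repo may differ in details):

import torch.nn as nn

# ImageNet-style stem: fine for 224x224 inputs, but it would shrink a
# 32x32 image to 8x8 before any attention module sees it.
stem_224 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)  # 224x224 -> 56x56

# CIFAR-style stem: keeps the full 32x32 resolution for the later stages.
stem_32 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)  # 32x32 -> 32x32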
Hi @tengshaofeng, I would like to use batch size 64, but I run out of GPU memory. Is there any optimization that can prevent this?
@josianerodrigues, reduce the batch size until the memory overflow no longer happens. Or you can use multiple GPUs with distributed training, or use the network with fewer parameters, like res_att_56.
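For the multi-GPU suggestion, a minimal sketch using nn.DataParallel, which splits each batch across the visible GPUs (the tiny Sequential model here is a stand-in for the real ResidualAttentionModel, not the repo's API):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # A batch of 64 becomes 32 per GPU on 2 GPUs, halving per-GPU memory.
    model = nn.DataParallel(model)
    model = model.cuda()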
@lipond @koichiro11 @xiaoganghan @josianerodrigues
Hi everyone, I fixed my code's learning-rate decay. In the last version I had switched the optimizer to SGD as the paper says, but I had not updated it in the learning-rate decay step. The newest result on the CIFAR-10 test set is an accuracy of 0.9354.
code: https://github.com/tengshaofeng/ResidualAttentionNetwork-pytorch
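For anyone reproducing this, a minimal sketch of SGD with stepwise learning-rate decay in PyTorch; the hyperparameters here (lr 0.1, momentum 0.9, weight decay 1e-4, decay by 10x at fixed epochs) follow the common ResNet-style recipe and are assumptions, not values read from the repo:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(32 * 32 * 3, 10)   # stand-in for the real network
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4, nesterov=True)

def adjust_learning_rate(optimizer, epoch, schedule=(80, 120), gamma=0.1):
    """Multiply the learning rate by gamma at each epoch in `schedule`."""
    if epoch in schedule:
        for group in optimizer.param_groups:
            group['lr'] *= gamma

for epoch in range(1, 161):
    adjust_learning_rate(optimizer, epoch)
    # ... train one epoch with this optimizer ...

Decaying in place via optimizer.param_groups (rather than recreating the optimizer) also preserves the SGD momentum buffers across the decay step.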
@tengshaofeng thank you for sharing with us.
@josianerodrigues I use Python 3.6, TensorFlow 1.4, Keras 2.0.8.
@tengshaofeng thank you for sharing your code. Now I am fixing the model.
In your code, ResidualAttentionNetwork-pytorch/Residual-Attention-Network/model/attention_module.py, the skip connection adds not only the output of the residual unit but also the output of the first residual unit, e.g. line 422:
out_interp = self.interpolation1(out_middle_2r_blocks) + out_down_residual_blocks1
Is it intentional? And is it effective?
@koichiro11, yes, it is intentional; I referenced the Caffe project. You can remove the addition to check whether it is effective.
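For readers following along, a minimal PyTorch sketch of the pattern being discussed: inside the soft mask branch, the upsampled deep features are added to a skip connection saved from the matching level of the down path, U-Net style. Shapes and names here are hypothetical, not the repo's actual variables.

import torch
import torch.nn.functional as F

out_down = torch.randn(1, 64, 16, 16)   # saved before downsampling
out_deep = torch.randn(1, 64, 8, 8)     # after deeper residual blocks

# Upsample back to the skip's resolution, then add (the line asked about).
out_up = F.interpolate(out_deep, size=out_down.shape[2:],
                       mode='bilinear', align_corners=False)
out = out_up + out_down
print(out.shape)                        # torch.Size([1, 64, 16, 16])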
@koichiro11 Hi! Can you share your results on CIFAR-10 with us? Does this code reach the accuracy reported in the paper?
@tengshaofeng
when I run this code, it shows me: AttributeError: 'UpSampling2D' object has no attribute 'outbound_nodes'
I found issue #4.
Could you give me some suggestions?
Thanks.
@alyato, sorry, I have not met that problem.
@tengshaofeng Thanks. I don't know if my versions are wrong: tensorflow 1.4.0, keras 2.1.4, python 2.7.
Some other people have hit the same issue, see #4.
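Since the error looks version-related, one untested thing to check is whether the installed versions match the combination @koichiro11 reported working above (Python 3.6, TensorFlow 1.4, Keras 2.0.8):

import tensorflow as tf
import keras

# Print the installed versions for comparison; a mismatched Keras version
# is a plausible (unconfirmed) cause of the 'outbound_nodes' error.
print('tensorflow:', tf.__version__)
print('keras:', keras.__version__)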
I used the TensorFlow version of the code; it ran to the 56th epoch and ended automatically. The validation accuracy is only 80%, and I don't know why.
Hi, @alyato
Have you fixed this problem? I have the same problem with python 3.6, tensorflow 1.10.0. Let me know if you got it done, much appreciated!
Hi @koichiro11, I really appreciate your great work. Have you tested the model on CIFAR-10? What is the result?