Closed · jain-avi closed this issue 6 years ago
@Neo96Mav, which network did you use? Or did you modify the network yourself based on my code?
I used your network and the official Caffe network for reference, and implemented my own small network. I am not using attention modules at the 4x4 stage because I feel those feature maps are too small, and I am only using one attention module at 8x8. My network is relatively small, and it is for CIFAR images only. Can you let me know the intuition behind this:
You have added the output of the residual block, as well as the output of the skip connection, to the upsampled layer!
@Neo96Mav, this follows the Caffe reference network; I think the extra addition is there to carry more detailed information up from that resolution. You can remove it to test its effectiveness.
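For concreteness, here is a minimal PyTorch sketch of the merge step being discussed inside the soft mask branch. The tensor names (`out_mask`, `out_skip`, `out_trunk`) are hypothetical and only illustrate the two variants; this is not lifted from either repo.

```python
import torch.nn.functional as F

def merge_after_upsample(out_mask, out_skip, out_trunk, size):
    """Merge step after upsampling in the soft mask branch (sketch).

    out_mask  : feature map coming up from the bottom of the mask branch
    out_skip  : output of the skip connection (pool + conv path) at this resolution
    out_trunk : output of the residual block at this resolution
    size      : target spatial size for interpolation
    """
    out_up = F.interpolate(out_mask, size=size, mode='bilinear', align_corners=False)

    # Variant discussed above (following the Caffe reference): add both the
    # skip-connection output and the residual-block output to the upsampled map.
    merged_with_extra_add = out_up + out_skip + out_trunk

    # Simpler variant to test: add only the skip-connection output.
    merged_plain = out_up + out_skip

    return merged_with_extra_add, merged_plain
```

Dropping the extra `out_trunk` term is the "remove it for testing" suggestion above; the rest of the module is unchanged.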
Hi @Neo96Mav, did you test the model using only one 8x8 attention module? Was the accuracy better?
Hi @josianerodrigues, I added the 4x4 attention module as well. I am stuck at 89.5% accuracy. Maybe my model is not big enough or I am not using the exact same configuration, but I feel that should not have hurt it this much. @tengshaofeng, do you have any idea why we can't match the authors' performance?
@Neo96Mav, the paper only gives the architecture details of Attention-92 for ImageNet with 224x224 input, not for CIFAR-10, so I built ResidualAttentionModel_92_32input following my own understanding. I have tested it on the CIFAR-10 test set, and the result is as follows: accuracy of the model on the test images: 0.9354.
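For readers adapting the 224-input architecture themselves, a common way to change the stem for 32x32 inputs is sketched below. The channel counts and the decision to drop the max pool are assumptions for illustration, not necessarily what ResidualAttentionModel_92_32input does.

```python
import torch.nn as nn

# ImageNet-style stem (224x224 input): aggressive early downsampling.
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# CIFAR-style stem (32x32 input): keep full resolution so the later
# attention stages still see 32x32 / 16x16 / 8x8 feature maps.
cifar_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)
```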
Maybe some details are not quite right. You could follow the data preprocessing described in the paper and keep it the same as the authors', or tune the hyperparameters for better performance. You can also remove the add operation to test the network.
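On the preprocessing point, a common CIFAR-10 recipe (4-pixel padding, random 32x32 crop, random horizontal flip, per-channel normalization) looks like this with torchvision; whether it exactly matches the authors' setup, and the mean/std values used here, are assumptions to check against the paper.

```python
import torchvision.transforms as transforms

# Commonly used CIFAR-10 channel statistics (assumed, not from the paper).
normalize = transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                 std=[0.2470, 0.2435, 0.2616])

# Training: pad 4 pixels, random 32x32 crop, random horizontal flip, normalize.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

# Testing: no augmentation, only normalization.
test_transform = transforms.Compose([
    transforms.ToTensor(),
    normalize,
])
```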
@Neo96Mav @josianerodrigues the result is now 0.954.
Can you tell me whether your training and testing accuracies always tracked each other? I am implementing a smaller, modified version of the network you coded, and my test accuracy seems to have stagnated at 81%. Also, I think you have coded a different architecture, because you add the output of the pool layer as well as the output of the pool+conv layer to the upsampled input, while the actual architecture only adds the pool+conv output to the upsampled layer. Is that making all the difference?