HONGJINLYU opened this issue 3 years ago
In fact, the detailed explanation follows that sentence. This is an experimental observation. It is like a 'whack-a-mole' game: it is hard to make BN generate outputs that are large and negative enough to suppress all entries of F^{l-1}. As a result, there is always a possibility that the maximum value of F^l keeps growing. To investigate this hypothesis, we did the experiment in Figure 4.
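For concreteness, below is a minimal, hypothetical PyTorch sketch (not the paper's code) of this observation: with F^l = ReLU(BN(conv(F^{l-1})) + F^{l-1}), the residual branch would need large negative BN outputs at essentially every large entry of F^{l-1} to pull the maximum down, so max(F^l) tends to creep upward with depth.

```python
import torch
import torch.nn as nn

channels, depth = 64, 16
x = torch.relu(torch.randn(1, channels, 32, 32))  # F^0, non-negative after ReLU

def residual_block():
    # conv + BN branch; its output R is added back onto the skip connection
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
    )

with torch.no_grad():
    for l in range(1, depth + 1):
        r = residual_block()(x)   # residual R = BN(conv(F^{l-1})), randomly initialized
        x = torch.relu(r + x)     # F^l = ReLU(R + F^{l-1})
        print(f"layer {l:2d}: max(F^l) = {x.max().item():.3f}")  # track the peak activation
```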
@peiwang062 Hi, excellent work on explaining why ResNet with skip connections is not suitable for style transfer!
I also have a question. Why not replace ReLU with LeakyReLU in the ResNet blocks? Then the network could potentially produce negative outputs, so the large, positive residuals (R) may not be necessary, right?
This is a great question. I guess using LeakyReLU would work as well. We didn't discuss this possibility because our motivation is not to propose a new backbone architecture that works well for stylization, whether by modifying the original ResNet as in the last paragraph of Section 3.4 or by using a different activation function as you suggest. Instead, we aim to keep the standard network and pursue a 'plug and play' way to mitigate the problem. But I like the idea of using LeakyReLU in ResNet if it can match the performance of ReLU on all downstream tasks.
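For reference, here is a hedged sketch of the LeakyReLU variant discussed above, using torchvision's ResNet-50. Note that this modifies the backbone itself, which is the opposite of the plug-and-play route taken in the paper, and the modified network would need (re)training.

```python
import torch.nn as nn
from torchvision.models import resnet50

def relu_to_leaky(module: nn.Module, negative_slope: float = 0.1) -> None:
    """Recursively swap every ReLU for LeakyReLU so residuals can stay negative."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(negative_slope, inplace=True))
        else:
            relu_to_leaky(child, negative_slope)

model = resnet50()   # untrained backbone; weights would need to be (re)learned
relu_to_leaky(model)
```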
@peiwang062 Thanks for your detailed responses! I agree that the pluggable design is much more elegant than modifying the original backbone architecture. One more question: for a long time I have been convinced that the reason ResNet is bad for stylization is that it uses many more BN layers, which push the feature distribution toward a normal distribution, so the Gram matrix encodes fewer style patterns. Moreover, most style transfer models do not use normalization modules in their encoders. SWAG has now given me another perspective. However, in this paper the BN layers do not seem to behave as badly as I thought. I'm wondering: should the normalization layers be kept in the encoder, or could the model perform better by removing them and combining that with SWAG?
In our experiments, we found no big difference for VGG with or without normalization, so we didn't explore this line much. For ResNet, removing normalization would probably hurt its classification performance and make training harder, which also deviates from our motivation. But removing normalization from ResNet might still be helpful, as you say; it's hard to conclude without experiments.
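As a reference point for the Gram-matrix discussion above, here is a minimal, standard sketch of how style statistics are computed from encoder features (following Gatys et al.; assumed, not taken from this repo). If BN squeezes the feature distribution, these entries carry less style information.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W) activations from one encoder layer."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)              # flatten spatial dimensions
    return f @ f.transpose(1, 2) / (c * h * w)  # (B, C, C); this normalization is one common choice

feats = torch.relu(torch.randn(1, 64, 32, 32))  # stand-in for an encoder feature map F^l
print(gram_matrix(feats).shape)                 # torch.Size([1, 64, 64])
```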
Hi Pei,
Very remarkable work.
And one sentence mentioned in the paper is hard for me to understand:
However, it may be impossible to generate a large negative residual for one channel without generating large positive residuals for others
Could you explain more about this?
Best Regards, HONGJIN