universome / class-norm

Class Normalization for Continual Zero-Shot Learning

Normalization usage? #4

Open JHLew opened 2 years ago

JHLew commented 2 years ago

Hi, thank you for the awesome work. I have a question about how class normalization is used. According to the 'class-norm-for-czsl.ipynb' file in this repo, ClassNorm (CN) seems to be applied in the following form:

FC - CN - ReLU - CN - FC - ReLU.

But this seems a little odd to me, since layers are usually stacked in the form:

FC - Normalization - ReLU - FC - Normalization - ReLU.

The current form has an activation layer between two Class-Norm layers, without any kind of Conv / FC layer in between. Is this intended? I have gone through the paper but could not find the answer, possibly due to my own misunderstanding. Could you kindly clarify this?
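For concreteness, here is a minimal PyTorch sketch of the two orderings I am comparing. The `ClassNorm` module below is only a hypothetical stand-in that standardizes each feature vector, and the layer sizes are arbitrary; it is not the actual implementation from the notebook.

```python
import torch
import torch.nn as nn

class ClassNorm(nn.Module):
    """Hypothetical stand-in: standardize each sample's feature vector."""
    def forward(self, x):
        # Zero mean, unit variance along the feature dimension.
        return (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-5)

d_in, d_hid, d_out = 2048, 1024, 512  # arbitrary example sizes

# Ordering observed in the notebook: FC - CN - ReLU - CN - FC - ReLU
head_as_in_notebook = nn.Sequential(
    nn.Linear(d_in, d_hid), ClassNorm(), nn.ReLU(), ClassNorm(),
    nn.Linear(d_hid, d_out), nn.ReLU(),
)

# Conventional ordering: FC - Norm - ReLU - FC - Norm - ReLU
head_conventional = nn.Sequential(
    nn.Linear(d_in, d_hid), ClassNorm(), nn.ReLU(),
    nn.Linear(d_hid, d_out), ClassNorm(), nn.ReLU(),
)

x = torch.randn(8, d_in)
print(head_as_in_notebook(x).shape, head_conventional(x).shape)
```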

universome commented 2 years ago

Hi, thank you!

That's a very good question, and to be honest, I do not remember exactly what our justification was for doing it this way. According to the theoretical exposition in the paper, we only need the standardization placed like this:

FC->Relu->Norm->FC->Relu

I suppose that at some point we decided to additionally put it in some other layers to be closer in spirit to batch normalization, and the only reasonable place we found to add it was

FC->Norm->Relu->Norm->FC->Relu

To answer your question: it is more of a coincidence that it is positioned this way (I agree that it looks strange). Interestingly, I just tried repositioning these normalization layers "normally" in a couple of ways and found that performance becomes very bad. I think this is due to the hyperparameters, but I will need time to dig deeper.
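For reference, a minimal sketch of the two placements discussed above, using `nn.LayerNorm` as a stand-in for the actual class normalization layer and arbitrary layer sizes (this is not the code from the notebook):

```python
import torch.nn as nn

d_in, d_hid, d_out = 2048, 1024, 512  # arbitrary example sizes

# Placement that the theoretical argument calls for: FC -> Relu -> Norm -> FC -> Relu
head_theory = nn.Sequential(
    nn.Linear(d_in, d_hid), nn.ReLU(), nn.LayerNorm(d_hid),
    nn.Linear(d_hid, d_out), nn.ReLU(),
)

# Placement used in the notebook: FC -> Norm -> Relu -> Norm -> FC -> Relu
head_used = nn.Sequential(
    nn.Linear(d_in, d_hid), nn.LayerNorm(d_hid), nn.ReLU(), nn.LayerNorm(d_hid),
    nn.Linear(d_hid, d_out), nn.ReLU(),
)
```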

JHLew commented 2 years ago

Thank you for the explanation.

Some additional questions:

  1. Were the results reported in the paper obtained with the form FC->Norm->Relu->Norm->FC->Relu?

  2. I understand that, theoretically, normalization needs to come before the FC layers (FC->Relu->Norm->FC->Relu). But when I first read the paper, I expected it to be of the form Norm->FC->Relu->FC->Relu. In short, I expected ClassNorm to sit at the very beginning of the non-linear head, applied to the inputs before they go through the non-linear layers, but it seems that ClassNorm is applied in the middle of the non-linear layers.

I wanted to double-check whether I understood this wrong, and whether it is theoretically correct for the normalization to sit in the middle of the layers.