universome / class-norm

Class Normalization for Continual Zero-Shot Learning

Why use additional multiplication of np.sqrt(attrs.shape[1]) in attribute normalization? #3


TrynBug commented 3 years ago

Attributes Normalization (AN) in the paper is defined as:

a_c / ||a_c||

But the code uses Attributes Normalization (AN) like this:

a_c / ||a_c|| * sqrt(d_a)

where d_a is the dimensionality of the attribute vector. The code for attribute normalization that I found (in the preprocessing part of the class-norm-for-czsl.ipynb file) is `attrs = attrs / attrs.norm(dim=1, keepdim=True) * np.sqrt(attrs.shape[1])`.
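For reference, here is a minimal, self-contained version of that preprocessing line (the attribute matrix and its shape below are made up purely for illustration; the notebook loads the real one from the dataset):

```python
import numpy as np
import torch

# Hypothetical attribute matrix: 50 classes, 85-dim attribute vectors
attrs = torch.rand(50, 85)

# L2-normalize each class attribute vector, then rescale by sqrt(d_a)
attrs = attrs / attrs.norm(dim=1, keepdim=True) * np.sqrt(attrs.shape[1])

print(attrs.norm(dim=1))  # every row now has norm sqrt(85) ≈ 9.22
```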

I couldn't find anything about this additional multiplication in the paper, and it seems to have a huge influence on performance. Can you tell me why it is used?

universome commented 3 years ago

Hi! We added this to avoid possible optimization issues that might arise from dividing the attributes by too large a number, which would squash their values: neural networks like their inputs to come from N(0, I), and doing `a_c / ||a_c|| * sqrt(d_a)` is conceptually similar to standardization in this case. Attribute normalization on its own is not supposed to work for deep models; we left it there to be more consistent with the linear case.
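To illustrate the scale argument (a rough sketch, not code from the repo): after `a_c / ||a_c|| * sqrt(d_a)`, the mean squared entry of each attribute vector is exactly 1, so the inputs live on a scale comparable to samples from N(0, I), whereas plain L2 normalization leaves the entries at a scale of about 1/d_a:

```python
import torch

d_a = 85                     # hypothetical attribute dimensionality
a = torch.rand(50, d_a)      # hypothetical attribute matrix

a_unit = a / a.norm(dim=1, keepdim=True)  # plain L2 normalization
a_scaled = a_unit * d_a ** 0.5            # with the extra sqrt(d_a) factor

print(a_unit.pow(2).mean(dim=1))    # ~1/85 per entry: values get squashed
print(a_scaled.pow(2).mean(dim=1))  # exactly 1: ||a_scaled||^2 / d_a == 1
```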

I believe that (for deep models) it is possible to replace all this with simple standardization, i.e. doing `(a_c - mean(a_c)) / std(a_c)`, but I just tried this (without any other changes) and it worked poorly, so I suspect the optimization hyperparams (learning rate, weight decay, scheduler, etc.) need to be adjusted somehow. There might also be some other explanation, but I will need to think about it.
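In case it helps, this is how I read that standardization variant, per class over the attribute dimensions (a sketch only; as said above, plugging it in without retuning the hyperparams worked poorly):

```python
import torch

attrs = torch.rand(50, 85)  # hypothetical class-attribute matrix

# Standardize each class attribute vector over its dimensions:
# zero mean and unit std per row, instead of the L2-norm rescaling.
attrs_std = (attrs - attrs.mean(dim=1, keepdim=True)) / attrs.std(dim=1, keepdim=True)

print(attrs_std.mean(dim=1))  # ~0 for every class
print(attrs_std.std(dim=1))   # 1 for every class
```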

P.S. Sorry for my late reply