google-research / big_transfer

Official repository for the "Big Transfer (BiT): General Visual Representation Learning" paper.
https://arxiv.org/abs/1912.11370
Apache License 2.0

What is the purpose of the zero_head parameter? #10

Open issamemari opened 4 years ago

issamemari commented 4 years ago

https://github.com/google-research/big_transfer/blob/6c83d6459e009fa89d84c1e904611e9b162e6eff/bit_pytorch/models.py#L165

Hi there! I'm wondering what the purpose of this zero_head parameter is. It seems to me that if it is set to True, the weights of the head are initialized to zero, which causes the network to always output zeros for any input and renders any further fine-tuning of the model useless.

Should this be replaced with random initialization? Or maybe removed altogether, which would let PyTorch take care of initializing the head?
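For context, a minimal sketch of what the flag appears to do (a simplified stand-in, not the actual code in models.py; the `Head` class and channel sizes here are hypothetical):

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Hypothetical classifier head illustrating the zero_head idea:
    the final 1x1 conv is filled with zeros instead of the default init."""

    def __init__(self, in_ch, num_classes, zero_head=False):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes, kernel_size=1, bias=True)
        if zero_head:
            nn.init.zeros_(self.conv.weight)
            nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

head = Head(2048, 10, zero_head=True)
x = torch.randn(1, 2048, 1, 1)
print(head(x).abs().sum().item())  # 0.0: all logits are zero before training
```

With all-zero weights and bias, every input maps to all-zero logits, i.e. a uniform softmax, which is what prompted the question above.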

ademyanchuk commented 4 years ago

Hi. I was curious about it as well. I tried both approaches: zero init of the head convolutional layer, and letting torch init it as usual. I reproduced the few-shot example from the colab notebook. Both ways seem to learn reasonable weights and biases, but the PyTorch default init didn't reach a test result in the 78%-85% range, only ~73%.

So my guess is that zero initialization is some sort of heuristic. It might be easier (especially for few-shot training) to learn weights for the specific class features starting from zero than to unlearn randomly initialized weights and push them back toward zero.

Still, I'm really interested in how the authors explain this. Thanks for your work!

lucasb-eyer commented 4 years ago

@ademyanchuk is correct: when doing any kind of training (such as fine-tuning), initializing the head to zero is common practice and stabilizes training. The OP's claim that it "renders any further fine tuning of the model useless" is just wrong.

The only reason not to initialize it to zero is if you want to use our original pre-trained head, for example if you are interested in the ImageNet-21k class-space. Please see the colabs for examples of this.
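To see why a zero-initialized head is still trainable (a small sketch with a made-up linear head, not the repo's code): the gradient of the loss with respect to the head weights depends on the backbone features, which are nonzero, so the very first backward pass already moves the weights off zero.

```python
import torch
import torch.nn as nn

# Zero-initialized head, as zero_head=True would produce.
head = nn.Linear(16, 3)
nn.init.zeros_(head.weight)
nn.init.zeros_(head.bias)

features = torch.randn(4, 16)            # stand-in for backbone features
labels = torch.tensor([0, 1, 2, 0])

# All-zero logits give a uniform softmax, but the loss gradient
# w.r.t. the weights is (softmax - one_hot)^T @ features, which is nonzero.
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()
print(head.weight.grad.abs().sum() > 0)  # gradients flow despite zero init

torch.optim.SGD(head.parameters(), lr=0.1).step()
print(head.weight.abs().sum() > 0)       # weights have left zero after one step
```

So the zeros only describe the starting point, not a fixed point, of optimization; the network stops outputting uniform predictions after the first update.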

abhiagwl4262 commented 3 years ago

@lucasb-eyer So what do you think: in which cases should I not initialize the head with zeros? Can you think of a problem statement where zero initialization would be the wrong choice?