Hi @clarkkev ,
I am confused by the implementation of layerwise learning rate decay. It seems the depths range over `[0, 1, ..., n_layers-1, n_layers, n_layers+2]`. Why is the depth of the task-specific layer set to `n_layers+2` instead of `n_layers+1`? Is there a specific reason for this?

https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/model/optimization.py#L181-L193
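To make the question concrete, here is a minimal sketch of the scheme as I understand it (the function name `layerwise_lrs` and the sample hyperparameters are my own, not from the repo): each depth `d` gets `base_lr * decay ** (max_depth - d)`, so placing the task layer at depth `n_layers + 2` leaves a two-step decay gap between it and the top transformer layer at depth `n_layers`, rather than the single step I would have expected.

```python
def layerwise_lrs(base_lr, decay, n_layers):
    # Depths as in the linked code: embeddings at 0, transformer
    # layers at 1..n_layers, task-specific layer at n_layers + 2
    # (note the skipped n_layers + 1).
    depths = list(range(n_layers + 1)) + [n_layers + 2]
    max_depth = max(depths)
    # Deeper variables get larger learning rates; the task layer
    # (at max_depth) gets the full base_lr.
    return {d: base_lr * decay ** (max_depth - d) for d in depths}

lrs = layerwise_lrs(base_lr=1e-4, decay=0.8, n_layers=4)
# Top transformer layer (depth 4) ends up two decay steps below the
# task layer (depth 6): lrs[4] == base_lr * decay**2.
```

With `n_layers + 1` instead, the top transformer layer would sit only one decay step below the task layer, which is the behavior I expected.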
Cheers