Hi @clarkkev ,
I am confused by the implementation of layerwise learning rate decay. It seems the depths range over `[0, 1, ..., n_layers-1, n_layers, n_layers+2]`. Why is the depth of the task-specific layer set to `n_layers+2` instead of `n_layers+1`? Is there a specific reason for this?

https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/model/optimization.py#L181-L193
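To make the question concrete, here is a minimal sketch of the scheme as I understand it (the function name `layerwise_lrs` and the sample hyperparameters are my own, not from the repo): each depth `d` gets `base_lr * decay ** (max_depth - d)`, so placing the task layer at depth `n_layers + 2` leaves a two-step decay gap between it and the top transformer layer at depth `n_layers`, rather than the single step I would have expected.

```python
def layerwise_lrs(base_lr, decay, n_layers):
    # Depths as in the linked code: embeddings at 0, transformer
    # layers at 1..n_layers, task-specific layer at n_layers + 2
    # (note the skipped n_layers + 1).
    depths = list(range(n_layers + 1)) + [n_layers + 2]
    max_depth = max(depths)
    # Deeper variables get larger learning rates; the task layer
    # (at max_depth) gets the full base_lr.
    return {d: base_lr * decay ** (max_depth - d) for d in depths}

lrs = layerwise_lrs(base_lr=1e-4, decay=0.8, n_layers=4)
# Top transformer layer (depth 4) ends up two decay steps below the
# task layer (depth 6): lrs[4] == base_lr * decay**2.
```

With `n_layers + 1` instead, the top transformer layer would sit only one decay step below the task layer, which is the behavior I expected.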
Cheers