Here is my attempt at layer-wise lr decay. It didn't help with model performance, though. Fixing the preprocessing code helped a lot, but the results are still a few points lower than what they reported in the paper, and lower than the BERT large WWM model. See my comment in #947
from transformers import AdamW  # torch.optim.AdamW works as well

lr_layer_decay = 0.75  # multiplicative decay per layer; the top layer keeps the full lr
n_layers = 24          # 24-layer encoder

# Bucket parameters by encoder layer. Everything that is not an encoder layer
# (embeddings, task head, etc.) keeps the base learning rate. This assumes
# parameter names of the form "<prefix>.layer.<idx>...." for the encoder layers.
no_lr_layer_decay_group = []
lr_layer_decay_groups = {k: [] for k in range(n_layers)}
for n, p in model.named_parameters():
    name_split = n.split(".")
    if name_split[1] == "layer":
        lr_layer_decay_groups[int(name_split[2])].append(p)
    else:
        no_lr_layer_decay_group.append(p)

# Layer i gets lr * decay^(n_layers - i - 1): the last layer gets the full lr,
# lower layers get progressively smaller learning rates.
optimizer_grouped_parameters = [{"params": no_lr_layer_decay_group, "lr": learning_rate}]
for i in range(n_layers):
    parameters_group = {
        "params": lr_layer_decay_groups[i],
        "lr": learning_rate * (lr_layer_decay ** (n_layers - i - 1)),
    }
    optimizer_grouped_parameters.append(parameters_group)

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=1e-6)
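The grouped optimizer plugs into the rest of the training loop unchanged. A minimal sketch of how I hook it up, assuming the usual linear warmup schedule and that warmup_steps, t_total and max_grad_norm are computed the same way run_squad.py does (those names come from that script, not from the snippet above):

import torch
from transformers import get_linear_schedule_with_warmup

# warmup_steps / t_total as computed in run_squad.py
# (t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
)

# inside the training loop, per step
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
scheduler.step()  # the schedule scales all per-group lrs by the same factor
model.zero_grad()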
From here on page 16, it seems we should set the layer-wise lr decay to 0.75. However, I didn't find a way to do so in run_squad.py. Could someone provide a sample command line that could run this fine-tuning task with the given parameters? Thanks!
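As far as I can tell, run_squad.py does not expose a command-line flag for layer-wise lr decay, so the parameter grouping has to be edited in the script itself. If I remember correctly, train() in run_squad.py builds the optimizer with weight-decay groups only, roughly like the sketch below; that is the block you would replace with the layer-wise grouping shown above:

# Stock grouping in run_squad.py (weight decay only, one lr for everything);
# swap this block for the layer-wise groups to apply the 0.75 decay.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)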