huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to fine-tune xlnet on SQuAD with the parameter setting provided in the paper? #1198

Closed: mralexis1 closed this issue 4 years ago

mralexis1 commented 5 years ago

From here (page 16 of the XLNet paper), it seems the layer-wise learning rate decay should be set to 0.75. However, I couldn't find a way to do this in run_squad.py. Could someone provide a sample command line that runs this fine-tuning task with the parameters given in the paper?

Thanks!

hlums commented 5 years ago

Here is my attempt at layer-wise lr decay. It didn't help with model performance, though. Fixing the preprocessing code helped a lot, but my results are still a few points lower than those reported in the paper and lower than the BERT large WWM model. See my comment in #947.

# Layer-wise learning rate decay for XLNet-large (24 transformer layers):
# the top layer keeps the base learning rate and each layer below it is
# scaled down by another factor of lr_layer_decay.
# `model` and `learning_rate` come from the usual run_squad.py setup.
from transformers import AdamW  # or torch.optim.AdamW

lr_layer_decay = 0.75
n_layers = 24
no_lr_layer_decay_group = []  # embeddings, task head, and other non-layer parameters
lr_layer_decay_groups = {k: [] for k in range(n_layers)}
for n, p in model.named_parameters():
    name_split = n.split(".")
    # XLNet parameters are named e.g. "transformer.layer.3.rel_attn.q",
    # so name_split[2] is the layer index.
    if name_split[1] == "layer":
        lr_layer_decay_groups[int(name_split[2])].append(p)
    else:
        no_lr_layer_decay_group.append(p)

# Non-layer parameters use the base learning rate.
optimizer_grouped_parameters = [{"params": no_lr_layer_decay_group, "lr": learning_rate}]
# Layer i gets learning_rate * lr_layer_decay ** (n_layers - i - 1),
# so the decay is strongest for the lowest layers.
for i in range(n_layers):
    parameters_group = {
        "params": lr_layer_decay_groups[i],
        "lr": learning_rate * (lr_layer_decay ** (n_layers - i - 1)),
    }
    optimizer_grouped_parameters.append(parameters_group)

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=1e-6)
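
For context, here is a minimal sketch of how the grouped optimizer above could be wired into the rest of a run_squad.py-style training setup. The names warmup_steps, t_total, and the use of get_linear_schedule_with_warmup are assumptions about your script/transformers version, not part of the original snippet.

# Minimal sketch (assumptions noted above): attach a linear warmup schedule and
# sanity-check the per-group learning rates before training.
from transformers import get_linear_schedule_with_warmup

# warmup_steps and t_total are assumed to come from the script's argument parsing.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
)

# Print the effective learning rate of each parameter group to verify that
# lower layers get smaller learning rates (group 0 is the non-layer group).
for idx, group in enumerate(optimizer.param_groups):
    print(f"group {idx}: lr = {group['lr']:.2e}, n_params = {len(group['params'])}")

Checking the printed learning rates is a quick way to confirm the grouping logic picked up all 24 layers before launching a full fine-tuning run.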
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.