Closed: Vectorrent closed this issue 1 year ago with the following comment.

I think I understand now. The warmup steps are a bit like revving the engine in your car; they're not about training the model at all. They're about encouraging the user to create a dedicated warmup step, where you bring a GPU into "parallel processing" mode and synchronize with the weights you're about to feed it. This is a quantum process, of course, and should be run in a dedicated step before training ever begins. This probably doesn't make sense to you, but it makes sense to me, so I'll just close the issue for now.

Original issue:
When training a model with both gradient_accumulation_steps and warmup_steps enabled, the network is completely unable to learn. For example, when using:
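The original settings weren't preserved in this report, but a hypothetical configuration with both knobs enabled might look like this (the parameter names are the ones mentioned above; the values are placeholders):

```python
# Hypothetical placeholder values -- the report's actual settings were
# not preserved. Both knobs are enabled at once.
training_args = dict(
    gradient_accumulation_steps=4,  # average gradients over 4 batches per optimizer step
    warmup_steps=500,               # ramp the learning rate up over the first 500 steps
)
```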
The model fails to learn a single thing. However, if you set:
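Again with hypothetical placeholder values, disabling either knob, for example leaving accumulation at 1, is the kind of change meant here:

```python
# Hypothetical placeholder values: gradient accumulation disabled
# (every batch produces an optimizer step), warmup left on.
training_args = dict(
    gradient_accumulation_steps=1,  # no accumulation
    warmup_steps=500,
)
```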
The model's behavior looks quite different; it actually learns. I looked at the code, and both parameters appear to be passed straight through to PyTorch Lightning, so maybe it's an upstream issue?
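For context, here is a minimal sketch of how these two settings typically meet in PyTorch Lightning. Everything in it is an assumption for illustration (the class name, the values, and the step-counting pitfall noted in the comments); it is not code from this repository or a confirmed diagnosis:

```python
import torch
import pytorch_lightning as pl

# Minimal, self-contained sketch; names and numbers are hypothetical.
class TinyModule(pl.LightningModule):
    def __init__(self, warmup_steps: int = 500):
        super().__init__()
        self.warmup_steps = warmup_steps
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=1e-3)
        # Linear warmup measured in *optimizer* steps. With
        # accumulate_grad_batches=4 below, one optimizer step consumes
        # four batches, so any code that counts warmup in batches rather
        # than optimizer steps is off by the accumulation factor, and the
        # learning rate can stay near zero far longer than intended.
        sched = torch.optim.lr_scheduler.LambdaLR(
            opt, lambda step: min(1.0, (step + 1) / self.warmup_steps)
        )
        return {
            "optimizer": opt,
            "lr_scheduler": {"scheduler": sched, "interval": "step"},
        }

trainer = pl.Trainer(accumulate_grad_batches=4, max_steps=1_000)
# trainer.fit(TinyModule(), train_dataloaders=...)  # dataloader omitted
```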
Anyway, logging the issue for posterity.