minimaxir / aitextgen

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
MIT License

gradient_accumulation_steps + warmup_steps completely breaks the model's ability to learn #206

Closed: Vectorrent closed this issue 1 year ago

Vectorrent commented 1 year ago

When training a model with both gradient_accumulation_steps and warmup_steps enabled, the network is unable to learn at all. For example, when using:

```python
num_steps = 1000
gradient_accumulation_steps = 24
warmup_steps = 1
```
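For reference, the full failing run looks roughly like this. It's a sketch assuming the standard aitextgen training API; `input.txt` is a placeholder for whatever plain-text training file you use:

```python
# Repro sketch, assuming the standard aitextgen training API;
# "input.txt" is a placeholder for any plain-text training file.
from aitextgen import aitextgen

ai = aitextgen()  # loads the default 124M GPT-2 model

ai.train(
    "input.txt",
    num_steps=1000,
    gradient_accumulation_steps=24,
    warmup_steps=1,  # changing this to 0 restores normal training
)
```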

With these settings, the model fails to learn a single thing ([screenshot]). However, if you set:

```python
warmup_steps = 0
```

the model behaves quite differently ([screenshot]). I looked at the code, and both parameters appear to be passed straight through to PyTorch Lightning. So, maybe an upstream issue?
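For context, the usual way these two settings get wired together under PyTorch Lightning looks roughly like the sketch below. This is the generic pattern built on transformers' `get_linear_schedule_with_warmup`, not aitextgen's exact code:

```python
# Generic sketch of the common warmup + accumulation wiring under
# PyTorch Lightning; not aitextgen's exact implementation.
import pytorch_lightning as pl
import torch
from transformers import get_linear_schedule_with_warmup

class LM(pl.LightningModule):
    # training_step and dataloaders omitted for brevity
    def __init__(self, model, learning_rate=1e-3, warmup_steps=1, num_steps=1000):
        super().__init__()
        self.model = model
        self.lr = learning_rate
        self.warmup_steps = warmup_steps
        self.num_steps = num_steps

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.lr)
        # Both step counts refer to *optimizer* steps; with
        # accumulate_grad_batches=24 there are 24x fewer of those than
        # there are batches, which is one place a counting mismatch could hide.
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=self.warmup_steps,
            num_training_steps=self.num_steps,
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]

# gradient_accumulation_steps corresponds to the Trainer flag:
# trainer = pl.Trainer(max_steps=1000, accumulate_grad_batches=24)
```

If the scheduler's step count and the optimizer's step count ever get out of sync under accumulation, warmup math like this is an obvious suspect.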

Anyway, logging the issue for posterity.

Vectorrent commented 1 year ago

I think I understand now. The warmup steps are a bit like revving the engine in your car before you drive: they aren't really about training the model yet. During warmup, the learning rate is ramped up from (near) zero toward its target value, so the optimizer eases in instead of starting at full strength. My guess is that gradient accumulation interacts badly with how that schedule counts steps, which could leave the learning rate pinned near zero, but I can't confirm that from the code. It makes enough sense to me, though, so I'll just close the issue for now.
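For reference, the warmup multiplier that transformers' `get_linear_schedule_with_warmup` applies has roughly this shape (a simplified re-implementation, for illustration only):

```python
# Simplified re-implementation of the linear warmup + linear decay
# learning-rate multiplier, for illustration only.
def lr_scale(step: int, warmup_steps: int, total_steps: int) -> float:
    if step < warmup_steps:
        # Warmup phase: ramp the learning rate from 0 up to its target.
        return step / max(1, warmup_steps)
    # Afterwards: decay linearly back toward 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With warmup_steps=1, only the very first optimizer step runs at lr=0;
# every later step is near full strength, so warmup alone shouldn't
# flatten an entire run. The accumulation interaction must matter too.
print([round(lr_scale(s, 1, 1000), 3) for s in range(5)])
# -> [0.0, 1.0, 0.999, 0.998, 0.997]
```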