pytorch / torchtitan

A native PyTorch Library for large model training

Add a 3-stage PP config #345

Closed (wconstab closed this 2 weeks ago)

wconstab commented 1 month ago

Stack from ghstack (oldest at bottom):

Pipelining is unique in that there is no need to stick to power-of-2 numbers of stages, and there may be reasons an odd number is optimal depending on how you divide up your cluster.

Anyway, I use this for validation of the 1F1B schedule in a setup that is slightly more complicated than 2 stages but still simpler than 4 stages.

Seems to run fine if run with an even batch size (--training.batch_size 12).
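
For context, here is a minimal sketch (not the code in this PR) of what a 3-stage 1F1B run looks like with `torch.distributed.pipelining`. The toy submodule, tensor shapes, and microbatch count are placeholder assumptions, and the exact `PipelineStage` arguments have shifted between PyTorch releases; torchtitan builds its stages from the real model and its job config instead.

```python
# Minimal 3-stage 1F1B sketch; assumes PyTorch with torch.distributed.pipelining
# and a launch of exactly 3 ranks (e.g. via torchrun --nproc_per_node=3).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import PipelineStage, Schedule1F1B

dist.init_process_group(backend="nccl")
rank, num_stages = dist.get_rank(), 3
device = torch.device("cuda", rank)

# Hypothetical per-rank submodule standing in for one third of the model.
stage_mod = nn.Linear(1024, 1024).to(device)
stage = PipelineStage(stage_mod, stage_index=rank, num_stages=num_stages, device=device)

# The global batch (12) must split evenly into microbatches,
# which is why an even batch size is needed above.
schedule = Schedule1F1B(stage, n_microbatches=6, loss_fn=nn.MSELoss())

x = torch.randn(12, 1024, device=device)
y = torch.randn(12, 1024, device=device)
if rank == 0:
    schedule.step(x)            # first stage feeds the inputs
elif rank == num_stages - 1:
    schedule.step(target=y)     # last stage computes the loss
else:
    schedule.step()             # middle stage just relays activations
```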