Add a 3-stage PP config

Stack from ghstack (oldest at bottom):

-> #345
344
354

Pipelining is unique in that there is no need to stick to power-of-2 numbers of stages, and there may be reasons an odd number is optimal, depending on how you divide up your cluster.

Anyway, I use this to validate the 1F1B schedule in a setup that is slightly more complicated than 2 stages but simpler than 4 stages.
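
For reviewers less familiar with 1F1B, here is a minimal sketch of the per-stage operation order in the textbook (non-interleaved) schedule. It is not torchtitan's or PyTorch's implementation. The function name `one_f_one_b_order` and the `F<i>`/`B<i>` labels are made up for illustration, and the choice of 4 microbatches is an assumption, not something this config sets.

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> list[str]:
    """Return the op order for one stage under a textbook 1F1B schedule.

    Hypothetical helper for illustration only; "F2" means forward of
    microbatch 2 and "B2" means backward of microbatch 2.
    """
    # Warmup: earlier stages run extra forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [f"F{i}" for i in range(warmup)]

    fwd, bwd = warmup, 0
    # Steady state: alternate one forward with one backward.
    for _ in range(num_microbatches - warmup):
        ops.append(f"F{fwd}")
        fwd += 1
        ops.append(f"B{bwd}")
        bwd += 1

    # Cooldown: drain the remaining backwards.
    while bwd < num_microbatches:
        ops.append(f"B{bwd}")
        bwd += 1
    return ops


if __name__ == "__main__":
    # 3 stages, 4 microbatches (assumed values for the example).
    for s in range(3):
        print(s, " ".join(one_f_one_b_order(s, num_stages=3, num_microbatches=4)))
```

With 3 stages the warmup depths are 2, 1, and 0, so every stage has a different warmup length, which exercises more of the schedule than a 2-stage run does while staying easier to read than a 4-stage trace.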

Seems to run fine when run with an even batch size (--training.batch_size 12).
