Pipelining is unique in that there is no need to stick to power-of-2
numbers of stages, and there may be reasons an odd number is optimal
depending on how you divide up your cluster.
Anyway, I use this to validate the 1f1b schedule in a setup that is slightly
more complicated than 2 stages but simpler than 4 stages.
Seems to run fine if run with an even batch size
(--training.batch_size 12).
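For reviewers less familiar with the schedule, here is a minimal sketch of the per-stage op order in a textbook 1f1b schedule with 3 stages. It is illustrative only and is not the code in this PR: the function name `one_f_one_b_order` and the choice of 4 microbatches are assumptions made for the example.

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> list[str]:
    """Return the op order for one stage, e.g. ["F0", "F1", "B0", ...].

    F<i> / B<i> are the forward / backward pass of microbatch i.
    Hypothetical helper for illustration, not a library API.
    """
    # Warmup: earlier stages run extra forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [f"F{i}" for i in range(warmup)]
    fwd, bwd = warmup, 0

    # Steady state: alternate one forward with one backward.
    while fwd < num_microbatches:
        ops.append(f"F{fwd}")
        fwd += 1
        ops.append(f"B{bwd}")
        bwd += 1

    # Cooldown: drain the remaining backwards.
    while bwd < num_microbatches:
        ops.append(f"B{bwd}")
        bwd += 1
    return ops


if __name__ == "__main__":
    for stage in range(3):  # 3 stages, 4 microbatches (assumed values)
        print(stage, " ".join(one_f_one_b_order(stage, 3, 4)))
```

With 3 stages the warmup depths are 2, 1, and 0, so all three stages follow a different pattern, which is what makes this setup a slightly stronger check than the 2-stage case.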
Stack from ghstack (oldest at bottom):
344
354