Closed by xrsrke 7 months ago
Suppose you start training with a pipeline parallel size of 4. We need to support resuming training with a different pipeline parallel size, such as 2, by merging the optimizer states.
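To make the request concrete, here is a minimal sketch of what merging could look like for the 4 -> 2 case. It is not the project's implementation. It assumes that each pipeline stage's optimizer state is a plain dict keyed by globally unique parameter names, that layers are assigned to stages contiguously, and that the old pipeline size is divisible by the new one. The function name `merge_pp_optimizer_states` and the `StageState` alias are hypothetical, and tensor/data parallel sharding is ignored.

```python
from typing import Any, Dict, List

# Hypothetical layout: parameter name -> per-parameter optimizer state
# (for Adam-style optimizers, e.g. "step", "exp_avg", "exp_avg_sq").
StageState = Dict[str, Dict[str, Any]]


def merge_pp_optimizer_states(
    old_stages: List[StageState], new_pp_size: int
) -> List[StageState]:
    """Merge per-stage optimizer states into fewer pipeline stages.

    Assumes contiguous old stages fold into one new stage, so with
    old size 4 and new size 2, stages (0, 1) -> 0 and (2, 3) -> 1.
    """
    old_pp_size = len(old_stages)
    if new_pp_size <= 0 or old_pp_size % new_pp_size != 0:
        raise ValueError(
            f"cannot merge {old_pp_size} stages into {new_pp_size}: "
            "old size must be a positive multiple of the new size"
        )

    stages_per_group = old_pp_size // new_pp_size
    merged: List[StageState] = []
    for new_rank in range(new_pp_size):
        state: StageState = {}
        start = new_rank * stages_per_group
        for old_rank in range(start, start + stages_per_group):
            for name, param_state in old_stages[old_rank].items():
                # Names are assumed globally unique; a clash means the
                # checkpoint does not match this sketch's layout.
                if name in state:
                    raise ValueError(f"duplicate parameter {name!r}")
                state[name] = param_state
        merged.append(state)
    return merged


# Usage: four stages with one layer each, resumed with two stages.
old = [
    {f"layers.{i}.weight": {"step": 100, "exp_avg": [float(i)], "exp_avg_sq": [0.0]}}
    for i in range(4)
]
new = merge_pp_optimizer_states(old, new_pp_size=2)
```

If the checkpoint keys optimizer state by per-stage integer indices instead of names, the indices would also need to be remapped before merging, which this sketch does not cover.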