Closed wconstab closed 3 months ago
Stack from ghstack (oldest at bottom):
* #411
* #358
* #406
This is useful for pipeline parallelism (PP), where more layers mean more possible schedules and `num_stages` values, even when we don't care about having a large model in terms of parameter count.
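
As a rough illustration of why layer count matters for PP, here is a minimal sketch that lists the stage counts a given number of layers can be split into. It assumes every stage holds the same number of layers, which is a simplification and not necessarily how the actual splitting works; `even_stage_counts` is a hypothetical helper, not part of this PR.

```python
def even_stage_counts(n_layers: int) -> list[int]:
    """Return every stage count that splits n_layers into equal-sized stages.

    Assumption for illustration only: stages must hold equal numbers of layers.
    """
    return [s for s in range(1, n_layers + 1) if n_layers % s == 0]


if __name__ == "__main__":
    # A tiny model with more layers admits more candidate stage counts.
    for n_layers in (2, 8):
        print(n_layers, even_stage_counts(n_layers))
```

Under that assumption, a 2-layer model can only be run with 1 or 2 stages, while an 8-layer model of the same width also allows 4 and 8.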