For tensorboard metrics, we mostly care about loss, memory, wps/mfu. Loss is all-reduced so will be the same on all ranks; other metrics are likely to be very similar among all ranks. So by default it suffices to only do tb logging on rank 0 -- the straggler effect should be small for tb writes. User could always toggle on all-rank logging for debugging purposes.
Stack from ghstack (oldest at bottom):
For tensorboard metrics, we mostly care about loss, memory, wps/mfu. Loss is all-reduced so will be the same on all ranks; other metrics are likely to be very similar among all ranks. So by default it suffices to only do tb logging on rank 0 -- the straggler effect should be small for tb writes. User could always toggle on all-rank logging for debugging purposes.