pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License
1.28k stars, 115 forks

only produce tensorboard logs on rank 0 by default #339

tianyu-l closed 1 month ago

tianyu-l commented 1 month ago

Stack from ghstack (oldest at bottom):

For TensorBoard metrics, we mostly care about loss, memory, and wps/MFU. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0; the straggler effect from TensorBoard writes on that one rank should be small. Users can always toggle on all-rank logging for debugging purposes.
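A minimal sketch of the gating described above, not the code in this PR. The names `MetricLogger`, `build_metric_logger`, `rank_0_only`, and `RecordingWriter` are hypothetical, and the rank is assumed to come from the `RANK` environment variable that `torchrun` sets. The writer is passed in as a factory so the sketch runs without TensorBoard installed; in real use one would pass `torch.utils.tensorboard.SummaryWriter`, which provides `add_scalar` and `close`.

```python
import os


class RecordingWriter:
    """In-memory stand-in for a TensorBoard writer (illustration only)."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))

    def close(self):
        pass


class MetricLogger:
    """Forwards scalars to a writer, or does nothing if no writer is attached."""

    def __init__(self, writer=None):
        self.writer = writer

    def log(self, metrics, step):
        if self.writer is None:
            return  # non-logging rank: skip the write entirely
        for name, value in metrics.items():
            self.writer.add_scalar(name, value, step)

    def close(self):
        if self.writer is not None:
            self.writer.close()


def build_metric_logger(make_writer, log_dir, enable_tb=True,
                        rank_0_only=True, rank=None):
    """Attach a writer only on the ranks that should log (hypothetical helper)."""
    if rank is None:
        # Assumption: launched with torchrun, which exports RANK.
        rank = int(os.environ.get("RANK", "0"))
    if not enable_tb or (rank_0_only and rank != 0):
        return MetricLogger(None)
    if not rank_0_only:
        # All-rank logging: give each rank its own subdirectory.
        log_dir = os.path.join(log_dir, f"rank_{rank}")
    return MetricLogger(make_writer(log_dir))


if __name__ == "__main__":
    logger = build_metric_logger(RecordingWriter, "outputs/tb")
    logger.log({"loss": 2.31, "wps": 1500.0}, step=1)
    logger.close()
```

With `rank_0_only=False`, every rank gets its own writer under a per-rank subdirectory, which corresponds to the all-rank debugging mode mentioned above; the directory layout here is an assumption, not something this PR specifies.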

tianyu-l commented 1 month ago

not sure why the 1D compile test is failing...