Closed tianyu-l closed 3 months ago
hmm. Are you saying it's dumping the .pickle every iteration, or just recording more allocations/frees? pickle.dump is costly, but recording is cheap based on my previous observations in xlformers.
@weifengpy hmm sorry I think I'm wrong here.
I thought the dumping was slow because we record every iteration, so the dumping time would be proportional to the number of train steps. But I just tested with 5/10/20 steps, and the dump always costs ~20 seconds. Is it true that no matter how large the dump file is, the time per dump is invariant?
Either way, I think 20 seconds is still a bit too long for the debug model, because sometimes people want to iterate quickly on it to experiment with changes.
I am fine with disabling it by default.
The CUDA snapshot only maintains the last N memory allocation/free events, so the dump at step 5 is the same as at step 10. Currently N is small and can only keep track of 2 or 3 steps.
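For context, the bounded-history behavior described above can be sketched with PyTorch's snapshot API (this is a minimal illustration, not the torchtitan code; the `max_entries` value is an arbitrary example):

```python
# Sketch of the PyTorch CUDA memory-snapshot API discussed in this thread
# (torch.cuda.memory._record_memory_history / _dump_snapshot).
# The recorder keeps at most `max_entries` allocation/free events, so a
# snapshot dumped at step 10 can look identical to one dumped at step 5
# once older events have been evicted; the dump time is then roughly
# bounded by `max_entries`, not by the number of training steps.
try:
    import torch
except ImportError:
    torch = None

if torch is None:
    status = "no-torch"
elif not torch.cuda.is_available():
    status = "no-cuda"
else:
    # Start recording allocation/free events into a bounded history.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    x = torch.randn(1024, 1024, device="cuda")  # recorded allocation
    del x                                       # recorded free
    # Serializing the whole buffer to .pickle is the slow part
    # (~20 s in the observations above).
    torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
    status = "dumped"

print(status)
```

The sketch degrades gracefully when PyTorch or a GPU is unavailable, so it can be run anywhere to see which path applies.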
Stack from ghstack (oldest at bottom):
Currently the memory profiler profiles and dumps memory snapshots on every single iteration (for details see https://github.com/pytorch/torchtitan/pull/395#pullrequestreview-2121075514). Even for `debug_model`, this costs around 20 seconds. Let's disable it by default, until per-`profile_freq` profiling and dumping is enabled (tracked in #422).
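The gating proposed here (off by default, and when on, dump only every `profile_freq` steps rather than every iteration) could look roughly like this; `should_dump_snapshot` is a hypothetical helper, not an existing torchtitan function:

```python
def should_dump_snapshot(step: int, profile_freq: int, enabled: bool = False) -> bool:
    """Decide whether to dump a memory snapshot at this train step.

    Hypothetical sketch of the per-profile_freq gating tracked in #422:
    disabled by default, and when enabled, dumps only every
    `profile_freq`-th step instead of every iteration.
    """
    return enabled and step % profile_freq == 0

# Disabled by default: no step triggers the ~20 s dump.
assert not any(should_dump_snapshot(s, 10) for s in range(1, 21))
# When enabled, only every profile_freq-th step dumps.
assert [s for s in range(1, 21) if should_dump_snapshot(s, 10, enabled=True)] == [10, 20]
```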