pytorch / torchtitan

A native PyTorch library for large model training
BSD 3-Clause "New" or "Revised" License

by default disable heavy memory profiling #430

Closed tianyu-l closed 3 months ago

tianyu-l commented 3 months ago

Stack from ghstack (oldest at bottom):

Currently the memory profiler records and dumps memory snapshots on every single iteration (for details see https://github.com/pytorch/torchtitan/pull/395#pullrequestreview-2121075514). Even for debug_model, this costs around 20 seconds. Let's disable it by default until per-prof_freq profiling and dumping is implemented (tracked in #422).
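For context, snapshotting is driven by `torch.cuda.memory._record_memory_history` (cheap, in-memory recording) and `torch.cuda.memory._dump_snapshot` (the expensive pickle write). A minimal sketch of gating both behind an opt-in flag; `enable_memory_snapshot` here is a hypothetical config option, not torchtitan's actual plumbing:

```python
# Hedged sketch (not torchtitan's actual code): gate the heavy snapshot
# machinery behind an opt-in flag so the default path pays no cost.
# `enable_memory_snapshot` is a hypothetical config option.
import torch

def maybe_record_memory_history(enable_memory_snapshot: bool, max_entries: int = 100_000) -> None:
    """Start recording allocator alloc/free events (cheap) only when asked."""
    if enable_memory_snapshot and torch.cuda.is_available():
        torch.cuda.memory._record_memory_history(max_entries=max_entries)

def maybe_dump_snapshot(enable_memory_snapshot: bool, path: str = "memory_snapshot.pickle") -> None:
    """Pickle the recorded history to disk -- this is the slow part."""
    if enable_memory_snapshot and torch.cuda.is_available():
        torch.cuda.memory._dump_snapshot(path)
```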

weifengpy commented 3 months ago

hmm, are you saying it's dumping a .pickle every iteration, or just recording more allocations/frees? pickle.dump is costly, but recording is cheap based on my previous observation in xlformers.

tianyu-l commented 3 months ago

> hmm, are you saying it's dumping a .pickle every iteration, or just recording more allocations/frees? pickle.dump is costly, but recording is cheap based on my previous observation in xlformers.

@weifengpy hmm sorry I think I'm wrong here.

I thought the dumping was slow because we record every iteration, so the dump time would be proportional to the number of train steps. But I just tested with 5/10/20 steps, and the dump always costs ~20 seconds. Is it true that the time per dump is invariant, no matter how large the dump file is?

Either way, I think 20 seconds is still a bit too long for the debug model, because sometimes people would like to iterate fast on it to experiment with changes.
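A minimal sketch of this timing experiment; the toy model and sizes are illustrative, not the torchtitan debug_model setup:

```python
# Hedged sketch of the timing experiment above: record allocator history,
# run N training steps, then time the snapshot dump.
import time
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(4096, 4096).cuda()
for _ in range(20):  # try 5 / 10 / 20 -- dump time stays roughly constant
    out = model(torch.randn(64, 4096, device="cuda"))
    out.sum().backward()

start = time.perf_counter()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
print(f"dump took {time.perf_counter() - start:.1f}s")
```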

weifengpy commented 3 months ago

> > hmm, are you saying it's dumping a .pickle every iteration, or just recording more allocations/frees? pickle.dump is costly, but recording is cheap based on my previous observation in xlformers.
>
> @weifengpy hmm sorry I think I'm wrong here.
>
> I thought the dumping was slow because we record every iteration, so the dump time would be proportional to the number of train steps. But I just tested with 5/10/20 steps, and the dump always costs ~20 seconds. Is it true that the time per dump is invariant, no matter how large the dump file is?
>
> Either way, I think 20 seconds is still a bit too long for the debug model, because sometimes people would like to iterate fast on it to experiment with changes.

I am fine with disabling it by default.

The CUDA snapshot only maintains the last N memory allocation/free events, so the dump at step 5 is the same as at step 10. Currently N is small and only keeps track of 2 or 3 steps' worth of history.
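That bound is the `max_entries` argument to `torch.cuda.memory._record_memory_history`; a small sketch (sizes illustrative) showing that the recorded trace is capped regardless of how many steps run:

```python
# Hedged sketch: the recorder keeps only the most recent `max_entries`
# alloc/free events, so a snapshot taken after 5 steps covers roughly
# the same trailing window as one taken after 10.
import torch

torch.cuda.memory._record_memory_history(max_entries=10_000)

for _ in range(10):
    a = torch.randn(1024, 1024, device="cuda")
    b = a @ a  # each step generates allocator events

snapshot = torch.cuda.memory._snapshot()
# device_traces is a per-device list of TraceEntry events, capped at max_entries.
print(len(snapshot["device_traces"][0]))
```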