pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License
2.25k stars, 165 forks

Adding FSDP Memory Tracking and Estimation #425

Closed: sanketpurandare closed this 3 months ago

sanketpurandare commented 3 months ago

Stack from ghstack (oldest at bottom):

Command (disabling fake_mode_only performs an actual single-GPU run with a fake process group):

./run_memory_estimation.sh --memory_estimation.enabled --memory_estimation.disable_fake_mode

Output:

[Screenshot 2024-06-21 at 6:03:58 PM]

sanketpurandare commented 3 months ago

Previous conversation:

The change itself seems good to me. I wonder what the approach will be in the future if train.py continues to change though.

  1. I think the best way would be to incorporate this into train.py directly and use the estimation config options to enable or disable the relevant parts of the main workflow. That way we don't have to maintain two copies.
  2. On another note, @gnadathur suggested, and it seems pretty reasonable, that estimate.py should evolve into an option that auto-configures the run and outputs a configuration to use.
  3. For now I have replicated some things because I haven't received user feedback on what they want. @lessw2020 is going to advertise this tool to partner teams, who may give us feedback on how they want to use it.

I am open to other suggestions as well.
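
To make option 1 concrete, here is a minimal sketch of a single entry point whose steps are gated by the estimation options. It is illustrative only: `MemoryEstimationConfig`, `plan_steps`, and every step name are hypothetical and are not taken from train.py or estimate.py. The two fields simply mirror the `--memory_estimation.enabled` and `--memory_estimation.disable_fake_mode` flags in the command above, and the fake process group is assumed to be used for both estimation variants.

```python
from dataclasses import dataclass


@dataclass
class MemoryEstimationConfig:
    # Hypothetical mirror of the --memory_estimation.* flags shown above.
    enabled: bool = False
    disable_fake_mode: bool = False


def plan_steps(cfg: MemoryEstimationConfig) -> list[str]:
    """Return the ordered (hypothetical) steps one shared workflow would run."""
    steps = []
    if cfg.enabled:
        # Estimation always uses a fake process group (assumption from the PR text).
        steps.append("init_fake_process_group")
        if not cfg.disable_fake_mode:
            # Default estimation path: no real tensors are allocated.
            steps.append("enter_fake_tensor_mode")
    else:
        steps.append("init_real_process_group")

    # Shared setup: this is the part that would no longer be duplicated.
    steps += ["build_model", "apply_fsdp", "build_optimizer"]

    if cfg.enabled:
        steps += ["run_one_tracked_iteration", "report_peak_memory"]
    else:
        steps += ["train_loop", "save_checkpoint"]
    return steps


if __name__ == "__main__":
    for cfg in (
        MemoryEstimationConfig(),
        MemoryEstimationConfig(enabled=True),
        MemoryEstimationConfig(enabled=True, disable_fake_mode=True),
    ):
        print(cfg, "->", plan_steps(cfg))
```

The point of the sketch is that model construction and FSDP wrapping live in one place, so changes to the training setup would automatically be reflected in the estimate.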

cc: @awgu @tianyu-l

Originally posted by @sanketpurandare in https://github.com/pytorch/torchtitan/issues/424#issuecomment-2189676294