pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

[RFC] Optimizer CPU offload from torchao for single GPU low memory config #1278

Open gau-nernst opened 2 months ago

gau-nernst commented 2 months ago

The recent addition of optimizer CPU offload in torchao can be useful for single-GPU low-memory configs.

https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
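
For reference, usage looks roughly like this (a minimal sketch based on the torchao README; treat the exact kwargs, e.g. `fused=True` being forwarded to the base optimizer, as assumptions):

```python
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(1024, 1024, device="cuda")

# optimizer states live on CPU and the optimizer step runs there;
# extra kwargs (e.g. fused=True) are forwarded to the base torch.optim.AdamW
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)

x = torch.randn(16, 1024, device="cuda")
model(x).sum().backward()
optim.step()
optim.zero_grad()
```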

In my brief testing (https://github.com/pytorch/torchtune/compare/main...gau-nernst:torchtune:optim_offload), there is a ~25% increase in tok/s. Wandb project: https://wandb.ai/gau-nernst/torchtune. My system: 4070Ti SUPER (16GB VRAM), Ryzen 5600, DDR4 RAM.

[screenshot: benchmark results]

There is also a difference in how the two approaches handle gradient memory.

Regarding memory usage, it's pretty strange: in nvidia-smi, the paged Adam run also occupies a lot of memory (near 16GB). Perhaps bnb manages its own unified memory, so PyTorch doesn't report it? Also, for RAM usage, htop reports 55.5GB for paged Adam and 64.1GB for offload Adam.
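
(For anyone reproducing this: a hedged sketch of one way to compare the two views. PyTorch's allocator stats won't count memory that bnb allocates itself via CUDA unified memory, which would explain the nvidia-smi gap.)

```python
import torch

# what the PyTorch caching allocator knows about; this misses allocations made
# outside of it, e.g. bnb's paged (unified-memory) optimizer states
print(f"allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")

# driver-level view, closer to what nvidia-smi reports for the device
free, total = torch.cuda.mem_get_info()
print(f"driver-reported used: {(total - free) / 1e9:.2f} GB")
```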

We probably need more testing. In particular:

Regardless, I think adding an extra option for low-memory single-GPU training is beneficial, even if it is not well-tested yet.

cc @msaroufim

gau-nernst commented 2 months ago

Profiling trace for CPU offload Adam, bs=4, Llama2-7B

[screenshot: profiling trace]

From the screenshot:

For future extension: since CPU offload Adam already keeps a copy of the params on the CPU, we can extend this and implement:

Both of these approaches aim to reduce the memory footprint of the params, so we can use a larger batch size -> more work for the GPU while keeping the same amount of work for the CPU, reducing the CPU Adam bottleneck.

Update: proof-of-concept for mixed-precision training. Keep FP8 E4M3 params on the GPU (except the embedding layer and LM head) and BF16 params on the CPU; computation is still in BF16 (weights are upcast to BF16). Increasing the batch size improves throughput. Using a 4070Ti SUPER, tok/s matches the 4090 w/ paged Adam numbers from the torchtune README. The accuracy impact will probably need extensive experiments and investigation.
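
The storage trick, very roughly (a hypothetical, forward-only sketch; the actual PoC also needs the BF16 master weights on the CPU and custom autograd to route gradients to them, which is omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP8StoredLinear(nn.Module):
    """Hypothetical module: store weights on GPU as FP8 E4M3, compute in BF16."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # 1 byte/param on the GPU instead of 2 bytes for BF16
        self.register_buffer("weight_fp8", linear.weight.detach().to(torch.float8_e4m3fn))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_fp8.to(torch.bfloat16)  # upcast: computation is still in BF16
        return F.linear(x.to(torch.bfloat16), w, self.bias)
```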

[screenshot: mixed-precision PoC benchmark results]

gau-nernst commented 2 months ago

Benchmarks with Phi3-mini 4B, bs=16: ~30% improvement.

[screenshot: Phi3-mini benchmark results]

ebsmothers commented 2 months ago

This is great @gau-nernst! @msaroufim was just telling me about this; thanks for getting a prototype out with results so fast. I agree we should test it out a bit more, but I don't see any harm in supporting it as a flag for users to play with. The bit about slightly higher memory utilization compared to bnb is one area I'd want to understand better: have you observed cases where optimizer offload OOMs but bnb PagedAdam doesn't? Also cc @SalmanMohammadi, who's been putting together a tutorial on the different memory and perf levers we can pull.

gau-nernst commented 2 months ago

> have you observed cases where optimizer offload OOMs but bnb PagedAdam doesn't

From my limited testing, I haven't observed such cases. Seems like they both OOM at the same batch size.

Should I open a PR now, or do you want to do your own testing with my branch first? My branch is just a quick hack at the moment; I'll iron out the details when I create the PR.

ebsmothers commented 2 months ago

@gau-nernst a PR would be great. I can do a bit of testing myself in parallel, but that way we can also frontload any potential design discussions about how we expose this in our recipes.

gau-nernst commented 2 months ago

CPUOffloadOptimizer has the following signature:

```python
from typing import Type
from torch.optim import Optimizer

class CPUOffloadOptimizer:
    def __init__(self, params, optimizer_class: Type[Optimizer], *, offload_gradients: bool = False, **kwargs) -> None: ...
```

Ideally, offload_gradients should be exposed to the user too. The tricky part is that the config parser cannot parse optimizer_class from a string. I'm thinking of the following designs:

  1. Add an offload_optimizer flag (similar to optimizer_in_bwd): it's kinda tricky to expose offload_gradients this way, and adding an extra flag seems messy.
  2. Make a custom CPU offload Adam/AdamW class in torchtune: users just need to replace optimizer._component_, and offload_gradients is naturally exposed. The downside is that we'd have to write a custom CPU offload optimizer class for each base optimizer.
  3. Make a light wrapper around CPUOffloadOptimizer that parses the string into optimizer_class using the existing _get_component_from_path(). Users only need to replace optimizer._component_ as above. This seems to be the cleanest solution; see the sketch after this list.
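
A rough sketch of approach 3 (hedged: the class name and the _get_component_from_path import path are placeholders, not a final API):

```python
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
from torchtune.config._utils import _get_component_from_path  # placeholder import path

class CPUOffloadOptimizerWrapper(CPUOffloadOptimizer):
    """Placeholder name: lets YAML configs pass the base optimizer as a dotted string."""

    def __init__(self, params, optimizer_class: str = "torch.optim.AdamW", *,
                 offload_gradients: bool = False, **kwargs) -> None:
        # resolve e.g. "torch.optim.AdamW" (str) -> torch.optim.AdamW (class)
        base_cls = _get_component_from_path(optimizer_class)
        super().__init__(params, base_cls, offload_gradients=offload_gradients, **kwargs)
```

A config would then point optimizer._component_ at this wrapper and pass optimizer_class and offload_gradients as plain fields.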

The benefit of offload_gradients=False is that we can do gradient accumulation. I will test with Phi3-mini 4B whether using that option can improve tok/s over offload_gradients=True. If not, maybe we don't need to expose offload_gradients at all.
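
For concreteness, a sketch of the interaction (assumes offload_gradients=False, so gradients stay on the GPU between backward calls and can accumulate over micro-batches before a single CPU optimizer step):

```python
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(1024, 1024, device="cuda")
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW)  # offload_gradients=False

accum_steps = 4  # effective batch size = accum_steps x micro-batch size
for step in range(8):
    x = torch.randn(4, 1024, device="cuda")
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()  # grads accumulate on the GPU
    if (step + 1) % accum_steps == 0:
        optim.step()       # one CPU optimizer step per effective batch
        optim.zero_grad()
```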

gau-nernst commented 2 months ago

Tested offload_gradients=False for Phi3-mini 4B on my machine. The speed is terrible: the batch size is now limited to 1 (gradients stay on the GPU and take up extra VRAM), so training is bandwidth-bound. Anyway, I will go with approach 3 from my previous comment.

Another note: since CPUOffloadOptimizer is only available on the torchao main branch right now, should I wait until this feature makes it into the ao 0.5.0 release (I think about 1 month from now?), or is it OK to include a nightly feature from torchao?