gau-nernst opened 2 months ago
Profiling trace for CPU offload Adam, bs=4, Llama2-7B (screenshot).
For future extension: since CPU offload Adam already keeps a copy of the params on CPU, we can extend this and implement:

Both of these approaches aim to reduce the memory footprint from params, so we can use a larger batch size -> more work for the GPU while keeping the same amount of work for the CPU, reducing the CPU Adam bottleneck.
Update: proof of concept for mixed-precision training. Keep FP8 E4M3 params on GPU (except the embedding layer and LM head) and BF16 params on CPU; computation is still in BF16 (weights are upcast to BF16). Increase the batch size to improve throughput. Using a 4070Ti SUPER, tok/s matches a 4090 w/ paged Adam from the torchtune README. The accuracy issue will probably need extensive experiments and investigation.
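As a rough sanity check on where the savings come from, here is some back-of-envelope parameter-memory arithmetic (the ~7e9 parameter count for Llama2-7B is an approximation, and this ignores the embedding layer and LM head kept in BF16):

```python
# Back-of-envelope parameter-memory arithmetic.
# Element sizes: BF16 = 2 bytes, FP8 E4M3 = 1 byte.
def param_gib(num_params: float, bytes_per_element: int) -> float:
    """GiB occupied by the parameters at the given precision."""
    return num_params * bytes_per_element / 2**30

bf16_gib = param_gib(7e9, 2)   # ~13.0 GiB in BF16
fp8_gib = param_gib(7e9, 1)    # ~6.5 GiB in FP8 E4M3
saving = bf16_gib - fp8_gib    # ~6.5 GiB freed for activations / larger batches
```

On a 16GB card, that freed memory is what allows the larger batch sizes mentioned above.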
Benchmarks with Phi3-mini 4B, bs=16: ~30% improvement.
This is great @gau-nernst! @msaroufim was just telling me about this, thanks for getting a prototype out with results so fast. I agree we should test it out a bit more, but I don't see any harm in supporting it as a flag for users to play with. The bit about slightly higher memory util compared to bnb is one area I'd want to understand better: have you observed cases where optimizer offload OOMs but bnb PagedAdam doesn't? Also cc @SalmanMohammadi, who's been putting together a tutorial on the different memory and perf levers we can pull.
> have you observed cases where optimizer offload OOMs but bnb PagedAdam doesn't
From my limited testing, I haven't observed such cases. It seems they both OOM at the same batch size.

Should I open a PR now, or do you want to do your own testing with my branch first? My branch is just a quick hack at the moment; I will iron out the details when I create the PR.
@gau-nernst a PR would be great. I can do a bit of testing myself in parallel, but that way we can also frontload any potential design discussions about how we expose this in our recipes.
`CPUOffloadOptimizer` has the following signature:

```python
class CPUOffloadOptimizer:
    def __init__(self, params, optimizer_class: Type[Optimizer], *, offload_gradients: bool = False, **kwargs) -> None:
        ...
```

Ideally `offload_gradients` should be exposed to the user too. The tricky part is that the config parser cannot parse `optimizer_class` from a string. I'm thinking of the following designs:
1. An `offload_optimizer` flag (similar to `optimizer_in_bwd`): kinda tricky to expose `offload_gradients`, and adding an extra flag seems messy.
2. A custom CPU offload optimizer class that users select via `optimizer._component_`: `offload_gradients` is naturally exposed. The downside is that we have to write a custom CPU offload optimizer class for each base optimizer.
3. A wrapper around `CPUOffloadOptimizer` that parses a string into `optimizer_class`, using the existing `_get_component_from_path()`: users only need to replace `optimizer._component_` like above. This seems to be the cleanest solution.

The benefit of `offload_gradients=False` is that we can do gradient accumulation. I will test with Phi3-mini 4B whether that option can improve tok/s over `offload_gradients=True`. If not, maybe we don't need to expose `offload_gradients` at all.
Tested `offload_gradients=False` for Phi3-mini 4B on my machine. The speed is terrible, because the batch size is now limited to 1, so training is bandwidth-bound. Anyway, I will go with approach 3 from my previous comment.
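For context on the gradient-accumulation point raised earlier: keeping gradients resident (`offload_gradients=False`) lets several backward passes sum into the same buffers before one optimizer step, whereas freeing them after each device-to-host transfer does not. A torch-free toy sketch of that accumulate-then-step pattern (all names here are illustrative):

```python
# Toy model of gradient accumulation: grads from several micro-batches are
# summed into a resident buffer, and the optimizer steps once per group.
def train_steps(micro_batch_grads, accum_steps):
    grad = 0.0      # stand-in for a parameter's .grad buffer kept on the GPU
    applied = []    # effective gradient consumed at each optimizer step
    for i, g in enumerate(micro_batch_grads, start=1):
        grad += g   # backward() accumulates into .grad
        if i % accum_steps == 0:
            applied.append(grad / accum_steps)  # optimizer.step() on the mean
            grad = 0.0                          # optimizer.zero_grad()
    return applied
```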
Another note: since `CPUOffloadOptimizer` is only available on the torchao main branch right now, should I wait until this feature makes it into the ao 0.5.0 release (I think about 1 month from now?), or is it OK to include a nightly feature from torchao?
The recent addition of optimizer CPU offload in torchao can be useful for single-GPU low-memory configs.
https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
In my brief testing (https://github.com/pytorch/torchtune/compare/main...gau-nernst:torchtune:optim_offload), there is a ~25% increase in tok/s. Wandb project: https://wandb.ai/gau-nernst/torchtune. My system: 4070Ti SUPER (16GB VRAM), Ryzen 5600, DDR4.
There is also a difference in how gradient memory is handled:

- `offload_gradients=True` in `CPUOffloadOptimizer`, which frees gradients once the device-to-host transfer finishes.
- `optimizer_in_bwd=True`.

Regarding memory usage, it's pretty strange: in nvidia-smi, the paged Adam run also occupies a lot of memory (near 16GB). Perhaps bnb manages its own unified memory, so PyTorch doesn't report it? Also, for RAM usage, htop reports 55.5GB for paged Adam and 64.1GB for offload Adam.
We probably need more testing. In particular:

- `expandable_segments:True` to prevent OOM in the middle of training. Memory spike behavior might be unpredictable with CPU offload Adam, since it is not well tested. The spike might come from gradient offloading (ref: https://github.com/pytorch/ao/pull/584#discussion_r1704667190, not 100% sure). I haven't tested paged Adam without `expandable_segments:True` yet.

Regardless, I think adding an extra option for low-memory single-GPU training is beneficial, even if it is not well tested yet.
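For reference, the allocator mode above is set through PyTorch's `PYTORCH_CUDA_ALLOC_CONF` environment variable before launching training; a sketch (the recipe/config name below is illustrative, not a confirmed torchtune config):

```shell
# Enable PyTorch's expandable-segments CUDA allocator mode to reduce
# fragmentation-driven OOMs during training.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then launch the fine-tuning run as usual (recipe name is illustrative):
tune run full_finetune_single_device --config llama2/7B_full_low_memory
```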
cc @msaroufim