nathan-az closed this issue 7 months ago
Agreed, I noticed the same thing: even without any offloading, ZeRO-3 should shard the model across GPUs. For example, with Llama 2 70B using the HF Trainer DeepSpeed integration, I see ~20GB VRAM used on each GPU of an 8xA100 node after loading the model, whereas in this codebase I get OOM. Similarly, when loading in 8-bit I expect ~10GB VRAM on each GPU but get 70GB on each GPU, etc.
Bug report that I'm hoping to fix with https://github.com/huggingface/alignment-handbook/pull/51
Right now VRAM usage is high even when using CPU offloading for parameters with ZeRO stage 3. My guess is that whenever a GPU is detected, the model is moved onto it regardless of the DeepSpeed settings, which defeats ZeRO-3's parameter sharding.
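A plausible fix is to guard the device placement on the DeepSpeed config rather than on GPU availability alone. The helper below is a hypothetical sketch (the function name and config handling are my own, not from the handbook); `transformers` exposes a similar check via `is_deepspeed_zero3_enabled()`.

```python
def should_skip_device_placement(ds_config) -> bool:
    """Return True when DeepSpeed ZeRO stage 3 will shard the model itself.

    Moving the full model onto a single GPU before DeepSpeed initializes
    materializes every parameter on each rank, inflating per-GPU VRAM.
    """
    if not ds_config:
        return False
    return ds_config.get("zero_optimization", {}).get("stage") == 3


# Hypothetical usage in a training script, assuming `ds_config` is the
# parsed DeepSpeed JSON config (names are illustrative, not the handbook's):
#
# if torch.cuda.is_available() and not should_skip_device_placement(ds_config):
#     model = model.to("cuda")
```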