nathan-az closed this issue 7 months ago
Agreed, I noticed the same thing: even without any offloading, ZeRO-3 should shard the model across GPUs. For example, with Llama 2 70B using the HF Trainer DeepSpeed integration, I see ~20GB VRAM used on each GPU of an 8xA100 node after loading the model, whereas in this codebase I get OOM. Similarly, when loading in 8-bit I expect ~10GB VRAM on each GPU but get 70GB on each GPU, etc.
Bug report that I'm hoping to fix with https://github.com/huggingface/alignment-handbook/pull/51
Right now VRAM usage is high even when using CPU offloading for parameters with ZeRO stage 3. My guess is that whenever a GPU is detected, the model is moved onto it regardless of the DeepSpeed settings, which defeats ZeRO-3's parameter sharding.
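A plausible fix is to guard the device placement on the DeepSpeed config rather than on GPU availability alone. The helper below is a hypothetical sketch (the function name and config handling are my own, not from the handbook); `transformers` exposes a similar check via `is_deepspeed_zero3_enabled()`.

```python
def should_skip_device_placement(ds_config) -> bool:
    """Return True when DeepSpeed ZeRO stage 3 will shard the model itself.

    Moving the full model onto a single GPU before DeepSpeed initializes
    materializes every parameter on each rank, inflating per-GPU VRAM.
    """
    if not ds_config:
        return False
    return ds_config.get("zero_optimization", {}).get("stage") == 3


# Hypothetical usage in a training script, assuming `ds_config` is the
# parsed DeepSpeed JSON config (names are illustrative, not the handbook's):
#
# if torch.cuda.is_available() and not should_skip_device_placement(ds_config):
#     model = model.to("cuda")
```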