huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Qwen2-72B inference on 8x Gaudi2 gets OOM issue due to missing meta-device support on model loading #1112

Closed LeoZhao-Intel closed 2 weeks ago

LeoZhao-Intel commented 2 months ago

System Info

optimum-habana: 1.12.0
SynapseAI: 1.16.1

Reproduction

Steps to reproduce:

```shell
deepspeed --num_nodes 1 --num_gpus 8 --master_addr 127.0.0.1 --master_port 60008 run_generation.py \
  --model_name_or_path /data/Qwen2-72B-Instruct \
  --max_input_tokens 1024 \
  --max_new_tokens 512 \
  --batch_size 1 \
  --n_iterations 2 \
  --warmup 3 \
  --bf16 \
  --use_hpu_graphs \
  --use_kv_cache \
  --reuse_cache \
  --use_flash_attention \
  --trim_logits \
  --limit_hpu_graphs \
  --bucket_internal \
  --bucket_size 512
```

This triggers a host (system) OOM on a machine with 1 TB of memory, rather than a Gaudi2 device OOM.

The root cause is that the current optimum-habana and DeepSpeed do not yet support loading the model on the meta device, so every rank materializes a full copy of the weights in host memory during the loading stage. For a 72B-parameter model in bf16 that is roughly 144 GB per rank, so 8 ranks need on the order of 1.15 TB, which exceeds the 1 TB available.

A workaround is to use 4x Gaudi2 instead of 8x (roughly 576 GB of host memory during loading), which avoids the system OOM.
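To illustrate what meta-device support avoids, here is a minimal PyTorch sketch (not the optimum-habana or DeepSpeed code path, just the underlying mechanism): constructing a module on the `meta` device records only shapes and dtypes, allocating no host RAM, and storage is materialized later, e.g. per-rank for only the shard each rank actually needs.

```python
import torch

# On the meta device, weights have shapes/dtypes but no backing storage,
# so even a huge layer costs essentially no host memory.
layer = torch.nn.Linear(8192, 8192, device="meta")
assert layer.weight.is_meta          # no storage allocated yet

# Later, a rank materializes real (uninitialized) storage and then fills
# it from its shard of the checkpoint.
layer = layer.to_empty(device="cpu")
assert not layer.weight.is_meta      # storage now exists, contents undefined
```

With this pattern, peak host memory during loading scales with one shard per rank instead of one full model copy per rank.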

Expected behavior

Qwen2-72B inference on 8x Gaudi2 should run on a system with 1 TB of host memory.

regisss commented 1 month ago

I think here we are constrained by DeepSpeed unfortunately :/

LeoZhao-Intel commented 1 month ago

Understood, we need changes in both optimum-habana and the deepspeed-fork. I will submit a JIRA ticket through the internal system.

LeoZhao-Intel commented 1 month ago

https://github.com/huggingface/optimum-habana/pull/1151

This is the fix for this issue; it also depends on a new DeepSpeed version.

regisss commented 3 weeks ago

@LeoZhao-Intel Now that Synapse 1.17 is out and #1163 has been merged, this should work.

LeoZhao-Intel commented 3 weeks ago

Sure, I will verify on the latest 1.17.

LeoZhao-Intel commented 2 weeks ago

Verified on 1.17; it is fixed.