intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

Inference Mixtral on Gaudi #249

Open Deegue opened 5 months ago

Deegue commented 5 months ago

Model: mistralai/Mixtral-8x7B-Instruct-v0.1

When deployed on a single card, it reports an OOM error:

```
(ServeController pid=207518)   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 825, in _apply
(ServeController pid=207518)     param_applied = fn(param)
(ServeController pid=207518)   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1153, in convert
(ServeController pid=207518)     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
(ServeController pid=207518)   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in torch_function
(ServeController pid=207518)     return super().torch_function(func, types, new_args, kwargs)
(ServeController pid=207518) RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB
```

Before the error occurred, memory usage looked like this: [image]

With 8 cards and DeepSpeed, the model deploys successfully. Memory usage looked like this: [image]

I suspect queries sometimes fail because there are not enough cards available for the deployment; it runs well once I kill all other parallel tasks. A rough memory estimate follows below.
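For context, here is a back-of-the-envelope estimate of why a single card OOMs while 8 cards with DeepSpeed fit. The ~46.7B parameter count is the figure commonly quoted for Mixtral-8x7B, and the ~96 GB HBM figure assumes Gaudi2 cards; both are assumptions, not measurements from this run:

```python
# Rough estimate of Mixtral-8x7B weight memory in bf16.
# Assumptions: ~46.7B total parameters (all experts) and 2 bytes per
# parameter; activations, KV cache and HPU graphs are extra on top.

TOTAL_PARAMS = 46.7e9      # Mixtral-8x7B-Instruct-v0.1, assumed total parameter count
BYTES_PER_PARAM = 2        # bf16

weights_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30
print(f"weights alone: {weights_gib:.1f} GiB")      # ~87.0 GiB

# Single card: ~87 GiB of weights plus Ray Serve / HPU graph overhead does not
# fit in the roughly 96 GB of HBM on one (assumed) Gaudi2 card -> OOM.

# 8 cards with DeepSpeed, with the weights sharded across ranks:
per_card_gib = weights_gib / 8
print(f"per-card share of weights: {per_card_gib:.1f} GiB")   # ~10.9 GiB
```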

The expected result looks like this:

```
You are a helpful assistant.
Instruction: Tell me a long story with many words.
Response: Absolutely, I would be more than happy to assist you!
Instruction: This should be more complex.
Response: Certainly, I would be more than happy to assist you!
Instruction: This task is for the helper to return a complex sentence with many words. Tell me a long story and I will reply that I like long or complex sentences. Also, I am asking many question and expecting answers.
Response: As an AI language model, I can generate complex sentences with many words. Please provide more details or a specific context for the story you want me to
```

carsonwang commented 5 months ago

Please run this only on a single card. Multiple cards are not supported according to Habana's documentation. Please check the following document and run it successfully without Ray first: https://github.com/huggingface/optimum-habana

Deegue commented 5 months ago

> Please run this only on a single card. Multiple cards are not supported according to Habana's documentation. Please check the following document and run it successfully without Ray first: https://github.com/huggingface/optimum-habana

The result of running with a single card is noted above. I have also run the same model without Ray, and it succeeds:

```
Input/outputs:
input 1: ('Tell me a long story with many words.',)
output 1: ('Tell me a long story with many words.\n\nOnce upon a time, in a land far, far away, there was a beautiful princess named Sophia. She had long, golden hair that shone like the sun, and deep blue eyes that sparkled like the ocean. She lived in a grand castle on the top of a hill, surrounded by lush gardens and rolling meadows.\n\nSophia was loved by all who knew her, but she was lonely. She longed for someone to share her life with,',)

Stats:
Throughput (including tokenization) = 23.7284528351755 tokens/second
Number of HPU graphs = 16
Memory allocated = 87.63 GB
Max memory allocated = 87.63 GB
Total memory available = 94.62 GB
Graph compilation duration = 13.682237292639911 seconds
```

Memory usage stays below the single-card limit: [image]
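As a quick sanity check on the headroom, using only the numbers from the optimum-habana log above (whether Ray Serve's extra per-process overhead is exactly what consumes this margin is my guess, not something measured here):

```python
# Headroom in the successful non-Ray run, taken from the run_generation.py
# stats above (values in GB as reported by the script).
total_hbm_gb = 94.62    # "Total memory available"
peak_alloc_gb = 87.63   # "Max memory allocated" (weights + HPU graphs)
print(f"headroom without Ray: {total_hbm_gb - peak_alloc_gb:.2f} GB")   # ~6.99 GB

# Under Ray Serve the load failed on a single 224 MiB parameter allocation,
# so whatever additional memory the Ray deployment holds on the card only
# needs to eat into this ~7 GB margin to push the load into OOM.
failed_alloc_mib = 234881024 / 2**20
print(f"failed allocation: {failed_alloc_mib:.0f} MiB")                 # 224 MiB
```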

Deegue commented 5 months ago

By the way, the command used to run Mixtral on Habana without Ray is:

```
python run_generation.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --batch_size 1 \
    --max_new_tokens 100 \
    --use_kv_cache \
    --use_hpu_graphs \
    --bf16 \
    --token xxx \
    --prompt 'Tell me a long story with many words.'
```
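To compare the Ray and non-Ray setups, one could print the same memory counters that run_generation.py reports from inside the serving process. A minimal sketch, assuming the Habana memory-metrics helpers are available under habana_frameworks.torch.hpu (the module path and function names are my assumption, not taken from this thread):

```python
# Hypothetical helper to log HPU memory counters, assuming
# habana_frameworks.torch.hpu exposes memory_allocated() and
# max_memory_allocated() (counters in bytes).
import habana_frameworks.torch.hpu as hthpu

def report_hpu_memory(tag: str) -> None:
    # Convert to GB so the numbers line up with the run_generation.py stats above.
    allocated_gb = hthpu.memory_allocated() / 1e9
    peak_gb = hthpu.max_memory_allocated() / 1e9
    print(f"[{tag}] allocated={allocated_gb:.2f} GB, peak={peak_gb:.2f} GB")
```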