Deegue opened 5 months ago
Please only run this on a single card. Multi cards are not supported according to Habana's document. Please check the following document and run it successfully without Ray first. https://github.com/huggingface/optimum-habana
The result of running with a single card was noted above. I have run the same model without Ray, and the result is successful:
```
Input/outputs:
input 1: ('Tell me a long story with many words.',)
output 1: ('Tell me a long story with many words.\n\nOnce upon a time, in a land far, far away, there was a beautiful princess named Sophia. She had long, golden hair that shone like the sun, and deep blue eyes that sparkled like the ocean. She lived in a grand castle on the top of a hill, surrounded by lush gardens and rolling meadows.\n\nSophia was loved by all who knew her, but she was lonely. She longed for someone to share her life with,',)

Stats:
Throughput (including tokenization) = 23.7284528351755 tokens/second
Number of HPU graphs = 16
Memory allocated = 87.63 GB
Max memory allocated = 87.63 GB
Total memory available = 94.62 GB
Graph compilation duration = 13.682237292639911 seconds
```
Memory usage is below the limit of a single card:
By the way, the command for running the Mixtral model on Habana without Ray is:
```bash
python run_generation.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --batch_size 1 \
    --max_new_tokens 100 \
    --use_kv_cache \
    --use_hpu_graphs \
    --bf16 \
    --token xxx \
    --prompt 'Tell me a long story with many words.'
```
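To confirm that such a run stays under the ~94.62 GB reported as available on a single card, the per-card memory can be watched in a second terminal while the command runs. This is a minimal sketch, assuming Habana's hl-smi tool is on the PATH (its exact output layout can differ between SynapseAI releases):

```bash
# Refresh the per-card HPU report (utilization and memory) every second
# while run_generation.py is running in another terminal.
watch -n 1 hl-smi
```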
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
When deployed on a single card, it reports an OOM error:
Before the error was raised, memory usage looked like this:
With 8 cards and DeepSpeed, the model is deployed successfully. Memory usage looked like this:
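For comparison outside Ray, the optimum-habana text-generation examples launch multi-card DeepSpeed runs through the gaudi_spawn.py helper. The sketch below is an assumption based on that pattern; the relative path to gaudi_spawn.py and the exact flag set may need adjusting for your checkout:

```bash
# Assumed 8-card DeepSpeed launch using optimum-habana's gaudi_spawn.py helper;
# the path to gaudi_spawn.py depends on where run_generation.py sits in the repo.
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --batch_size 1 \
    --max_new_tokens 100 \
    --use_kv_cache \
    --use_hpu_graphs \
    --bf16 \
    --token xxx \
    --prompt 'Tell me a long story with many words.'
```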
I guess queries sometimes fail because there are not enough cards free for the deployment; it runs well once I kill all other parallel tasks. A quick pre-flight check is sketched below.
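One way to catch this before sending a query is to check how many HPUs the Ray cluster currently reports as free. A minimal sketch, assuming a running Ray cluster whose nodes register HPU resources (the "HPU" entry only appears if the cluster was started that way):

```bash
# Print the cluster's resource totals and current usage; the Resources section
# shows how many HPUs are already claimed by other deployments or tasks.
ray status
```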
The correct result looks like this: