Closed avinashkarani closed 11 months ago
Hi @avinashkarani! Yes, that's expected for now. I'll work soon on adding support for DeepSpeed-inference so that we can use parallelism.
Hi @regisss, is there an expected timeline for this feature? We would like to demonstrate Llama2 performance scaling with multiple Gaudi instances (AWS EC2 Gaudi1 with 8 HPUs, Intel Developer Cloud with Gaudi2 with 8 HPUs) for at least 10 simultaneous users, hence the urgency of this request.
@premraot When do you need it?
@regisss This is currently a blocker for us. Once we are able to scale, we would like to take this to our production environment.
Okay, I'll look into it in the next few days, but it may require more time as it might not be straightforward at all.
Hello @regisss, just wanted to check if you have a timeline for the DeepSpeed integration.
@avinashkarani Hard to say as there are many things to do, but the goal is to work on it within the next 2 weeks.
@avinashkarani It took longer than expected, but https://github.com/huggingface/optimum-habana/pull/485 should enable using DeepSpeed-inference and sharded models.
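For context, sharding a model across several devices with TGI is driven by the launcher's `--num-shard` flag. A minimal sketch of such a launch is below; the model id and port are placeholders, not values taken from this thread:

```shell
# Sketch: ask the TGI launcher to shard the model across 2 devices.
# --num-shard is the standard text-generation-launcher flag;
# model id and port here are illustrative placeholders.
text-generation-launcher \
  --model-id meta-llama/Llama-2-7b-hf \
  --num-shard 2 \
  --port 8080
```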
System Info

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Log:

```
2023-08-31T19:25:28.057753Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-08-31T19:25:28.057837Z  INFO download: text_generation_launcher: Starting download process.
2023-08-31T19:25:35.148837Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-08-31T19:25:35.468574Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-08-31T19:25:35.468985Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-08-31T19:25:35.468987Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-08-31T19:25:45.485025Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-08-31T19:25:45.485090Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-08-31T19:25:55.500375Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-08-31T19:25:55.500375Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-08-31T19:26:05.515041Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-08-31T19:26:05.515110Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-08-31T19:26:11.027830Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-08-31T19:26:11.123093Z  INFO shard-manager: text_generation_launcher: Shard ready in 35.652902762s rank=0
2023-08-31T19:26:11.312428Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-08-31T19:26:15.529479Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
```
Expected behavior
The goal is to run the Llama2 7B model with TGI and connect it with LangChain. With one HPU, performance is very low, so we are trying to leverage multiple HPUs to improve performance.
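Once the server is up, LangChain (or any client) talks to TGI over its HTTP `/generate` route. Below is a minimal sketch of the JSON body that route expects; the endpoint URL is an assumed local deployment, and the sampling parameters are illustrative:

```python
import json

# Assumed local TGI deployment; not a value taken from this thread.
TGI_ENDPOINT = "http://localhost:8080/generate"

def build_generate_payload(prompt: str, max_new_tokens: int = 128) -> str:
    """Build the JSON body for TGI's /generate route.

    TGI expects an "inputs" string plus a "parameters" object;
    the parameter values here are illustrative defaults.
    """
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.7,
        },
    }
    return json.dumps(body)

# A client (e.g. LangChain's TGI integration) would POST this body
# to TGI_ENDPOINT with Content-Type: application/json.
payload = build_generate_payload("What is tensor parallelism?")
print(payload)
```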