Closed avinashkarani closed 11 months ago
Hi @avinashkarani! Yes, that's expected for now. I'll work soon on adding support for DeepSpeed-inference so that we can use parallelism.
Hi @regisss, is there an expected timeline for this feature? We would like to demonstrate Llama2 performance scaling with multiple Gaudi instances (AWS EC2 Gaudi1 with 8 HPUs, Intel Developer Cloud with Gaudi2 with 8 HPUs) for at least 10 simultaneous users, hence the urgency of this request.
@premraot When do you need it?
@regisss This is currently a blocker for us. Once we are able to scale, we would like to take this to our production environment.
Okay, I'll look into it in the next few days, but it may require more time as it might not be straightforward at all.
Hello @regisss, just wanted to check if you have a timeline for the DeepSpeed integration.
@avinashkarani Hard to say as there are many things to do, but the goal is to work on it within the next 2 weeks.
@avinashkarani It took longer than expected, but https://github.com/huggingface/optimum-habana/pull/485 should enable using DeepSpeed-inference and sharded models.
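For context, sharding a model across several devices with TGI is driven by the launcher's `--num-shard` flag. A minimal sketch of such a launch is below; the model id and port are placeholders, not values taken from this thread:

```shell
# Sketch: ask the TGI launcher to shard the model across 2 devices.
# --num-shard is the standard text-generation-launcher flag;
# model id and port here are illustrative placeholders.
text-generation-launcher \
  --model-id meta-llama/Llama-2-7b-hf \
  --num-shard 2 \
  --port 8080
```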
System Info

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Log:

```
2023-08-31T19:25:28.057753Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-08-31T19:25:28.057837Z  INFO download: text_generation_launcher: Starting download process.
2023-08-31T19:25:35.148837Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-08-31T19:25:35.468574Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-08-31T19:25:35.468985Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-08-31T19:25:35.468987Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-08-31T19:25:45.485025Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-08-31T19:25:45.485090Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-08-31T19:25:55.500375Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-08-31T19:25:55.500375Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-08-31T19:26:05.515041Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-08-31T19:26:05.515110Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-08-31T19:26:11.027830Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-08-31T19:26:11.123093Z  INFO shard-manager: text_generation_launcher: Shard ready in 35.652902762s rank=0
2023-08-31T19:26:11.312428Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-08-31T19:26:15.529479Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
```
Expected behavior
The goal is to run the Llama2 7B model with TGI and connect it with LangChain. With one HPU, performance is very low, so we are trying to leverage multiple HPUs to improve performance.
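Once the server is up, LangChain (or any client) talks to TGI over its HTTP `/generate` route. Below is a minimal sketch of the JSON body that route expects; the endpoint URL is an assumed local deployment, and the sampling parameters are illustrative:

```python
import json

# Assumed local TGI deployment; not a value taken from this thread.
TGI_ENDPOINT = "http://localhost:8080/generate"

def build_generate_payload(prompt: str, max_new_tokens: int = 128) -> str:
    """Build the JSON body for TGI's /generate route.

    TGI expects an "inputs" string plus a "parameters" object;
    the parameter values here are illustrative defaults.
    """
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.7,
        },
    }
    return json.dumps(body)

# A client (e.g. LangChain's TGI integration) would POST this body
# to TGI_ENDPOINT with Content-Type: application/json.
payload = build_generate_payload("What is tensor parallelism?")
print(payload)
```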