huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Can no longer shard Llama-2-7b-chat-hf model #1106

Closed · penguin-jeff closed this issue 8 months ago

penguin-jeff commented 1 year ago

System Info

TGI version: 1.0.3
Model: meta-llama/Llama-2-7b-chat-hf
AWS instance: g5.12xlarge
OS: Ubuntu 22.04.3 LTS
NVIDIA driver: 535.113.01 (CUDA 12.2)
GPU: NVIDIA A10G (x4)
Docker: 24.0.6, build ed223bc

Reproduction

Run the following Docker command on a g5.12xlarge instance in AWS:

docker run -d --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -p 8080:80 -v $HOME/tgi/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2 --max-total-tokens 4096

Error logs after running docker logs <container_id>:

2023-10-05T22:16:51.388477Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Llama-2-7b-chat-hf", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0be6da9191cc", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-05T22:16:51.388512Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-10-05T22:16:51.388604Z  INFO download: text_generation_launcher: Starting download process.
2023-10-05T22:16:54.982929Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-05T22:16:55.592752Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-05T22:16:55.593045Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-05T22:16:55.593097Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-10-05T22:17:05.602246Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:17:05.602348Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:17:15.610219Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:17:15.610465Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:17:25.618409Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:17:25.618579Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:17:35.626514Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:17:35.626746Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:17:45.634494Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:17:45.634699Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:17:55.642496Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:17:55.642700Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:18:05.650446Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-05T22:18:05.650749Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-10-05T22:18:10.854898Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69027 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69027 milliseconds before timing out. rank=0
2023-10-05T22:18:10.854939Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0
2023-10-05T22:18:10.949666Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-05T22:18:10.949695Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
2023-10-05T22:18:11.008783Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1

Expected behavior

The model is sharded across 2 GPUs and the server starts.

Note: this was working for me until around September 28th, for reasons I can't explain. I have also tried version 1.1.0 and a fresh g5.12xlarge instance, and hit the same issue.

Narsil commented 1 year ago
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69027 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Just retry; NCCL can be a bit picky at times, but this isn't an issue in TGI itself. You could also try not running Docker in daemon mode, in case an old NCCL process is still lingering somewhere and interfering with the current one.
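For reference, a minimal sketch of retrying in the foreground with NCCL's own diagnostics turned on (NCCL_DEBUG=INFO is a standard NCCL environment variable, not a TGI flag; everything else mirrors the original command):

docker run --rm -it --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e NCCL_DEBUG=INFO \
  -p 8080:80 -v $HOME/tgi/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2 --max-total-tokens 4096

Dropping -d keeps the logs in your terminal, and --rm removes the container (and whatever NCCL state it holds) as soon as it exits.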

penguin-jeff commented 1 year ago

After running it multiple times, I eventually received a slightly different error message with an extra line thrown in, though I'm not sure how much help it is.

2023-10-05T23:40:07.398435Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

-> [W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69108 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 69108 milliseconds before timing out. rank=1
Narsil commented 1 year ago

-> [W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).

Some process is already using that port. Something in your setup is not cleaning up properly when it finishes.

You can try playing around with these settings to find an unused port: https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#shardudspath

(It's probably better to clean up whatever is using those ports, if you can, though.)
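As an illustrative sketch only (the port 29510 and the UDS path suffix below are arbitrary example values, not recommendations), you could first look for stale containers left over from earlier daemon-mode runs and then relaunch with different coordination settings via the launcher flags linked above:

docker ps -a | grep text-generation-inference   # look for leftover containers still holding GPUs or ports
docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -p 8080:80 -v $HOME/tgi/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2 --max-total-tokens 4096 \
  --master-port 29510 \
  --shard-uds-path /tmp/text-generation-server-2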

rakesh-krishna commented 1 year ago

@penguin-jeff Hi, I am getting the same error, but with WizardLM/WizardLM-70B-V1.0. Did you find a solution?

First error:

INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68089 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68089 milliseconds before timing out. rank=2
ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=2
ERROR text_generation_launcher: Shard 2 failed to start
INFO text_generation_launcher: Shutting down shards
ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Second error:

You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68122 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 68122 milliseconds before timing out. rank=3
ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=3

I also tried setting a different --master-port but still got the second error.

ZQ-Dev8 commented 12 months ago

Bumping this issue. I am also unable to load Llama-2-7b-chat with 4 shards. Strangely enough, 1-2 shards work fine. This is on a node with 4x A6000s.
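As a general diagnostic (not something suggested in this thread), it can help to check how the GPUs are interconnected before blaming the shard count, since NCCL all-reduce hangs are often topology-related:

nvidia-smi topo -m   # prints the GPU-to-GPU interconnect matrix (NV#, PIX, PHB, NODE, SYS)

Combined with NCCL_DEBUG=INFO (as sketched earlier in the thread), this can help narrow the hang down to a specific link or pair of GPUs.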

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Narsil commented 8 months ago

Strangely enough, 1-2 shards works fine. This is on a node with 4xA6000s.

This really seems like an issue with your provider/host. I'd suggest contacting them.