huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Gets stuck when running text-generation-benchmark on AMD GPU #2077

Open · yuqie opened this issue 3 weeks ago

yuqie commented 3 weeks ago

System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: 96b7b40ca3e39f7ca5b875bff9a4665c1b175289
Docker label: sha-96b7b40-rocm

Information

Tasks

Reproduction

I followed the steps from https://github.com/huggingface/hf-rocm-benchmark

  1. Start the Docker container; a local model is used and the server is set up successfully.
    docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
    --net host -v $(pwd)/hf_cache:/data -e HUGGING_FACE_HUB_TOKEN=$HF_READ_TOKEN \
    ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
    --model-id local_path/Meta-Llama-70B-Instruct --num-shard 8
  2. Open another shell: docker exec -it tgi_container_name /bin/bash
  3. Run the benchmark:
    text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
    --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
    -b 1 -b 2 

    It got stuck after the following log output (a quick server sanity check is sketched after this list):

    2024-06-17T11:01:59.291750Z  INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
    2024-06-17T11:01:59.291802Z  INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
    2024-06-17T11:01:59.336401Z  INFO text_generation_benchmark: benchmark/src/main.rs:161: Tokenizer loaded
    2024-06-17T11:01:59.365280Z  INFO text_generation_benchmark: benchmark/src/main.rs:170: Connect to model server
    2024-06-17T11:01:59.368575Z  INFO text_generation_benchmark: benchmark/src/main.rs:179: Connected
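
As a sanity check (not part of the original steps; assuming the router listens on its default port 80 inside the container, adjust if --port was set), the webserver can be queried directly from the second shell to confirm the model server itself still answers requests:

    # Hypothetical sanity check: send a single generate request to the TGI router.
    # Port 80 is the launcher default; change it if the container was started with --port.
    curl 127.0.0.1:80/generate \
        -X POST \
        -H 'Content-Type: application/json' \
        -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 16}}'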

I also tried llama2-7b on a single GPU card with a sequence length of 512 and a decode length of 128, but it got stuck too.

2024-06-17T10:54:34.661975Z  INFO text_generation_launcher: Convert: [1/2] -- Took: 0:00:23.355863
2024-06-17T10:54:42.624075Z  INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:07.961668
2024-06-17T10:54:43.550339Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-17T10:54:43.550676Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-17T10:54:46.861699Z  INFO text_generation_launcher: Detected system rocm
2024-06-17T10:54:46.929654Z  INFO text_generation_launcher: ROCm: using Flash Attention 2 Composable Kernel implementation.
2024-06-17T10:54:47.181972Z  WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-06-17T10:54:53.564579Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-17T10:54:58.632695Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-17T10:54:58.670817Z  INFO shard-manager: text_generation_launcher: Shard ready in 15.119042733s rank=0
2024-06-17T10:54:58.766242Z  INFO text_generation_launcher: Starting Webserver
2024-06-17T10:54:58.849177Z  INFO text_generation_router: router/src/main.rs:302: Using config Some(Llama)
2024-06-17T10:54:58.849209Z  WARN text_generation_router: router/src/main.rs:311: no pipeline tag found for model /home/zhuh/7b-chat-hf
2024-06-17T10:54:58.849213Z  WARN text_generation_router: router/src/main.rs:329: Invalid hostname, defaulting to 0.0.0.0
2024-06-17T10:54:58.853566Z  INFO text_generation_router::server: router/src/server.rs:1552: Warming up model
2024-06-17T10:54:59.601144Z  INFO text_generation_launcher: PyTorch TunableOp (https://github.com/fxmarty/pytorch/tree/2.3-patched/aten/src/ATen/cuda/tunable) is enabled. The warmup may take several minutes, picking the ROCm optimal matrix multiplication kernel for the target lengths 1, 2, 4, 8, 16, 32, with typical 5-8% latency improvement for small sequence lengths. The picked GEMMs are saved in the file /data/tunableop_-home-zhuh-7b-chat-hf_tp1_rank0.csv. To disable TunableOp, please launch TGI with `PYTORCH_TUNABLEOP_ENABLED=0`.
2024-06-17T10:54:59.601247Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=1
2024-06-17T10:55:46.295162Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=2
2024-06-17T10:56:18.910991Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=4
2024-06-17T10:56:51.715308Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=8
2024-06-17T10:57:24.784412Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=16
2024-06-17T10:57:59.430531Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=32
2024-06-17T10:58:29.335915Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-17T10:58:30.344828Z  INFO text_generation_router::server: router/src/server.rs:1579: Using scheduler V3
2024-06-17T10:58:30.344853Z  INFO text_generation_router::server: router/src/server.rs:1631: Setting max batch total tokens to 346576
2024-06-17T10:58:30.360395Z  INFO text_generation_router::server: router/src/server.rs:1868: Connected
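
One variable worth ruling out (a speculative sketch, not a confirmed cause): the TunableOp warmup mentioned in the log above. The log itself states it can be disabled by launching TGI with PYTORCH_TUNABLEOP_ENABLED=0, e.g. by adding that environment variable to the original docker run command:

    # Speculative: relaunch with TunableOp disabled, as suggested by the warmup log,
    # keeping the other flags from the original docker run command unchanged.
    docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
        --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
        --net host -v $(pwd)/hf_cache:/data -e HUGGING_FACE_HUB_TOKEN=$HF_READ_TOKEN \
        -e PYTORCH_TUNABLEOP_ENABLED=0 \
        ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
        --model-id local_path/Meta-Llama-70B-Instruct --num-shard 8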

Expected behavior

Prefill and decode latency numbers are expected as output, but the benchmark gets stuck and produces nothing for nearly an hour. In addition, GPU utilization is zero, whereas it was non-zero during the warmup steps.
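
For reference, GPU utilization on the AMD cards can be watched from inside the container with the ROCm CLI (assuming rocm-smi is available, as in the ROCm TGI images); during the hang it stays at 0%:

    # Watch GPU utilization once per second; rocm-smi ships with the ROCm stack.
    watch -n 1 rocm-smi --showuse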

LysandreJik commented 2 weeks ago

Thanks for the report @yuqie!

cc @fxmarty as the author of the benchmark

fxmarty commented 2 weeks ago

Hi @yuqie, thank you. What happens after launching

text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
    --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
    -b 1 -b 2 

in the second terminal within the container?

You should have a graphic benchmark like https://youtu.be/jlMAX2Oaht0?t=198 at this point.