huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

low throughput while using TGI-Gaudi on bigcode/starcoderbase-3b on Gaudi2 #166

Open vishnumadhu365 opened 2 months ago

vishnumadhu365 commented 2 months ago

System Info

tgi-gaudi Docker container built from master branch (4fe871ffaaa62f1a203607078e868fcca962b017)
Ubuntu 22.04.3 LTS
Gaudi2
HL-SMI Version: hl-1.15.0-fw-48.2.1.1
Driver Version: 1.15.0-a596ef0
Model: bigcode/starcoderbase-3b


Reproduction

Steps

  1. Docker run

    docker run -it -p 8080:80 -v $volume:/data    --runtime=habana   \
    -e HABANA_VISIBLE_DEVICES=all  \
    -e HUGGING_FACE_HUB_TOKEN=1234  \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none   \
    -e ENABLE_HPU_GRAPH=False   -e BATCH_BUCKET_SIZE=128  \
    -e PREFILL_BATCH_BUCKET_SIZE=4  \
    -e PAD_SEQUENCE_TO_MULTIPLE_OF=128    \
    --cap-add=sys_nice  \
    --ipc=host tgi-gaudi:latest   \
    --model-id $model    \
    --max-input-tokens 568    \
    --max-batch-prefill-tokens 618  \
    --max-total-tokens 614  \
    --max-batch-total-tokens 78592
  2. Measure the performance of the TGI endpoint with the client script in tgi-gaudi/examples (a standalone cross-check sketch is included after the steps)

    python3 run_generation.py \
    --model_id $model \
    --server_address http://localhost:8080 \
    --max_input_length 568 \
    --max_output_length 46 \
    --total_sample_count 1280 \
    --max_concurrent_requests 128

output:

--------------------------------
----- Performance  summary -----
--------------------------------
Throughput: 98.8 tokens/s
Throughput: 2.2 queries/s
--------------------------------
First token latency:
        Median:         54734.41ms
        Average:        52755.73ms
--------------------------------
Output token latency:
        Median:         58.47ms
        Average:        69.58ms
--------------------------------
  3. Run the static benchmark from within the TGI container
    text-generation-benchmark -b 128 -b 64 -b 32 -b 16 -b 8 -b 4 -b 2 -b 1 -s 567 -d 46 -w 5 -r 100 -t bigcode/starcoderbase-3b

output: [benchmark results screenshot]
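
For cross-checking the example script, a minimal standalone client along the following lines can drive the same endpoint directly. This is only a sketch: it assumes the standard TGI /generate REST route exposed by the container from step 1 on localhost:8080, uses the requests library, and the prompt and request counts are placeholders.

    import concurrent.futures
    import statistics
    import time

    import requests

    # Assumptions: TGI container from step 1 listening on localhost:8080 with the
    # standard /generate route; prompt and counts below are placeholders.
    TGI_URL = "http://localhost:8080/generate"
    PROMPT = "def fibonacci(n):"
    NUM_REQUESTS = 128
    MAX_NEW_TOKENS = 46

    def send_request(_):
        payload = {"inputs": PROMPT, "parameters": {"max_new_tokens": MAX_NEW_TOKENS}}
        start = time.perf_counter()
        resp = requests.post(TGI_URL, json=payload, timeout=600)
        resp.raise_for_status()
        return time.perf_counter() - start

    t0 = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=128) as pool:
        latencies = list(pool.map(send_request, range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - t0

    print(f"Requests/s:      {NUM_REQUESTS / elapsed:.2f}")
    print(f"Tokens/s (max):  {NUM_REQUESTS * MAX_NEW_TOKENS / elapsed:.2f}")
    print(f"Median latency:  {statistics.median(latencies) * 1000:.1f} ms")

If the numbers reported here track the run_generation.py output, the gap to the static benchmark is on the server side rather than in the client.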

Expected behavior

Issue: Throughput numbers when hitting the TGI endpoint are far below the static benchmark throughput. Server logs suggest there is an issue with continuous batching on Gaudi2.

# Test: sending 5 requests to the Gaudi2 TGI endpoint. Note that the queue time increases for each subsequent inference request.
Req1: total_time="3.076226394s" validation_time="449.063µs" queue_time="110.028µs" inference_time="3.075667684s" time_per_token="66.86234ms"
Req2: total_time="3.076173218s" validation_time="3.502745ms" queue_time="70.64658ms" inference_time="3.002024052s" time_per_token="65.261392ms"
Req3: total_time="3.132718439s" validation_time="786.778µs" queue_time="201.632982ms" inference_time="2.930298993s" time_per_token="63.702152ms"
Req4: total_time="3.197355097s" validation_time="1.277488ms" queue_time="331.050014ms" inference_time="2.865027991s" time_per_token="62.283217ms"
Req5: total_time="3.259123777s" validation_time="924.292µs" queue_time="459.104331ms" inference_time="2.799095535s" time_per_token="60.849902ms" 
# Same test as above, this time sending 5 requests to a single NVIDIA T4 card running the TGI 2.0.4 Docker image. Note that the queue time is more or less constant after the first request, indicating effective continuous batching.

Req1: total_time="1.513475533s" validation_time="1.069695ms" queue_time="52.017µs" inference_time="1.512354236s" time_per_token="32.877266ms"
Req2: total_time="1.507096983s" validation_time="799.031µs" queue_time="54.518157ms" inference_time="1.451780025s" time_per_token="31.560435ms"
Req3: total_time="1.502753387s" validation_time="418.679µs" queue_time="50.525381ms" inference_time="1.451809782s" time_per_token="31.561082ms"
Req4: total_time="1.507244713s" validation_time="841.468µs" queue_time="54.479958ms" inference_time="1.451923498s" time_per_token="31.563554ms"
Req5: total_time="1.503086631s" validation_time="828.972µs" queue_time="50.359691ms" inference_time="1.451898309s" time_per_token="31.563006ms"

Expected result: Gaudi2 throughput numbers on the TGI endpoint (with continuous batching) should be on par with or better than the static benchmark throughput.

regisss commented 1 month ago

Not sure why the queue time is increasing, any idea @kdamaszk ?