huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

low throughput while using TGI-Gaudi on bigcode/starcoderbase-3b on Gaudi2 #166

Open vishnumadhu365 opened 2 months ago

vishnumadhu365 commented 2 months ago

System Info

tgi-gaudi Docker container built from master branch (4fe871ffaaa62f1a203607078e868fcca962b017)
Ubuntu 22.04.3 LTS
Gaudi2
HL-SMI Version: hl-1.15.0-fw-48.2.1.1
Driver Version: 1.15.0-a596ef0
Model: bigcode/starcoderbase-3b


Reproduction

Steps

  1. Docker run

    docker run -it -p 8080:80 -v $volume:/data    --runtime=habana   \
    -e HABANA_VISIBLE_DEVICES=all  \
    -e HUGGING_FACE_HUB_TOKEN=1234  \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none   \
    -e ENABLE_HPU_GRAPH=False   -e BATCH_BUCKET_SIZE=128  \
    -e PREFILL_BATCH_BUCKET_SIZE=4  \
    -e PAD_SEQUENCE_TO_MULTIPLE_OF=128    \
    --cap-add=sys_nice  \
    --ipc=host tgi-gaudi:latest   \
    --model-id $model    \
    --max-input-tokens 568    \
    --max-batch-prefill-tokens 618  \
    --max-total-tokens 614  \
    --max-batch-total-tokens 78592
  2. Measure the performance of the TGI endpoint with the client script in tgi-gaudi/examples (a standalone cross-check sketch is included after the steps)

    python3 run_generation.py \
    --model_id $model \
    --server_address http://localhost:8080 \
    --max_input_length 568 \
    --max_output_length 46 \
    --total_sample_count 1280 \
    --max_concurrent_requests 128

output:

--------------------------------
----- Performance  summary -----
--------------------------------
Throughput: 98.8 tokens/s
Throughput: 2.2 queries/s
--------------------------------
First token latency:
        Median:         54734.41ms
        Average:        52755.73ms
--------------------------------
Output token latency:
        Median:         58.47ms
        Average:        69.58ms
--------------------------------
  3. Run the static benchmark from within the TGI container
    text-generation-benchmark -b 128 -b 64 -b 32 -b 16 -b 8 -b 4 -b 2 -b 1 -s 567 -d 46 -w 5 -r 100 -t bigcode/starcoderbase-3b

output: [benchmark results screenshot]
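
For cross-checking the example script, a minimal standalone client along the following lines can drive the same endpoint directly. This is only a sketch: it assumes the standard TGI /generate REST route exposed by the container from step 1 on localhost:8080, uses the requests library, and the prompt and request counts are placeholders.

    import concurrent.futures
    import statistics
    import time

    import requests

    # Assumptions: TGI container from step 1 listening on localhost:8080 with the
    # standard /generate route; prompt and counts below are placeholders.
    TGI_URL = "http://localhost:8080/generate"
    PROMPT = "def fibonacci(n):"
    NUM_REQUESTS = 128
    MAX_NEW_TOKENS = 46

    def send_request(_):
        payload = {"inputs": PROMPT, "parameters": {"max_new_tokens": MAX_NEW_TOKENS}}
        start = time.perf_counter()
        resp = requests.post(TGI_URL, json=payload, timeout=600)
        resp.raise_for_status()
        return time.perf_counter() - start

    t0 = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=128) as pool:
        latencies = list(pool.map(send_request, range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - t0

    print(f"Requests/s:      {NUM_REQUESTS / elapsed:.2f}")
    print(f"Tokens/s (max):  {NUM_REQUESTS * MAX_NEW_TOKENS / elapsed:.2f}")
    print(f"Median latency:  {statistics.median(latencies) * 1000:.1f} ms")

If the numbers reported here track the run_generation.py output, the gap to the static benchmark is on the server side rather than in the client.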

Expected behavior

Issue: Throughput numbers when hitting the TGI endpoint are far below the static benchmark throughput. Server logs suggest there is an issue with continuous batching on Gaudi2.

# Test: sending 5 requests to the Gaudi2 TGI endpoint. Note that the queue time increases for each subsequent inference request.
Req1: total_time="3.076226394s" validation_time="449.063µs" queue_time="110.028µs" inference_time="3.075667684s" time_per_token="66.86234ms"
Req2: total_time="3.076173218s" validation_time="3.502745ms" queue_time="70.64658ms" inference_time="3.002024052s" time_per_token="65.261392ms"
Req3: total_time="3.132718439s" validation_time="786.778µs" queue_time="201.632982ms" inference_time="2.930298993s" time_per_token="63.702152ms"
Req4: total_time="3.197355097s" validation_time="1.277488ms" queue_time="331.050014ms" inference_time="2.865027991s" time_per_token="62.283217ms"
Req5: total_time="3.259123777s" validation_time="924.292µs" queue_time="459.104331ms" inference_time="2.799095535s" time_per_token="60.849902ms" 
# Same test as above, this time sending 5 requests to a single NVIDIA T4 card running the TGI 2.0.4 Docker image. Note that the queue time is more or less constant after the first request, indicating effective continuous batching.

Req1: total_time="1.513475533s" validation_time="1.069695ms" queue_time="52.017µs" inference_time="1.512354236s" time_per_token="32.877266ms"
Req2: total_time="1.507096983s" validation_time="799.031µs" queue_time="54.518157ms" inference_time="1.451780025s" time_per_token="31.560435ms"
Req3: total_time="1.502753387s" validation_time="418.679µs" queue_time="50.525381ms" inference_time="1.451809782s" time_per_token="31.561082ms"
Req4: total_time="1.507244713s" validation_time="841.468µs" queue_time="54.479958ms" inference_time="1.451923498s" time_per_token="31.563554ms"
Req5: total_time="1.503086631s" validation_time="828.972µs" queue_time="50.359691ms" inference_time="1.451898309s" time_per_token="31.563006ms"

Expected result: Gaudi2 throughput numbers on the TGI endpoint (with continuous batching) should be on par with or better than the static benchmark throughput.

regisss commented 1 month ago

Not sure why the queue time is increasing, any idea @kdamaszk ?