yuanwu2017 closed this pull request 4 weeks ago.
@mandy-li @tthakkal
@regisss ci_07102024 uses Optimum 1.17.1, because that version of Optimum only supports transformers<4.45. When will an Optimum release that supports transformers>4.45 be available?
Very soon, but there is no exact ETA. Is this an issue for TGI? I don't think we need the latest version of Optimum here, no?
@yuanwu2017, thanks for the PR. Can you please do a quick test to see if this OH tag works with Synapse 1.17? Synapse 1.18 will be ready soon.
> @regisss ci_07102024 uses Optimum 1.17.1, because that version of Optimum only supports transformers<4.45. When will an Optimum release that supports transformers>4.45 be available?
> Very soon, but there is no exact ETA. Is this an issue for TGI? I don't think we need the latest version of Optimum here, no?
If Optimum can be released before the Synapse 1.18 release, tgi-gaudi can support AutoGPTQ in the 1.18 release. The following patch is not included in Optimum 1.17 or 1.22: https://github.com/huggingface/optimum/pull/2003
> @yuanwu2017, thanks for the PR. Can you please do a quick test to see if this OH tag works with Synapse 1.17? Synapse 1.18 will be ready soon.
Ok.
run_example.txt I tested all the examples except the FP8 ones, and all of them passed. @mandy-li @tthakkal
@yuanwu2017 New Optimum release: https://github.com/huggingface/optimum/releases/tag/v1.23.0
@yuanwu2017, why was DeepSpeed not upgraded to 1.18? Please use the official release version of Synapse 1.18, since it was just released: https://vault.habana.ai/ui/repos/tree/General/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
@mandy-li I updated the Dockerfile to use the official Docker image and DeepSpeed version; I will do some testing.
I tested the GPTQ model and it was OK; a quick smoke-test request is sketched after the command below.
model=TheBloke/Llama-2-7B-Chat-GPTQ
docker run -p $port:80 \
--runtime=habana \
-v $volume:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e HUGGING_FACE_HUB_TOKEN=$hf_token \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
-e MAX_TOTAL_TOKENS=2048 \
-e PREFILL_BATCH_BUCKET_SIZE=2 \
-e BATCH_BUCKET_SIZE=32 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
--cap-add=sys_nice \
--ipc=host \
$image \
--model-id $model \
--max-input-length 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-total-tokens 131072 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
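For reference, a minimal smoke test of the server started above can go through TGI's standard /generate REST endpoint. The sketch below assumes the container is up and mapped to port 8083 on localhost (adjust to your $port); the prompt text is only an example.

import requests

# Query the TGI /generate endpoint of the container started above.
# Assumption: the server is reachable on localhost at the port mapped via -p $port:80.
resp = requests.post(
    "http://localhost:8083/generate",  # replace 8083 with your $port
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 64, "do_sample": False},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["generated_text"])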
@yuanwu2017, we are seeing a performance regression for a lot of models with this PR. Why didn't you test the models you used for the last release, such as Llama2, Llama3, or Llama3.1, and compare with your previous results?
Performance test
Server command:
hf_token=$token
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=tgi-yuanwu:1.18
#image=ghcr.io/huggingface/tgi-gaudi:2.0.5
port=8083
model=meta-llama/Llama-2-7b-chat-hf
docker run -p $port:80 \
--runtime=habana \
-v $volume:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e HUGGING_FACE_HUB_TOKEN=$hf_token \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e http_proxy=${http_proxy} -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
-e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
-e MAX_TOTAL_TOKENS=2048 \
-e PREFILL_BATCH_BUCKET_SIZE=2 \
-e BATCH_BUCKET_SIZE=32 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
--cap-add=sys_nice \
--ipc=host \
$image \
--model-id $model \
--max-input-length 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 256
Client command:
cd examples/
python run_generation.py
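run_generation.py is the benchmark client shipped in the repo's examples/ directory. As a rough illustration only (not the actual script), a concurrent throughput measurement against TGI's /generate endpoint could look like the sketch below; the URL, prompt text, and token accounting are assumptions.

import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8083/generate"  # adjust to your $port
MAX_NEW_TOKENS = 128
CONCURRENCY = 64
NUM_REQUESTS = 256

def one_request(i):
    # Standard TGI /generate payload. Since EOS is ignored server-side via
    # TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true, we assume the full
    # MAX_NEW_TOKENS budget is generated for every request.
    r = requests.post(
        URL,
        json={
            "inputs": f"Request {i}: write a short note about Gaudi.",
            "parameters": {"max_new_tokens": MAX_NEW_TOKENS, "do_sample": False},
        },
        timeout=600,
    )
    r.raise_for_status()
    return MAX_NEW_TOKENS

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total_new_tokens = sum(pool.map(one_request, range(NUM_REQUESTS)))
elapsed = time.time() - start
print(f"{total_new_tokens / elapsed:.1f} new tokens/s across {NUM_REQUESTS} requests")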
Test results:
FW1.17, ghcr.io/huggingface/tgi-gaudi:2.0.5
hl-1.18.0-fw-53.1.1.1, tgi-yuanwu:1.18
meta-llama/Llama-2-70b-chat-hf performance:
model=meta-llama/Llama-2-70b-chat-hf
docker run -p $port:80 \
--runtime=habana \
-v $volume:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e HUGGING_FACE_HUB_TOKEN=$hf_token \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e http_proxy=${http_proxy} -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
-e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
--cap-add=sys_nice \
--ipc=host \
$image \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-length 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-total-tokens 524288 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
Client command:
python run_generation.py --model_id meta-llama/Llama-2-70b-chat-hf
Result:
tgi-gaudi 2.0.5 + FW1.17
tgi-gaudi 2.0.5 + PR227 + FW1.18
There is a 30% performance regression with Llama2-70B on 8 cards.
I replaced PR227's PyTorch and DeepSpeed with the 1.17 versions and ran the benchmark. The regression is then only 6%, so the performance regression is caused by the Habana PyTorch. Referring to the previous single-card performance data, there is almost no regression there, so I think the issue is related to the distributed communication layer. I tried to run the HCCL benchmark, but it failed.
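As a possible fallback for the failed HCCL benchmark run, a hedged sketch of a bare all_reduce timing check on HPU is shown below. It assumes the Gaudi PyTorch docker image, the habana_frameworks PyTorch bridge with its initialize_distributed_hpu helper, and a launcher that starts one process per card (e.g. torchrun or mpirun); treat the exact imports and helpers as assumptions to verify against the installed Synapse release.

import time
import torch
import torch.distributed as dist
# Assumption: these Habana modules ship with the Gaudi PyTorch docker image.
import habana_frameworks.torch.core as htcore  # noqa: F401  (registers the hpu device)
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

world_size, rank, local_rank = initialize_distributed_hpu()
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

# 64 MB float32 payload on the local HPU.
payload = torch.ones(16 * 1024 * 1024, dtype=torch.float32, device="hpu")

for _ in range(5):  # warm-up
    dist.all_reduce(payload)
torch.hpu.synchronize()

iters = 50
start = time.time()
for _ in range(iters):
    dist.all_reduce(payload)
torch.hpu.synchronize()
elapsed = time.time() - start

if rank == 0:
    moved_gb = payload.element_size() * payload.nelement() * iters / 1e9
    print(f"all_reduce: {elapsed / iters * 1e3:.2f} ms/iter, ~{moved_gb / elapsed:.1f} GB/s (algorithmic)")

dist.destroy_process_group()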
@yuanwu2017, @schoi-habana and I tried reproducing the performance regression using your commands; we get completely different numbers from yours and do not see a regression.
FW: 1.18 Docker and ds 1.18, PR 227 ---- 3614.5 tokens/s
FW: 1.18 TGI 2.0.5 ---- 3557.3 tokens/s
FW: 1.17 TGI 2.0.5 ---- 3603.2 tokens/s
Could you try applying PR https://github.com/huggingface/tgi-gaudi/pull/234 on top of this PR, set TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true in the docker run command, and check whether that fixes the performance for you?
I ran two rounds of run_generation.py.
FW1.18 + PR227 + PR234 + datasets seed=42 -------- 3420.3, 3138.0
FW1.17 + TGI 2.0.5 + PR234 + datasets seed=42 -------- 3651.4, 3621.2
FW1.18 + TGI 2.0.5 + PR234 + datasets seed=42 -------- 3418.1, 3507.4, 3526.7
With model=meta-llama/Llama-2-70b-chat-hf, I ran three more rounds of tests with FW1.18 + TGI 2.0.5 + PR227 + PR234 + datasets seed=42. The average throughput is about 3311 tokens/s.
The performance numbers are close. PR234 makes the jitter and the performance regression smaller.
In order to run with static shapes, I used run_tgi_benchmark.sh. It sends 32 requests to the TGI server at the same time and, after they return, sends 32 more. I got the following results; the performance is very similar.
FW1.18 + TGI 2.0.5 + PR227 + PR234 + datasets seed=42
FW1.17 + TGI 2.0.5 + PR234 + datasets seed=42
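For clarity, the lock-step pattern described above (send 32 concurrent requests, wait for all of them to return, then send the next 32) is sketched below. This is not the actual run_tgi_benchmark.sh; the URL, round count, and token budget are assumptions.

import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8083/generate"  # adjust to your $port
BATCH = 32
ROUNDS = 8
MAX_NEW_TOKENS = 128

def generate(prompt):
    r = requests.post(
        URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": MAX_NEW_TOKENS}},
        timeout=600,
    )
    r.raise_for_status()

start = time.time()
with ThreadPoolExecutor(max_workers=BATCH) as pool:
    for rnd in range(ROUNDS):
        prompts = [f"Round {rnd}, request {i}" for i in range(BATCH)]
        # pool.map drains the whole batch before the next round is issued
        list(pool.map(generate, prompts))
elapsed = time.time() - start
print(f"{ROUNDS * BATCH * MAX_NEW_TOKENS / elapsed:.1f} tokens/s (assuming full budgets)")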
@schoi-habana, @regisss, please review. We will release TGI-Gaudi 2.0.6 with this PR.
What does this PR do?
It upgrades to SynapseAI 1.18. Currently, SynapseAI 1.18 is not yet released. Tested with optimum-habana v1.14.1.
Fixes # (issue)
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.