huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

upgrade to SynapseAI 1.18 #227

Closed. yuanwu2017 closed this PR 4 weeks ago.

yuanwu2017 commented 1 month ago

What does this PR do?

This PR upgrades to SynapseAI 1.18. SynapseAI 1.18 is not released yet; tested against optimum-habana v1.14.1.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

yuanwu2017 commented 1 month ago

@mandy-li @tthakkal

yuanwu2017 commented 1 month ago

@regisss ci_07102024 uses Optimum 1.17.1, because Optimum only supports transformers<4.45. When will an Optimum release supporting transformers>4.45 be available?

regisss commented 1 month ago

> @regisss ci_07102024 uses Optimum 1.17.1, because Optimum only supports transformers<4.45. When will an Optimum release supporting transformers>4.45 be available?

Very soon, but there is no exact ETA. Is this an issue for TGI? I don't think we need the latest version of Optimum here, do we?

mandy-li commented 1 month ago

@yuanwu2017 , thanks for the PR. Can you pls do a quick test to see if this OH tag works with Synapse 1.17? Synapse 1.18 will be ready soon

yuanwu2017 commented 1 month ago

> @regisss ci_07102024 uses Optimum 1.17.1, because Optimum only supports transformers<4.45. When will an Optimum release supporting transformers>4.45 be available?
>
> Very soon, but there is no exact ETA. Is this an issue for TGI? I don't think we need the latest version of Optimum here, do we?

If Optimum can be released before the 1.18 release, tgi-gaudi can support AutoGPTQ in the 1.18 release. The following patch is not included in optimum 1.17 or 1.22: https://github.com/huggingface/optimum/pull/2003

yuanwu2017 commented 1 month ago

> @yuanwu2017 , thanks for the PR. Can you pls do a quick test to see if this OH tag works with Synapse 1.17? Synapse 1.18 will be ready soon

Ok.

yuanwu2017 commented 1 month ago

run_example.txt (attached): I tested the examples except for the FP8 examples. All of them passed. @mandy-li @tthakkal

regisss commented 1 month ago

@yuanwu2017 New Optimum release: https://github.com/huggingface/optimum/releases/tag/v1.23.0

mandy-li commented 1 month ago

@yuanwu2017 , why is DeepSpeed not upgraded to 1.18? Please use the official release version of Synapse 1.18 since it was just released: https://vault.habana.ai/ui/repos/tree/General/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0

tthakkal commented 1 month ago

> @yuanwu2017 , why is DeepSpeed not upgraded to 1.18? Please use the official release version of Synapse 1.18 since it was just released: https://vault.habana.ai/ui/repos/tree/General/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0

@mandy-li I updated the Dockerfile to use the official Docker image and DeepSpeed version; will do some testing.

yuanwu2017 commented 1 month ago

I tested the GPTQ model. It works fine.

model=TheBloke/Llama-2-7B-Chat-GPTQ

docker run -p $port:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e MAX_TOTAL_TOKENS=2048 \
   -e PREFILL_BATCH_BUCKET_SIZE=2 \
   -e BATCH_BUCKET_SIZE=32 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   $image \
   --model-id $model \
   --max-input-length 1024 --max-total-tokens 2048 \
   --max-batch-prefill-tokens 2048 --max-batch-total-tokens 131072 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
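As a quick sanity check once the container is up (not part of the PR; the prompt and token budget below are arbitrary), TGI's /generate endpoint can be queried directly:

curl -s http://localhost:$port/generate \
   -X POST \
   -H 'Content-Type: application/json' \
   -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'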

mandy-li commented 1 month ago

@yuanwu2017 , we are seeing a performance regression for a lot of models with this PR. Why didn't you test the models you used for the last release, such as Llama2, Llama3, or Llama3.1, and compare with your previous results?

yuanwu2017 commented 1 month ago

Performance test. Server command:

hf_token=$token
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=tgi-yuanwu:1.18
#image=ghcr.io/huggingface/tgi-gaudi:2.0.5
port=8083

model=meta-llama/Llama-2-7b-chat-hf

docker run -p $port:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e MAX_TOTAL_TOKENS=2048 \
   -e PREFILL_BATCH_BUCKET_SIZE=2 \
   -e BATCH_BUCKET_SIZE=32 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   $image \
   --model-id $model \
   --max-input-length 1024 --max-total-tokens 2048 \
   --max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 256

Client command:

cd examples/
 python run_generation.py

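For context only: run_generation.py lives in examples/ and drives the server with many concurrent requests. A rough, hypothetical stand-in for that kind of load (the port, prompt, and request count here are assumptions, not the script's actual parameters) could look like:

# Hypothetical load sketch: fire 32 concurrent /generate requests and time them.
start=$(date +%s)
for i in $(seq 1 32); do
   curl -s http://localhost:8083/generate \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 128}}' \
      > /dev/null &
done
wait
echo "32 requests finished in $(( $(date +%s) - start ))s"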
Test result

FW1.17 ghcr.io/huggingface/tgi-gaudi:2.0.5


----- Performance summary -----

Throughput: 1147.2 tokens/s
Throughput: 2.5 queries/s

First token latency: Median: 88494.87ms Average: 83477.33ms

Output token latency: Median: 25.29ms Average: 27.24ms

hl-1.18.0-fw-53.1.1.1 tgi-yuanwu:1.18


----- Performance summary -----

Throughput: 1157.7 tokens/s
Throughput: 2.4 queries/s

First token latency: Median: 89182.70ms Average: 84333.15ms

Output token latency: Median: 25.32ms Average: 27.04ms

yuanwu2017 commented 1 month ago

meta-llama/Llama-2-70b-chat-hf performance:

model=meta-llama/Llama-2-70b-chat-hf

docker run -p $port:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e MAX_TOTAL_TOKENS=2048 \
   -e BATCH_BUCKET_SIZE=256 \
   -e PREFILL_BATCH_BUCKET_SIZE=4 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   $image \
   --model-id $model \
   --sharded true --num-shard 8 \
   --max-input-length 1024 --max-total-tokens 2048 \
   --max-batch-prefill-tokens 4096 --max-batch-total-tokens 524288 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512

Client command: python run_generation.py --model_id meta-llama/Llama-2-70b-chat-hf

Result: tgi-gaudi 2.0.5 + FW1.17

[screenshot: benchmark results]

tgi-gaudi 2.0.5 + PR227 + FW1.18

[screenshot: benchmark results]

The performance shows a ~30% regression with Llama2-70B on 8 cards.

I replaced PR227's PyTorch and DeepSpeed with the 1.17 versions and ran the benchmark again. The regression is then only about 6%, so the performance regression is caused by the Habana PyTorch. Referring to the earlier single-card performance data, there is almost no regression there, so I think the issue is related to the distributed communication layer. I tried to run the HCCL benchmark, but it failed.

[screenshot: benchmark results]

tthakkal commented 1 month ago

@yuanwu2017 @schoi-habana and I tried to reproduce the performance regression using your commands. We get completely different numbers from yours and don't see a regression.

FW: 1.18 Docker and ds 1.18, PR 227  ----   3614.5 tokens/s
FW: 1.18  TGI 2.0.5                 ----   3557.3 tokens/s
FW: 1.17  TGI 2.0.5                 ----   3603.2 tokens/s

Could you try applying PR https://github.com/huggingface/tgi-gaudi/pull/234 on top of this PR, set TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true at docker run, and check whether that fixes the performance for you?
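For anyone reproducing this, one way to stack PR 234 on top of the PR 227 branch is via GitHub's pull refs (assuming `origin` points at huggingface/tgi-gaudi and the PR 227 branch is already checked out):

git fetch origin pull/234/head:pr-234   # fetch PR 234 into a local branch
git merge pr-234                        # merge it on top of the checked-out PR 227 branch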

yuanwu2017 commented 1 month ago

I ran two rounds of run_generation.py.

FW1.18 + PR227 + PR234 + datasets seed=42        -------- 3420.3  3138.0
FW1.17 + TGI 2.0.5 + PR234 + datasets seed=42    -------- 3651.4  3621.2
FW1.18 + TGI 2.0.5 + PR234 + datasets seed=42    -------- 3418.1  3507.4  3526.7
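For reference, the raw throughput numbers above reduce to per-configuration averages and a relative difference with plain shell arithmetic (this is just bookkeeping on the figures already listed, not an extra benchmark):

echo "3420.3 3138.0"        | awk '{ printf "%.1f\n", ($1+$2)/2 }'     # FW1.18 + PR227 + PR234: ~3279
echo "3651.4 3621.2"        | awk '{ printf "%.1f\n", ($1+$2)/2 }'     # FW1.17 + TGI 2.0.5 + PR234: ~3636
echo "3418.1 3507.4 3526.7" | awk '{ printf "%.1f\n", ($1+$2+$3)/3 }'  # FW1.18 + TGI 2.0.5 + PR234: ~3484
awk 'BEGIN { printf "%.1f%%\n", (1 - 3279.2/3636.3) * 100 }'           # PR227 runs vs. FW1.17 baseline: ~9.8% lower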

yuanwu2017 commented 1 month ago

model=meta-llama/Llama-2-70b-chat-hf: I ran three more rounds of tests with FW1.18 + TGI 2.0.5 + PR227 + PR234 + datasets seed=42. The average throughput is about 3311 tokens/s.

[screenshot: benchmark results]

The performance numbers are close. PR234 makes the jitter and the performance regression smaller.

To run with static shapes, I used run_tgi_benchmark.sh. It sends 32 requests to the TGI server at the same time and, after they return, sends 32 more. I got the following results; the performance is very similar.

FW1.18 + TGI 2.0.5 + PR227 + PR234 + datasets seed=42

[screenshot: benchmark results]

FW1.17 + TGI 2.0.5 + PR234 + datasets seed=42

[screenshot: benchmark results]

mandy-li commented 1 month ago

@schoi-habana , @regisss , please review. We will release TGI-Gaudi 2.0.6 with this PR.