huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

Upgrade SynapseAI version to 1.17.0 #208

Closed. yuanwu2017 closed this 3 months ago

yuanwu2017 commented 3 months ago

What does this PR do?

Upgrade the SynapseAI version to 1.17.0. Known issue: still needs to switch to the official OH release.

ci_09082024 test (an endpoint sanity-check request is sketched after the list below):

  1. On 1 Gaudi/Gaudi2 card: model=meta-llama/Llama-2-7b-hf

     docker run --rm -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048

  2. On 1 Gaudi/Gaudi2 card using PyTorch eager mode with torch.compile: model=meta-llama/Llama-2-7b-hf

     docker run -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_LAZY_MODE=0 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048


  3. On 8 Gaudi/Gaudi2 cards: model=meta-llama/Llama-2-70b-hf

docker run --rm -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --sharded true --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048

  4. Llama 7b BF16 on 1 Gaudi2 card: model=meta-llama/Llama-2-7b-chat-hf
docker run --rm -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e HF_HUB_ENABLE_HF_TRANSFER=1 \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e PREFILL_BATCH_BUCKET_SIZE=1 \
   -e BATCH_BUCKET_SIZE=16 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 \
   --model-id $model \
   --max-input-tokens 1024 \
   --max-batch-prefill-tokens 4096 \
   --max-total-tokens 2048 \
   --max-batch-size 16
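For a quick sanity check of any of the servers launched above, a request along these lines can be sent once the model has finished loading (the prompt and max_new_tokens value are illustrative; adjust the port to match the -p mapping of the run being tested):

# basic TGI /generate request against the container started above
curl http://127.0.0.1:8083/generate \
   -X POST \
   -H 'Content-Type: application/json' \
   -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'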

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@regisss @libinta @mandy-li

mandy-li commented 3 months ago

@yuanwu2017, just a reminder: when you upgrade package versions, please also refer to https://github.com/huggingface/tgi-gaudi/pull/206 to make sure the CVE issues are covered as well.

yuanwu2017 commented 3 months ago

@yuanwu2017, just a reminder: when you upgrade package versions, please also refer to #206 to make sure the CVE issues are covered as well.

Can you tell me how to run the CVE scan? Thanks.

mandy-li commented 3 months ago

@yuanwu2017, just a reminder: when you upgrade package versions, please also refer to #206 to make sure the CVE issues are covered as well.

Can you tell me how to run the CVE scan? Thanks.

The CVE scan is done by another Intel team. They found that some of the Python package versions have known CVE issues.
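For a rough local check ahead of that, a container image scanner such as Trivy can be pointed at the test image (a sketch only, not that team's process; the tag below is the image built for this PR):

# scan the locally built image for known CVEs (requires trivy to be installed)
trivy image tgi-yuanwu:1.17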

yuanwu2017 commented 3 months ago

Updated the ci_09082024 test results.

yuanwu2017 commented 3 months ago

llava-next test Command:

model=llava-hf/llava-v1.6-mistral-7b-hf
volume=/home/yuanwu/workspace/data
docker run -it -p 8083:80 \
   -v ~/workspace/data:/data \
   -v ~/workspace/tmp:/tmp \
   -v ~/workspace:/workspace \
   --runtime=habana \
   --name optimum-1.17 \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e HABANA_VISIBLE_DEVICES=1,2,4,5 \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 --model-id $model \
   --max-input-tokens 4096 \
   --max-total-tokens 8192 \
   --max-batch-prefill-tokens 16384

client source code:

from huggingface_hub import InferenceClient
import base64
import requests
import io

client = InferenceClient("http://127.0.0.1:8083")

# read image from local file
image_path = "rabbit.png"
#image_path = "llava_v1_5_radar.jpg"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"

tokens = ''
for token in client.text_generation(prompt, max_new_tokens=40, stream=True):
    tokens += token
print(tokens)
tthakkal commented 3 months ago

(quoting the llava-next test command and client source code from the previous comment)

@yuanwu2017 What is the minimum value for --max-input-tokens? 4096 works, but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.

yuanwu2017 commented 3 months ago

The minimum value for --max-input-tokens is 2048. https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/vlm_causal_lm.py#L74

yuanwu2017 commented 3 months ago


@yuanwu2017 What is the minimum value for --max-input-tokens? 4096 works, but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.

Done.

yuanwu2017 commented 3 months ago

Updated the performance data. Command:

docker run -it --rm -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -v ~/workspace:/workspace \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HABANA_VISIBLE_MODULES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e BATCH_BUCKET_SIZE=8 \
   -e PREFILL_BATCH_BUCKET_SIZE=8 \
   -e ENABLE_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17 \
   --model-id meta-llama/Llama-2-13b-chat-hf \
   --max-input-tokens 4096 \
   --max-total-tokens 8192 \
   --max-batch-prefill-tokens 16384 \
   --max-batch-total-tokens 81920 \
   --sharded true --num-shard 8

client: hey -t 0 -m POST -D ./data.json -H "Content-Type: application/json" -c 5 -n 10 http://127.0.0.1:8083/generate
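The contents of data.json are not part of this PR; a payload along these lines matches TGI's /generate schema and can be used to reproduce a similar load (the prompt and parameter values are illustrative, not the ones behind the numbers below):

# illustrative request body for the hey benchmark above (actual file not shown in the PR)
cat > data.json <<'EOF'
{
  "inputs": "What is deep learning?",
  "parameters": {
    "max_new_tokens": 128
  }
}
EOF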

Result: screenshots attached for the first round, the second round, and the 1.13-release branch (for comparison).
yuanwu2017 commented 3 months ago

Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf. So PR1272 must be included in the new OH release.

tthakkal commented 3 months ago

Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf. So PR1272 must be included in the new OH release.

Why not set cache_position only when the model is Mixtral?

yuanwu2017 commented 3 months ago

Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf. So PR1272 must be included in the new OH release.

Why not set cache_position only when the model is Mixtral?

Because optimum-habana already has all the information needed to calculate cache_position, adding this handling to individual models is a bit of a workaround.

tthakkal commented 3 months ago

Reverted the "Fix mixtral model error" patch: it causes an inference error with meta-llama/Llama-2-13b-chat-hf. So PR1272 must be included in the new OH release.

Why not set cache_position only when the model is Mixtral?

Because optimum-habana already has all the information needed to calculate cache_position, adding this handling to individual models is a bit of a workaround.

In optimum-habana, cache_position is not calculated in modeling_mixtral but is set initially here: https://github.com/huggingface/optimum-habana/blob/3e7ff03a54068d7bac8114b510ed546f32d909e6/optimum/habana/transformers/generation/utils.py#L2199. Not sure whether we should follow a similar pattern or let the model calculate it.

yuanwu2017 commented 3 months ago

@regisss @mandy-li @tthakkal Switched to the official OH 1.13.1 release and tested the following models; they all passed. Please help review the patch.

mistralai/Mixtral-8x7B-v0.1
llava-hf/llava-v1.6-mistral-7b-hf
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-70b-hf
meta-llama/Llama-2-13b-chat-hf