Closed: yuanwu2017 closed this PR 3 months ago.
@yuanwu2017, just a reminder: when you upgrade package versions, please also refer to https://github.com/huggingface/tgi-gaudi/pull/206 to make sure CVE problems are covered as well.
Can you tell me how to run the CVE scan? Thanks.
The CVE scan is done by another Intel team. They found that those versions of the Python packages have CVE issues.
Update the ci_09082024 test result.
llava-next test Command:
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=/home/yuanwu/workspace/data
docker run -it -p 8083:80 \
-v ~/workspace/data:/data \
-v ~/workspace/tmp:/tmp \
-v ~/workspace:/workspace \
--runtime=habana \
--name optimum-1.17 \
-e http_proxy=${http_proxy} -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HABANA_VISIBLE_DEVICES=1,2,4,5 \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--ipc=host \
tgi-yuanwu:1.17 --model-id $model \
--max-input-tokens 4096 \
--max-total-tokens 8192 \
--max-batch-prefill-tokens 16384
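Before running the client below, a quick way to confirm the server came up (a hedged sketch, assuming the container above is running and port 8083 is mapped as shown):
import requests

# Sanity check against the TGI server started above; /info and /health are
# standard TGI endpoints. Adjust the host/port if the mapping differs.
info = requests.get("http://127.0.0.1:8083/info").json()
print(info.get("model_id"))
assert requests.get("http://127.0.0.1:8083/health").status_code == 200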
client source code:
from huggingface_hub import InferenceClient
import base64
import requests
import io
client = InferenceClient("http://127.0.0.1:8083")
# read image from local file
image_path = "rabbit.png"
#image_path = "llava_v1_5_radar.jpg"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")
image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"
tokens = ''
for token in client.text_generation(prompt, max_new_tokens=40, stream=True):
    tokens += token
print(tokens)
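For reference, the same prompt can also be sent in one shot to TGI's /generate endpoint with requests (which the snippet above imports but never uses). This is a hedged sketch that reuses the prompt variable built above; it is not part of the original test:
import requests

# One-shot (non-streaming) request to the /generate endpoint; `prompt` is the
# image-plus-question string built in the client code above.
resp = requests.post(
    "http://127.0.0.1:8083/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 40}},
)
resp.raise_for_status()
print(resp.json()["generated_text"])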
@yuanwu2017 what is the minimum value for --max-input-tokens? 4096 works but any value less than that errors out. The example input image in the request has ~3000 tokens. We should probably mention in our README that the minimum value for --max-input-tokens is 4096.
The minimum value for --max-input-tokens is 2048. https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/vlm_causal_lm.py#L74
Done.
Update the performance data.
command:
docker run -it --rm -p 8083:80 \
--runtime=habana \
-v $volume:/data \
-v ~/workspace:/workspace \
-e HUGGING_FACE_HUB_TOKEN=$hf_token \
-e http_proxy=${http_proxy} -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HABANA_VISIBLE_DEVICES=all \
-e HABANA_VISIBLE_MODULES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e BATCH_BUCKET_SIZE=8 \
-e PREFILL_BATCH_BUCKET_SIZE=8 \
-e ENABLE_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
--cap-add=sys_nice \
--ipc=host \
tgi-yuanwu:1.17 \
--model-id meta-llama/Llama-2-13b-chat-hf \
--max-input-tokens 4096 \
--max-total-tokens 8192 \
--max-batch-prefill-tokens 16384 \
--max-batch-total-tokens 81920 \
--sharded true --num-shard 8
client:
hey -t 0 -m POST -D ./data.json -H "Content-Type: application/json" -c 5 -n 10 http://127.0.0.1:8083/generate
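The contents of data.json are not included in this thread; below is a minimal sketch of the kind of /generate payload hey POSTs. The prompt and parameters are placeholders, not the ones used for the numbers above:
import json

# Placeholder payload in the shape expected by TGI's /generate endpoint; the real
# data.json used for the benchmark is not shown in this PR.
payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 128, "do_sample": False},
}

with open("data.json", "w") as f:
    json.dump(payload, f)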
Result: first round, second round, and 1.13-release branch.
Reverted the "Fix mixtral model error" patch. It causes an inference error with meta-llama/Llama-2-13b-chat-hf, so PR1272 must be included in the new OH release.
Why not set cache_position only when the model is Mixtral?
Because optimum-habana already knows all the information and can calculate the cache_position itself. Adding this processing to different models feels a bit like a workaround.
In optimum-habana, cache_position is not calculated in modeling_mixtral but is set initially here: https://github.com/huggingface/optimum-habana/blob/3e7ff03a54068d7bac8114b510ed546f32d909e6/optimum/habana/transformers/generation/utils.py#L2199
Not sure if we need to follow a similar pattern or let the model calculate it.
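For context, here is a rough sketch of the usual cache_position pattern in transformers-style generation (an illustration, not the actual optimum-habana code): the positions for the current step can be derived from how many tokens the KV cache already holds, which is why generation utils can compute it once for all models instead of per model.
import torch

# Positions of the current step's tokens, derived from the KV cache length;
# the values below are placeholders for illustration.
past_seen_tokens = 16                           # tokens already in the KV cache
input_ids = torch.ones(1, 4, dtype=torch.long)  # current step's input ids

cache_position = torch.arange(
    past_seen_tokens, past_seen_tokens + input_ids.shape[1], device=input_ids.device
)
print(cache_position)  # tensor([16, 17, 18, 19])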
@regisss @mandy-li @tthakkal Switched to the official OH-1.13.1 release and tested the following models; they all passed. Please help review the patch.
mistralai/Mixtral-8x7B-v0.1
llava-hf/llava-v1.6-mistral-7b-hf
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-70b-hf
meta-llama/Llama-2-13b-chat-hf
What does this PR do?
Upgrade the SynapseAI version to 1.17.0.
Known issues: need to switch to the official OH release.
ci_09082024 test:
1. On 1 Gaudi/Gaudi2 card:
model=meta-llama/Llama-2-7b-hf
docker run --rm -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048
2. On 1 Gaudi/Gaudi2 card using PyTorch eager mode with torch compile:
model=meta-llama/Llama-2-7b-hf
docker run -p 8083:80 -v ~/workspace/data:/data -v ~/workspace/tmp:/tmp -v ~/workspace:/workspace --runtime=habana --name optimum-1.17 -e PT_HPU_LAZY_MODE=0 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=1 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --max-input-tokens 1024 --max-total-tokens 2048
3. On 8 Gaudi/Gaudi2 cards:
model=meta-llama/Llama-2-70b-hf
docker run --rm -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi-yuanwu:1.17 --model-id $model --sharded true --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048
model=meta-llama/Llama-2-7b-chat-hf
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@regisss @libinta @mandy-li