openvinotoolkit / openvino.genai

Run Generative AI models using native OpenVINO C++ API
Apache License 2.0

llama3 perf is low #494

Closed Edward-Lin closed 1 month ago

Edward-Lin commented 3 months ago

I've used https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python to test llama3's perf, which is a little low.

```
python benchmark.py -d GPU -m D:/AIGC/llama/models/Meta-Llama-3-8B-Instruct-ov-fp16/pytorch/dldt/FP16 -n 1 -ic 64 -pf prompt\1024.jsonl -mc=2
OpenVINO Tokenizer version is not compatible with OpenVINO version. Installed OpenVINO version: 2024.0.0, OpenVINO Tokenizers requires . OpenVINO Tokenizers models will not be added during export.
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
C:\ProgramData\anaconda3\envs\env_ov_llm\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: llama
[ INFO ] OV Config={'PERFORMANCE_HINT': 'LATENCY', 'CACHE_DIR': '', 'NUM_STREAMS': '1'}
model_path is D:\AIGC\llama\models\Meta-Llama-3-8B-Instruct-ov-fp16\pytorch\dldt\FP16
framework is ov
model_args is {'prompt': None, 'prompt_file': 'prompt\1024.jsonl', 'infer_count': 64, 'images': None, 'seed': 42, 'mem_consumption': 2, 'batch_size': 1, 'fuse_decoding_strategy': False, 'stateful': None, 'save_prepared_model': None, 'num_beams': 1, 'torch_compile_backend': 'openvino', 'convert_tokenizer': False, 'subsequent': False, 'output_dir': None, 'use_case': 'text_gen', 'config': {'PERFORMANCE_HINT': 'LATENCY', 'CACHE_DIR': '', 'NUM_STREAMS': '1'}, 'model_type': 'decoder', 'model_name': 'llama'}
model_name is llama
ov_torch_backend_device is GPU
[ INFO ] OPENVINO_TORCH_BACKEND_DEVICE=$OPENVINO_TORCH_BACKEND_DEVICE
[ INFO ] Model path=D:\AIGC\llama\models\Meta-Llama-3-8B-Instruct-ov-fp16\pytorch\dldt\FP16, openvino runtime version: 2024.0.0-14509-34caeefd078-releases/2024/0
Compiling the model to GPU ...
[ INFO ] From pretrained time: 183.83s
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[ INFO ] Read prompts from prompt\1024.jsonl
[ INFO ] Num beams: 1, benchmarking iter nums(exclude warm-up): 1, prompt nums: 1
[ INFO ] [warm-up] Input text: [INST] <>In the year 2048, the world was a very different place from what it had been just two decades before. The pace of technological progress had quickened to an almost unimaginable degree, and the changes that had swept through society as a result were nothing short of revolutionary. In many ways, the year 2048 represented the culmination of a long and tumultuous journey that humanity had been on since the dawn of civilization. The great leaps forward in science and technology that had occurred over the course of the previous century had laid the groundwork for a future that was beyond anything anyone could have imagined. One of the most striking aspects of life in 2048 was the degree to which technology had become an integral part of nearly every aspect of daily existence. From the moment people woke up in the morning until they went to bed at night, they were surrounded by devices and systems that were powered by advanced artificial intelligence and machine learning algorithms. In fact, it was hard to find anything in people's lives that wasn't touched by technology in some way. Every aspect of society had been transformed, from the way people communicated with one another to the way they worked, played, and even socialized. And as the years went on, it seemed as though there was no limit to what technology could achieve. Despite all of these advances, however, not everyone was happy with the state of the world in 2048. Some people saw the increasing reliance on technology as a sign that humanity was losing touch with its own humanity, and they worried about the implications of this for the future. Others were more pragmatic, recognizing that while technology had brought many benefits, it also posed new challenges and risks that needed to be addressed. As a result, there was a growing movement of people who were working to ensure that the advances of technology were used in ways that were safe, ethical, and beneficial for everyone. One person who was at the forefront of this movement was a young woman named Maya. Maya was a brilliant and ambitious researcher who had dedicated her life to understanding the implications of emerging technologies like artificial intelligence and biotechnology. She was deeply concerned about the potential risks and unintended consequences of these technologies, and she worked tirelessly to raise awareness about the need for responsible innovation. Maya's work had earned her a reputation as one of the most influential voices in the field of technology and ethics, and she was widely respected for her deep understanding of the issues and her ability to communicate complex ideas in ways that were accessible and engaging. She was also known for her passionate and inspiring speeches, which often left her audiences with a sense of purpose and determination to make the world a better place through their own efforts. One day, Maya received an invitation to speak at a major conference on technology and ethics, which was being held in a large convention center in the heart of the city. The conference was expected to attract thousands of people from all over the world, and there was a great deal of excitement and anticipation about what Maya would say. As she prepared for her speech, Maya knew that she had a big responsibility on her shoulders. She felt a deep sense of obligation to use her platform to inspire others to take action and make a difference in the world, and she was determined to do everything in her power to live up to this responsibility. When the day of the conference arrived, Maya was filled with a mixture of excitement and nerves. She spent hours rehearsing her speech and fine-tuning her ideas, making sure that she had everything just right. Finally, after what felt like an eternity, it was time for her to take the stage. As she stepped up to the podium, Maya could feel the energy of the crowd surging around her. She took a deep breath and began to speak, her voice strong and clear as she outlined the challenges and opportunities facing society in the age of technology. She spoke passionately about the need for responsible innovation and the importance of considering the ethical implications of our actions, and she inspired many people in the audience to take up this cause and make a difference in their own lives. Overall, Maya's speech was a resounding success, and she received countless messages of gratitude and appreciation from those who had heard her speak. She knew that there was still much work to be done, but she felt hopeful about the future and the role that technology could play in creating a better world for all. As Maya left the stage and made her way back to her seat, she couldn't help but feel a sense of pride and accomplishment at what she had just accomplished. She knew that her words had the power to inspire others and make a real difference in the world, and she was grateful for the opportunity to have played a part in this important work. For Maya, the future was full of promise and possibility, and she was determined to continue doing everything in her power to help create a brighter, more ethical world for everyone. Please summary the key messages of this contents[/INST]
Setting pad_token_id to eos_token_id:128001 for open-end generation.
[ INFO ] [warm-up] Input token size: 1003, Output size: 64, Infer count: 64, Tokenization Time: 31.97ms, Detokenization Time: 8.28ms, Generation Time: 92.54s, Latency: 1445.89 ms/token
[ INFO ] [warm-up] First token latency: 25313.54 ms/token, other tokens latency: 1066.21 ms/token, len of tokens: 64 * 1
[ INFO ] [warm-up] First infer latency: 25287.39 ms/infer, other infers latency: 1063.98 ms/infer, inference count: 64
[ INFO ] [warm-up] Max rss memory cost: 18264.07MBytes,
[ INFO ] [warm-up] Result MD5:['18280cfd6989383a2a8756583f1133c7']
[ INFO ] [warm-up] Generated: [INST] <> [... the model first echoes the input prompt verbatim; echo truncated here ...] [/INST] <>Summary: The story is set in the year 2048, where technology has become an integral part of daily life. The protagonist, Maya, is a researcher who is concerned about the potential risks and unintended consequences of emerging technologies like artificial intelligence and biotechnology. She is invited to speak at a major conference
Setting pad_token_id to eos_token_id:128001 for open-end generation.
[ INFO ] [1] Input token size: 1003, Output size: 64, Infer count: 64, Tokenization Time: 3.45ms, Detokenization Time: 0.87ms, Generation Time: 81.46s, Latency: 1272.86 ms/token
[ INFO ] [1] First token latency: 15787.35 ms/token, other tokens latency: 1042.45 ms/token, len of tokens: 64 * 1
[ INFO ] [1] First infer latency: 15785.69 ms/infer, other infers latency: 1040.72 ms/infer, inference count: 64
[ INFO ] [1] Max rss memory cost: 18492.65MBytes,
[ INFO ] [1] Result MD5:['18280cfd6989383a2a8756583f1133c7']
[ INFO ] <<< Warm-up iteration is excluded. >>>
[ INFO ] [Total] Iterations: 1
[ INFO ] [Average] Prompt[0] Input token size: 1003, 1st token latency: 15787.35 ms/token, 2nd tokens latency: 1042.45 ms/token, 2nd tokens throughput: 0.96 tokens/s
```
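For orientation, the reported numbers are internally consistent: the headline latency is just the total generation time divided by the output size, and the first-token (prefill) latency plus 63 decode steps reconstructs the generation time. A quick cross-check in Python, using the values from the [1] iteration above (small differences come from rounding in the log):

```python
# Cross-check of the llm_bench metrics, values taken from the [1] iteration above.
generation_time_s = 81.46   # total generation time
output_tokens = 64          # -ic 64: number of generated tokens

avg_latency_ms = generation_time_s * 1000 / output_tokens
print(f"average latency: {avg_latency_ms:.2f} ms/token")  # ~1272.81; log shows 1272.86

# First-token latency (prefill over the 1003-token prompt) dominates the total:
first_token_ms = 15787.35
other_tokens_ms = 1042.45
total_ms = first_token_ms + (output_tokens - 1) * other_tokens_ms
print(f"reconstructed total: {total_ms / 1000:.2f} s")    # ~81.46 s, matching the log
```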

avitial commented 3 months ago

@Edward-Lin can you share more details about the hardware you are using to run your test? What kind of performance were you expecting to get on this platform with such a model? If possible, please try the latest OpenVINO 2024.2 version and make sure you have the latest graphics driver as well, and see if that helps improve things on your side.

Performance varies depending on the platform used; I suggest looking at the Performance Benchmarks page for perf insights on a selection of Large Language Models running on an Intel® Core™ Ultra 7-165H based system. Hope this helps.
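One quick way to check whether the runtime and tokenizers packages actually line up (the "OpenVINO Tokenizer version is not compatible" warning in the log above suggests they don't) is to compare the installed package versions. A minimal standard-library sketch:

```python
# Compare the installed versions of the two packages that must match;
# a mismatch triggers the tokenizers warning seen in the log above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("openvino", "openvino-tokenizers"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```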

Edward-Lin commented 3 months ago
  1. First of all, the page https://docs.openvino.ai/2024/about-openvino/performance-benchmarks/generative-ai-performance.html cannot be opened (404 error).
  2. I've updated everything to 2024.2, but I don't know how to read the perf data below. For example, which is the 1st token latency, and which is the average latency of the 2nd and later tokens?

```
(env_ov_llm) C:\AIGC\openvino\openvino.genai\llm_bench\python>python benchmark.py -d GPU -m C:/AIGC/openvino/models/Meta-Llama-3-8B-Instruct-ov -n 3 -ic 64 -pf prompt\32.jsonl -mc=2 --torch_compile_backend openvino
OpenVINO Tokenizer version is not compatible with OpenVINO version. Installed OpenVINO version: 2024.2.0, OpenVINO Tokenizers requires . OpenVINO Tokenizers models will not be added during export.
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
C:\ProgramData\anaconda3\envs\env_ov_llm\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: llama
[ INFO ] OV Config={'CACHE_DIR': ''}
[ INFO ] OPENVINO_TORCH_BACKEND_DEVICE=$OPENVINO_TORCH_BACKEND_DEVICE
[ INFO ] Model path=C:\AIGC\openvino\models\Meta-Llama-3-8B-Instruct-ov, openvino runtime version: 2024.2.0-15519-5c0f38f83f6-releases/2024/2
Compiling the model to GPU ...
[ WARNING ] The minimum version of transformers to get 1st and 2nd tokens latency of greedy search is: 4.40.0
[ INFO ] From pretrained time: 12.02s
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[ INFO ] Read prompts from prompt\32.jsonl
[ INFO ] Num beams: 1, benchmarking iter nums(exclude warm-up): 3, prompt nums: 1, prompt idx: [0]
[ INFO ] [warm-up] Input text: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. please continue this story
[ INFO ] [warm-up] Input token size: 36, Output size: 64, Infer count: 0, Tokenization Time: 1.01ms, Detokenization Time: 0.64ms, Generation Time: 4.50s, Latency: 70.37 ms/token
```

Thanks,

Edward-Lin commented 3 months ago

My HW is an Intel Core Ultra 9 185H with 32 GB of RAM.

aoke79 commented 2 months ago

How do I get the 1st token latency from logs like the one below?

```
[ INFO ] [warm-up] Input token size: 36, Output size: 64, Infer count: 0, Tokenization Time: 1.01ms, Detokenization Time: 0.64ms, Generation Time: 4.50s, Latency: 70.37 ms/token
```

peterchen-intel commented 2 months ago

Performance Benchmarks: https://docs.openvino.ai/2024/about-openvino/performance-benchmarks/generative-ai-performance.html

@aoke79 @Edward-Lin It sounds like the dependency versions are mismatched. Can you try the following branch and install the verified requirements_2024.2.txt (with fixed versions for the dependencies)?

https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/2/llm_bench/python
https://github.com/openvinotoolkit/openvino.genai/blob/releases/2024/2/llm_bench/python/requirements_2024.2.txt

The logs should then look like the following, which shows the 1st and 2nd token latency:

```
[2024-07-05T11:31:28.051Z] [ INFO ] [warm-up] Input text: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to amazing places and meet new people, and have fun
[2024-07-05T11:32:06.830Z] [ INFO ] [warm-up] Input token size: 32, Output size: 128, Infer count: 128, Tokenization Time: 1.20ms, Detokenization Time: 0.33ms, Generation Time: 35.83s, Latency: 279.96 ms/token
[2024-07-05T11:32:06.830Z] [ INFO ] [warm-up] First token latency: 437.89 ms/token, other tokens latency: 278.68 ms/token, len of tokens: 128 * 1
[2024-07-05T11:32:06.830Z] [ INFO ] [warm-up] First infer latency: 435.89 ms/infer, other infers latency: 277.23 ms/infer, inference count: 128
[2024-07-05T11:32:06.830Z] [ INFO ] [warm-up] Max rss memory cost: 16454.42MBytes,
[2024-07-05T11:32:06.830Z] [ INFO ] [warm-up] Result MD5:['3d14890abd1a98903a656aa2269cd316']
[2024-07-05T11:32:06.830Z] [ INFO ] [warm-up] Generated:
```
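To spell out the mapping: the `First token latency` value is the prefill (1st token) cost, and `other tokens latency` is the average decode (2nd and later tokens) cost. A minimal sketch that pulls both out of a saved log, assuming the line format shown above (the `benchmark.log` file name is hypothetical):

```python
# Extract 1st/2nd token latencies from an llm_bench log, assuming the
# "First token latency: X ms/token, other tokens latency: Y ms/token" format above.
import re

PATTERN = re.compile(
    r"First token latency:\s*([\d.]+)\s*ms/token,\s*"
    r"other tokens latency:\s*([\d.]+)\s*ms/token"
)

def token_latencies(log_text: str) -> list[tuple[float, float]]:
    """Return (first_token_ms, other_tokens_ms) pairs, one per iteration."""
    return [(float(m.group(1)), float(m.group(2))) for m in PATTERN.finditer(log_text)]

with open("benchmark.log", encoding="utf-8") as f:  # hypothetical log file name
    for first_ms, other_ms in token_latencies(f.read()):
        print(f"1st token: {first_ms} ms, 2nd+ tokens: {other_ms} ms "
              f"(~{1000 / other_ms:.2f} tokens/s)")
```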
aoke79 commented 2 months ago

```
(env_ov_genai) c:\AIGC\openvino\openvino.genai\llm_bench\python>python benchmark.py -d GPU -m C:\AIGC\openvino\models\TinyLlama-1.1B-Chat-v1.0\FP16 -n 1 -ic 64 -pf prompt\32.jsonl -mc=2 --torch_compile_backend openvino
C:\ProgramData\anaconda3\envs\env_ov_genai\lib\site-packages\diffusers\models\transformers\transformer_2d.py:34: FutureWarning: Transformer2DModelOutput is deprecated and will be removed in version 1.0.0. Importing Transformer2DModelOutput from diffusers.models.transformer_2d is deprecated and this will be removed in a future version. Please use from diffusers.models.modeling_outputs import Transformer2DModelOutput, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: TinyLlama-1.1B-Chat-v1.0
[ INFO ] OV Config={'CACHE_DIR': ''}
[ INFO ] OPENVINO_TORCH_BACKEND_DEVICE=$OPENVINO_TORCH_BACKEND_DEVICE
[ INFO ] Model path=C:\AIGC\openvino\models\TinyLlama-1.1B-Chat-v1.0\FP16, openvino runtime version: 2024.2.0-15519-5c0f38f83f6-releases/2024/2
Compiling the model to GPU ...
[ INFO ] From pretrained time: 14.59s
[ INFO ] Read prompts from prompt\32.jsonl
[ INFO ] Num beams: 1, benchmarking iter nums(exclude warm-up): 1, prompt nums: 1, prompt idx: [0]
[ INFO ] [warm-up] Input text: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. please continue this story
[ INFO ] [warm-up] Input token size: 37, Output size: 64, Infer count: 64, Tokenization Time: 0.98ms, Detokenization Time: 0.28ms, Generation Time: 2.40s, Latency: 37.43 ms/token
[ INFO ] [warm-up] First token latency: 95.90 ms/token, other tokens latency: 36.48 ms/token, len of tokens: 64 * 1
[ INFO ] [warm-up] First infer latency: 94.79 ms/infer, other infers latency: 35.90 ms/infer, inference count: 64
[ INFO ] [warm-up] Max rss memory cost: 5378.87MBytes,
[ INFO ] [warm-up] Result MD5:['2330fd69d05be0bd4498f4db879a9f39']
[ INFO ] [warm-up] Generated: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. please continue this story. <|assistant|> Once upon a time, there was a little girl named Lily who loved to explore the world around her. She would often go on adventures with her best friend, a stuffed animal named Max. One day, Lily decided to go on a treasure
[ INFO ] [1] Input token size: 37, Output size: 64, Infer count: 64, Tokenization Time: 0.22ms, Detokenization Time: 0.17ms, Generation Time: 2.34s, Latency: 36.58 ms/token
[ INFO ] [1] First token latency: 76.22 ms/token, other tokens latency: 35.93 ms/token, len of tokens: 64 * 1
[ INFO ] [1] First infer latency: 75.70 ms/infer, other infers latency: 35.33 ms/infer, inference count: 64
[ INFO ] [1] Max rss memory cost: 5384.43MBytes,
[ INFO ] [1] Result MD5:['2330fd69d05be0bd4498f4db879a9f39']
[ INFO ] <<< Warm-up iteration is excluded. >>>
[ INFO ] [Total] Iterations: 1
[ INFO ] [Average] Prompt[0] Input token size: 37, 1st token latency: 76.22 ms/token, 2nd tokens latency: 35.93 ms/token, 2nd tokens throughput: 27.83 tokens/s
```
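Note that the summary line's "2nd tokens throughput" is simply the reciprocal of the 2nd-token latency, so the two always move together; checking with the numbers from the run above:

```python
# "2nd tokens throughput" is the reciprocal of the 2nd-token latency
# (values taken from the TinyLlama run above).
second_token_latency_ms = 35.93
throughput = 1000 / second_token_latency_ms
print(f"{throughput:.2f} tokens/s")  # 27.83, matching the summary line
```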

peterchen-intel commented 2 months ago

@Edward-Lin What is the next step? Are the numbers below in line with what you expect? If not, please create a ticket to request the performance targets.

```
[ INFO ] [1] First infer latency: 75.70 ms/infer, other infers latency: 35.33 ms/infer, inference count: 64
```

avitial commented 1 month ago

Closing this, I hope previous responses were sufficient to help you proceed. Feel free to reopen and ask additional questions related to this topic.