intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

Long model loading time for LLaVA (release v1.1.0, model v1.5) #10440

Closed: aoke79 closed this issue 8 months ago

aoke79 commented 8 months ago

Hi, I used LLaVA release v1.1.0 from https://github.com/haotian-liu/LLaVA/releases/tag/v1.1.0 with the v1.5-7b model from https://huggingface.co/liuhaotian/llava-v1.5-7b, followed the sample code at https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/PyTorch-Models/Model/llava, and ran it successfully on an MTL GPU. It's cool, thanks very much. That said, there are a few issues:

  1. The model loading time is a little long: LlavaLlamaForCausalLM.from_pretrained() takes about 60 s on CPU (precision unknown), vision_tower.load_model() takes about 101 s on CPU at FP32, and optimize_model() takes about 18 s on XPU (precision unknown). In total that is about 174 s before the first Q&A (see the timing sketch after the logs below).
  2. V-RAM utilization keeps increasing across multi-round Q&A, which eventually crashes the app.
  3. Some extra information is printed during model loading; here are the logs FYI:

(env_bigdl_mtl) C:\AIGC\bigdl\BigDL-main\python\llm\example\GPU\PyTorch-Models\Model\llava\LLaVA-1.1.1>call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
:: initializing oneAPI environment...
Initializing Visual Studio command-line environment...
Visual Studio version 17.8.4 environment configured.
"C:\Program Files\Microsoft Visual Studio\2022\Community\"
Visual Studio command-line environment initialized for: 'x64'
:  advisor -- latest
:  compiler -- latest
:  dal -- latest
:  debugger -- latest
:  dev-utilities -- latest
:  dnnl -- latest
:  dpcpp-ct -- latest
:  dpl -- latest
:  ipp -- latest
:  ippcp -- latest
:  mkl -- latest
:  tbb -- latest
:  vtune -- latest
:: oneAPI environment initialized ::
C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-03-14 23:27:03,646 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|████████████████████████████████████| 2/2 [00:30<00:00, 15.32s/it]
Some weights of the model checkpoint at openai/clip-vit-large-patch14-336 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.6.mlp.fc2.weight', 'text_model.encoder.layers.0.mlp.fc2.weight', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.2.self_attn.v_proj.bias', 'text_model.encoder.layers.4.layer_norm2.weight', 'text_model.encoder.layers.4.mlp.fc2.bias', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.3.mlp.fc1.weight', 'text_model.encoder.layers.9.self_attn.q_proj.weight', 'text_model.embeddings.position_embedding.weight', 'text_model.encoder.layers.1.mlp.fc1.bias', 'text_model.encoder.layers.3.layer_norm1.bias', 'text_model.encoder.layers.4.self_attn.q_proj.weight', 'text_model.encoder.layers.9.self_attn.v_proj.bias', 'text_model.encoder.layers.9.mlp.fc2.bias', 'text_model.encoder.layers.3.self_attn.out_proj.bias', 'text_model.encoder.layers.2.self_attn.k_proj.bias', 'text_model.encoder.layers.0.self_attn.q_proj.weight', 'text_model.encoder.layers.9.layer_norm1.weight', 'text_model.encoder.layers.2.self_attn.out_proj.bias', 'text_model.final_layer_norm.weight', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.5.self_attn.k_proj.bias', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.2.self_attn.v_proj.weight', 'text_model.encoder.layers.5.layer_norm1.weight', 'text_model.encoder.layers.4.self_attn.k_proj.weight', 'text_model.encoder.layers.2.layer_norm1.weight', 'text_model.encoder.layers.7.self_attn.out_proj.weight', 'text_model.encoder.layers.10.layer_norm1.weight', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.4.self_attn.out_proj.weight', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.7.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.q_proj.bias', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.10.self_attn.k_proj.bias', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.7.layer_norm2.bias',
'text_model.encoder.layers.7.mlp.fc1.weight', 'text_model.encoder.layers.9.self_attn.v_proj.weight', 'text_model.encoder.layers.9.mlp.fc1.weight', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.9.layer_norm1.bias', 'text_model.encoder.layers.8.layer_norm1.weight', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.3.mlp.fc2.weight', 'text_model.encoder.layers.6.layer_norm1.bias', 'text_model.encoder.layers.6.self_attn.k_proj.weight', 'text_model.encoder.layers.6.self_attn.q_proj.weight', 'text_model.encoder.layers.3.mlp.fc1.bias', 'text_model.encoder.layers.6.self_attn.out_proj.bias', 'text_model.encoder.layers.9.self_attn.k_proj.bias', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.5.layer_norm1.bias', 'text_model.encoder.layers.7.self_attn.q_proj.weight', 'text_model.encoder.layers.1.layer_norm2.bias', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.2.mlp.fc2.bias', 'text_model.encoder.layers.1.mlp.fc2.bias', 'text_model.encoder.layers.3.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.k_proj.weight', 'text_model.encoder.layers.4.layer_norm2.bias', 'text_model.encoder.layers.10.mlp.fc1.bias', 'text_model.encoder.layers.4.mlp.fc1.bias', 'text_model.encoder.layers.4.self_attn.q_proj.bias', 'text_model.encoder.layers.10.layer_norm1.bias', 'text_model.encoder.layers.2.layer_norm2.weight', 'text_model.encoder.layers.2.layer_norm2.bias', 'text_model.encoder.layers.7.mlp.fc1.bias', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.10.self_attn.q_proj.bias', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.9.layer_norm2.weight', 'text_model.encoder.layers.9.self_attn.out_proj.weight', 'text_model.encoder.layers.1.self_attn.q_proj.bias', 'text_model.encoder.layers.5.self_attn.out_proj.weight', 'text_model.encoder.layers.8.layer_norm2.bias', 'text_model.encoder.layers.7.self_attn.k_proj.bias', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.encoder.layers.3.mlp.fc2.bias', 'text_model.encoder.layers.4.self_attn.v_proj.weight', 'text_model.encoder.layers.6.mlp.fc1.bias', 'text_model.encoder.layers.0.layer_norm1.weight', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.5.layer_norm2.bias', 'text_model.encoder.layers.5.layer_norm2.weight', 'visual_projection.weight', 'text_model.encoder.layers.6.self_attn.v_proj.bias', 'text_model.encoder.layers.7.self_attn.v_proj.weight', 'text_model.encoder.layers.8.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.q_proj.bias', 'text_projection.weight', 'text_model.encoder.layers.0.self_attn.k_proj.bias', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.1.self_attn.q_proj.weight', 'text_model.encoder.layers.8.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.bias', 'text_model.encoder.layers.8.mlp.fc1.bias', 'text_model.encoder.layers.3.layer_norm1.weight', 'text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.3.layer_norm2.bias', 'text_model.encoder.layers.10.self_attn.k_proj.weight', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.10.layer_norm2.weight', 'text_model.encoder.layers.6.layer_norm1.weight', 
'text_model.encoder.layers.5.self_attn.out_proj.bias', 'text_model.encoder.layers.4.layer_norm1.bias', 'text_model.encoder.layers.8.layer_norm2.weight', 'text_model.encoder.layers.9.self_attn.out_proj.bias', 'text_model.encoder.layers.1.layer_norm1.weight', 'text_model.encoder.layers.1.self_attn.k_proj.bias', 'text_model.encoder.layers.8.mlp.fc2.bias', 'text_model.encoder.layers.8.self_attn.v_proj.weight', 'text_model.encoder.layers.7.mlp.fc2.bias', 'text_model.encoder.layers.1.mlp.fc2.weight', 'text_model.encoder.layers.6.layer_norm2.bias', 'text_model.encoder.layers.2.mlp.fc1.weight', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.0.layer_norm2.weight', 'text_model.encoder.layers.6.self_attn.out_proj.weight', 'text_model.encoder.layers.6.self_attn.v_proj.weight', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.0.self_attn.v_proj.bias', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.final_layer_norm.bias', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.8.self_attn.out_proj.bias', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.5.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.out_proj.bias', 'text_model.encoder.layers.7.layer_norm1.weight', 'text_model.encoder.layers.5.self_attn.v_proj.bias', 'text_model.encoder.layers.2.self_attn.q_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.weight', 'text_model.encoder.layers.8.mlp.fc1.weight', 'text_model.encoder.layers.8.self_attn.v_proj.bias', 'text_model.encoder.layers.4.self_attn.v_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.weight', 'text_model.encoder.layers.5.mlp.fc2.bias', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.encoder.layers.5.self_attn.v_proj.weight', 'text_model.encoder.layers.4.self_attn.k_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.bias', 'text_model.encoder.layers.4.self_attn.out_proj.bias', 'text_model.encoder.layers.9.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.k_proj.bias', 'text_model.encoder.layers.2.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.k_proj.weight', 'text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.0.self_attn.out_proj.weight', 'logit_scale', 'text_model.encoder.layers.6.mlp.fc2.bias', 'text_model.encoder.layers.0.mlp.fc1.weight', 'text_model.encoder.layers.5.mlp.fc2.weight', 'text_model.encoder.layers.2.layer_norm1.bias', 'text_model.encoder.layers.7.mlp.fc2.weight', 'text_model.encoder.layers.10.mlp.fc2.bias', 'text_model.encoder.layers.4.mlp.fc1.weight', 'text_model.encoder.layers.2.mlp.fc2.weight', 'text_model.encoder.layers.10.layer_norm2.bias', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.embeddings.position_ids', 'text_model.encoder.layers.7.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.q_proj.bias', 'text_model.encoder.layers.7.self_attn.out_proj.bias', 'text_model.encoder.layers.5.mlp.fc1.bias', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.10.self_attn.v_proj.weight', 'text_model.encoder.layers.0.mlp.fc1.bias', 'text_model.encoder.layers.8.layer_norm1.bias', 'text_model.encoder.layers.9.mlp.fc2.weight', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 
'text_model.encoder.layers.3.self_attn.v_proj.bias', 'text_model.encoder.layers.0.layer_norm1.bias', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.1.self_attn.k_proj.weight', 'text_model.embeddings.token_embedding.weight', 'text_model.encoder.layers.7.layer_norm2.weight', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.5.self_attn.q_proj.bias', 'text_model.encoder.layers.10.self_attn.out_proj.weight', 'text_model.encoder.layers.7.self_attn.v_proj.bias', 'text_model.encoder.layers.1.self_attn.v_proj.bias', 'text_model.encoder.layers.7.layer_norm1.bias', 'text_model.encoder.layers.10.mlp.fc2.weight', 'text_model.encoder.layers.2.self_attn.out_proj.weight', 'text_model.encoder.layers.9.layer_norm2.bias', 'text_model.encoder.layers.3.layer_norm2.weight']

-------------------- Status is --------------------
INFO: input token count: 51
INFO: output token count: 114
INFO: token Time (ms): 176.53
-------------------- end --------------------
ASSISTANT: Yes, this painting was created by Leonardo da Vinci. This Renaissance-era artwork is known as a portrait of a woman, often referred to as the Mona Lisa. It is renowned for its unique smile, captured on a woman's face, which has contributed to making the image iconic. The painting features the face of the woman and does not show her figure, as is the case in the scene presented in the image. The style of the painting is characterized by the high contrast and realistic portrayal of the subject.

USER: Can you describe this painting?
INFO: tokenizer image Time (ms): 3.88
The painting is an iconic portrait of a woman, which has been widely celebrated and cherished over the years due to its captivating subject matter and the artist's unrivaled skill. It features a close-up shot of the woman's face, with a focus on her expression. Her enigmatic and serene smile has become synonymous with the painting itself.

The artist's style can be characterized by the use of high contrast and fine attention to detail, which allows him to capture the nuances of the subject's features. The realism of the woman's appearance in the painting further highlights the artist's skill and talent.

In terms of the content, the image does not show the woman's figure, and instead, it showcases just the woman's face and expression. This adds a sense of intimacy and depth to the painting, which is why it has become a recognizable and enduring piece of art.

-------------------- Status is --------------------
INFO: input token count: 179
INFO: output token count: 205
INFO: token Time (ms): 103.64
-------------------- end --------------------
ASSISTANT: The painting is an iconic portrait of a woman, which has been widely celebrated and cherished over the years due to its captivating subject matter and the artist's unrivaled skill. It features a close-up shot of the woman's face, with a focus on her expression. Her enigmatic and serene smile has become synonymous with the painting itself.

The artist's style can be characterized by the use of high contrast and fine attention to detail, which allows him to capture the nuances of the subject's features. The realism of the woman's appearance in the painting further highlights the artist's skill and talent.

In terms of the content, the image does not show the woman's figure, and instead, it showcases just the woman's face and expression. This adds a sense of intimacy and depth to the painting, which is why it has become a recognizable and enduring piece of art.

USER: what should I know about the painting?
INFO: tokenizer image Time (ms): 4.60
The painting in question is a world-famous portrait, made by the prominent Italian artist Leonardo da Vinci during the Renaissance era. The artwork, often referred to as La Gioconda or the Mona Lisa, has become one of the most iconic depictions of a person and is recognized globally due to its historical, artistic, and cultural significance. The painting originally started as a charcoal sketch, and later it was developed into an oil painting.

The image showcases a close-up of the painting itself, focusing on the woman's face and expression. The woman's serene smile and her enigmatic gaze have become synonymous with the portrait and have contributed to its enduring popularity. The painting is notable for its detailed representation of the subject, and the artist's mastery of shading and contrasting techniques.

The portrait has been the subject of numerous discussions, interpretations, and even various theories surrounding the identity of the woman depicted. Despite these debates and multiple reproductions, the original, still surviving artwork continues to captivate viewers and remains an essential piece of global art history.

-------------------- Status is --------------------
INFO: input token count: 400
INFO: output token count: 249
INFO: token Time (ms): 117.18
-------------------- end --------------------
ASSISTANT: The painting in question is a world-famous portrait, made by the prominent Italian artist Leonardo da Vinci during the Renaissance era. The artwork, often referred to as La Gioconda or the Mona Lisa, has become one of the most iconic depictions of a person and is recognized globally due to its historical, artistic, and cultural significance. The painting originally started as a charcoal sketch, and later it was developed into an oil painting.

The image showcases a close-up of the painting itself, focusing on the woman's face and expression. The woman's serene smile and her enigmatic gaze have become synonymous with the portrait and have contributed to its enduring popularity. The painting is notable for its detailed representation of the subject, and the artist's mastery of shading and contrasting techniques.

The portrait has been the subject of numerous discussions, interpretations, and even various theories surrounding the identity of the woman depicted. Despite these debates and multiple reproductions, the original, still surviving artwork continues to captivate viewers and remains an essential piece of global art history.

USER: is there any other paintings like this?
INFO: tokenizer image Time (ms): 5.16
Yes, there are several similar paintings in terms of the subject matter but with varying executions and styles. Here are some notable examples:

  1. Girl with a Pearl Earring by Johannes Vermeer (17th century): This Baroque-era painting features a young girl in an Eastern-style turban and pearl earrings, sitting in a room with a sconce. The artwork is renowned for its striking contrast in light and shadow, and the girl's enigmatic gaze.
  2. Portrait of a Lady by the Master (Sandro Botticelli) (15th century): This piece of art also known as La Primavera is a Renaissance painting with a woman in a white dress and the masterpiece features Spring deities in background.
  3. The Mummy Project: The Man in the Mummy Case by Walter Sickert (1903): This artwork features a mummified Egyptian man behind glass, painted in an Impressionistic style.

These and other paintings share elements such as the focused subjects, unique expressions, and even portrayals of historical or mythological figures. However, without more context, it is difficult to ascertain the precise connection or relation between these pieces of art. Nonetheless, they display similarities in terms of their themes and interpretations, making them notable works

Traceback (most recent call last):
  File "C:\AIGC\bigdl\BigDL-main\python\llm\example\GPU\PyTorch-Models\Model\llava\LLaVA-1.1.1\generate.py", line 342, in <module>
    output_ids = model.generate(
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\AIGC\bigdl\BigDL-main\python\llm\example\GPU\PyTorch-Models\Model\llava\LLaVA-1.1.1\llava\model\language_model\llava_llama.py", line 78, in forward
    outputs = self.model(
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\env_bigdl_mtl\lib\site-packages\transformers\models\llama\modeling_llama.py", line 227, in forward
    attn_weights = attn_weights + attention_mask
RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)

(env_bigdl_mtl) C:\AIGC\bigdl\BigDL-main\python\llm\example\GPU\PyTorch-Models\Model\llava\LLaVA-1.1.1>
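For reference, a minimal sketch of how the three loading stages reported above could be timed. It assumes the LLaVA repo's LlavaLlamaForCausalLM and BigDL-LLM's optimize_model (both named in this issue); the local model path and FP32 dtype are illustrative, not from the original script:

```python
# Hedged timing sketch for the three loading stages reported above.
import time

import torch
from bigdl.llm import optimize_model  # auto-imports intel_extension_for_pytorch
from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM

MODEL_PATH = "C:/models/llava-v1.5-7b"  # hypothetical local path

t0 = time.perf_counter()
model = LlavaLlamaForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
t1 = time.perf_counter()

# The vision tower (openai/clip-vit-large-patch14-336) is loaded lazily.
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
    vision_tower.load_model()
t2 = time.perf_counter()

# Low-bit optimization, then move to the Intel GPU.
model = optimize_model(model)
model = model.to("xpu")
t3 = time.perf_counter()

print(f"from_pretrained:            {t1 - t0:.1f} s")
print(f"vision_tower.load_model:    {t2 - t1:.1f} s")
print(f"optimize_model + to('xpu'): {t3 - t2:.1f} s")
```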

aoke79 commented 8 months ago

V-RAM utilization across rounds (a mitigation sketch follows this list):

  1. initial GPU V-RAM is 0.9 GB
  2. after model loading it is 5.3 GB
  3. after the first round it is 6.7 GB
  4. after the second round it is 7.5 GB
  5. after the third round it is 9.2 GB
  6. after the fourth round it is 11.9 GB
  7. after the fifth round it reaches 15.4 GB, and then the app crashes
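The round-over-round growth above is consistent with device buffers accumulating across multi-round Q&A. A hypothetical diagnostic/mitigation sketch (not from the original thread), assuming intel_extension_for_pytorch is installed, which registers the torch.xpu namespace:

```python
# Hypothetical sketch: report XPU memory and free cached blocks between rounds.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)

def trim_xpu_memory(tag: str) -> None:
    """Print current device allocation, then release unused cached blocks."""
    torch.xpu.synchronize()
    allocated_gb = torch.xpu.memory_allocated() / 1024**3
    print(f"[{tag}] allocated: {allocated_gb:.2f} GB")
    torch.xpu.empty_cache()

# Usage (hypothetical loop): call after each round of generate()
# for i, question in enumerate(questions):
#     ... model.generate(...) ...
#     trim_xpu_memory(f"round {i + 1}")
```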
aoke79 commented 8 months ago

generate.txt: I attached the code I modified. Please change the file extension from "txt" to "py".

aoke79 commented 8 months ago

I updated BigDL to the latest version, and this issue is gone. It's good! Could you please also help check whether the time/token computation is correct? Thanks a lot.
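For context, the "token Time (ms)" figures in the logs above look like a single average: total generation wall time divided by the number of newly generated tokens. A minimal sketch of that calculation (variable names are illustrative; `model` and `input_ids` are assumed to come from the example script):

```python
# Sketch of the naive average-per-token timing the logs appear to report.
import time

import torch

torch.xpu.synchronize()  # make sure prior XPU work is done before timing
start = time.perf_counter()
output_ids = model.generate(input_ids, max_new_tokens=256)
torch.xpu.synchronize()
elapsed_ms = (time.perf_counter() - start) * 1000

num_new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"token Time (ms): {elapsed_ms / num_new_tokens:.2f}")
```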

aoke79 commented 8 months ago

Currently the performance is good for me. Might you please have a try to confirm what I did? Thanks a ton. log_code_1.1_model_1.5_0318_fp32_02.txt

WeiguangHan commented 8 months ago

I think the generation time measured over all tokens is correct. However, the total generation time consists of first-token latency and rest-token latency, which are better benchmarked separately. You can refer to our benchmark to add a wrapper that prints both first-token and rest-token performance (a minimal sketch follows).
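A hedged, manual stand-in for the benchmark wrapper mentioned above (not the actual utility): approximate first-token latency with a one-token generation, then average the remaining tokens from a full run. `model` and `input_ids` are assumed from the example script, running on XPU:

```python
# Sketch: separate first-token (prefill + first decode) latency from the
# average rest-token latency.
import time

import torch

def benchmark_latency(model, input_ids, max_new_tokens=128):
    # Approximate first-token latency with a single-token generation.
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1)
    torch.xpu.synchronize()
    first_ms = (time.perf_counter() - t0) * 1000

    # Full generation; subtract the first token, average over the rest.
    t0 = time.perf_counter()
    out = model.generate(input_ids, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()
    total_ms = (time.perf_counter() - t0) * 1000

    n_new = out.shape[1] - input_ids.shape[1]
    rest_ms = (total_ms - first_ms) / max(n_new - 1, 1)
    print(f"first token: {first_ms:.1f} ms; rest: {rest_ms:.2f} ms/token")
```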