wangzhaode / mnn-llm

LLM deployment project based on MNN.
Apache License 2.0

llama3 model cannot answer #189

Closed. hyperbolic-c closed this issue 3 months ago.

hyperbolic-c commented 4 months ago

When I run the llama3 MNN model:

(py_llama) st@server03:~/mnn-llm$ ./build/cli_demo ./models/llama3/
model path is ./models/llama3/
### model name : Llama3_8b
The device support i8sdot:0, support fp16:0, support i8mm: 0
load tokenizer
load tokenizer Done
### disk embedding is 1
[ 10% ] load ./models/llama3//lm.mnn model ... Done!
[ 15% ] load ./models/llama3//block_0.mnn model ... Done!
[ 18% ] load ./models/llama3//block_1.mnn model ... Done!
[ 21% ] load ./models/llama3//block_2.mnn model ... Done!
[ 23% ] load ./models/llama3//block_3.mnn model ... Done!
[ 26% ] load ./models/llama3//block_4.mnn model ... Done!
[ 29% ] load ./models/llama3//block_5.mnn model ... Done!
[ 31% ] load ./models/llama3//block_6.mnn model ... Done!
[ 34% ] load ./models/llama3//block_7.mnn model ... Done!
[ 36% ] load ./models/llama3//block_8.mnn model ... Done!
[ 39% ] load ./models/llama3//block_9.mnn model ... Done!
[ 42% ] load ./models/llama3//block_10.mnn model ... Done!
[ 44% ] load ./models/llama3//block_11.mnn model ... Done!
[ 47% ] load ./models/llama3//block_12.mnn model ... Done!
[ 50% ] load ./models/llama3//block_13.mnn model ... Done!
[ 52% ] load ./models/llama3//block_14.mnn model ... Done!
[ 55% ] load ./models/llama3//block_15.mnn model ... Done!
[ 58% ] load ./models/llama3//block_16.mnn model ... Done!
[ 60% ] load ./models/llama3//block_17.mnn model ... Done!
[ 63% ] load ./models/llama3//block_18.mnn model ... Done!
[ 66% ] load ./models/llama3//block_19.mnn model ... Done!
[ 68% ] load ./models/llama3//block_20.mnn model ... Done!
[ 71% ] load ./models/llama3//block_21.mnn model ... Done!
[ 74% ] load ./models/llama3//block_22.mnn model ... Done!
[ 76% ] load ./models/llama3//block_23.mnn model ... Done!
[ 79% ] load ./models/llama3//block_24.mnn model ... Done!
[ 81% ] load ./models/llama3//block_25.mnn model ... Done!
[ 84% ] load ./models/llama3//block_26.mnn model ... Done!
[ 87% ] load ./models/llama3//block_27.mnn model ... Done!
[ 89% ] load ./models/llama3//block_28.mnn model ... Done!
[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!

Then when I ask it a question, it returns:

Q: who are you

A: You're asking "who"?

#################################
 total tokens num  = 20
prompt tokens num  = 13
output tokens num  = 7
  total time = 2.59 s
prefill time = 1.31 s
 decode time = 1.28 s
  total speed = 7.73 tok/s
prefill speed = 9.92 tok/s
 decode speed = 5.48 tok/s
   chat speed = 2.71 tok/s
##################################

Q:
A: You're asking "are"?

#################################
 total tokens num  = 39
prompt tokens num  = 32
output tokens num  = 7
  total time = 4.21 s
prefill time = 2.81 s
 decode time = 1.41 s
  total speed = 9.26 tok/s
prefill speed = 11.40 tok/s
 decode speed = 4.98 tok/s
   chat speed = 1.66 tok/s
##################################

Q:
A: You're asking "you"?

#################################
 total tokens num  = 58
prompt tokens num  = 51
output tokens num  = 7
  total time = 4.82 s
prefill time = 3.48 s
 decode time = 1.34 s
  total speed = 12.04 tok/s
prefill speed = 14.64 tok/s
 decode speed = 5.24 tok/s
   chat speed = 1.45 tok/s
##################################

Q: introduce Beijing

A: You're asking "introduce"?

#################################
 total tokens num  = 84
prompt tokens num  = 76
output tokens num  = 8
  total time = 6.32 s
prefill time = 5.19 s
 decode time = 1.14 s
  total speed = 13.29 tok/s
prefill speed = 14.66 tok/s
 decode speed = 7.04 tok/s
   chat speed = 1.27 tok/s
##################################

Q:
A: You're asking "Beijing"?

#################################
 total tokens num  = 108
prompt tokens num  = 100
output tokens num  = 8
  total time = 7.68 s
prefill time = 6.51 s
 decode time = 1.17 s
  total speed = 14.06 tok/s
prefill speed = 15.37 tok/s
 decode speed = 6.81 tok/s
   chat speed = 1.04 tok/s
##################################
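Note the pattern: each turn the model echoes the next single word of my original question ("who", "are", "you", then "introduce", "Beijing"), and the prompt token count keeps growing (13, 32, 51, 76, 100), so the chat history seems to be fed back in with a template the model cannot parse. For reference, here is a hypothetical example of the prompt format Llama-3-Instruct normally expects (I have not checked what template mnn-llm actually builds for llama3):

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

who are you<|eot_id|><|start_header_id|>assistant<|end_header_id|>
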

Any solution? Thanks!

hyperbolic-c commented 4 months ago

When I run the benchmark, it returns correct answers:

[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!
prompt file is ./resource/prompt.txt
### warmup ... Done
It's great to chat with you! How are you doing today?
哈哈！我是 ChatGPT，一个人工智能语言模型！ (Haha! I am ChatGPT, an AI language model!)
I'm just an AI, I don't have access to real-time weather information. However, you can check the weather forecast online or on your local weather app to get an idea of the current weather conditions.

#################################
prompt tokens num  = 54
decode tokens num  = 77
prefill time = 3.85 s
 decode time = 12.91 s
prefill speed = 14.02 tok/s
 decode speed = 5.96 tok/s
##################################

It looks like llama3 can only respond via llm->response(prompts[i]), not chat via llm->chat(). @wangzhaode, do you have any suggestions, please?
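
In the meantime, here is a minimal workaround sketch in the spirit of cli_demo.cpp, driving each turn through response() instead of the stateful chat() loop (the createLLM/load/response names are assumed from this repo's demo code; check llm.hpp for the exact signatures):

#include "llm.hpp"   // mnn-llm public header (assumed path)
#include <memory>
#include <string>
#include <vector>

int main() {
    // Load the converted llama3 model the same way cli_demo does.
    std::string model_dir = "./models/llama3/";
    std::unique_ptr<Llm> llm(Llm::createLLM(model_dir));
    llm->load(model_dir);

    // Treat every question as an independent single-shot prompt,
    // sidestepping the chat() history handling that garbles llama3.
    std::vector<std::string> prompts = {
        "who are you",
        "introduce Beijing",
    };
    for (const auto& prompt : prompts) {
        llm->response(prompt);  // prints the answer, as in the benchmark
    }
    return 0;
}

This matches the benchmark path that works above, at the cost of losing multi-turn context.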

github-actions[bot] commented 3 months ago

Marking as stale. No activity in 30 days.