mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library
https://mit-han-lab.github.io/TinyChatEngine/
MIT License

Assistant spitting out non-readable characters on RTX 4060 #71

Open zhefciad opened 8 months ago

zhefciad commented 8 months ago
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ make chat -j
CUDA is available!
src/Generate.cc src/LLaMATokenizer.cc src/OPTGenerate.cc src/OPTTokenizer.cc src/utils.cc src/nn_modules/Fp32OPTAttention.cc src/nn_modules/Fp32OPTDecoder.cc src/nn_modules/Fp32OPTDecoderLayer.cc src/nn_modules/Fp32OPTForCausalLM.cc src/nn_modules/Fp32llamaAttention.cc src/nn_modules/Fp32llamaDecoder.cc src/nn_modules/Fp32llamaDecoderLayer.cc src/nn_modules/Fp32llamaForCausalLM.cc src/nn_modules/Int4OPTAttention.cc src/nn_modules/Int4OPTDecoder.cc src/nn_modules/Int4OPTDecoderLayer.cc src/nn_modules/Int4OPTForCausalLM.cc src/nn_modules/Int8OPTAttention.cc src/nn_modules/Int8OPTDecoder.cc src/nn_modules/Int8OPTDecoderLayer.cc src/nn_modules/OPTForCausalLM.cc src/ops/BMM_F32T.cc src/ops/BMM_S8T_S8N_F32T.cc src/ops/BMM_S8T_S8N_S8T.cc src/ops/LayerNorm.cc src/ops/LayerNormQ.cc src/ops/LlamaRMSNorm.cc src/ops/RotaryPosEmb.cc src/ops/W8A8B8O8Linear.cc src/ops/W8A8B8O8LinearReLU.cc src/ops/W8A8BFP32OFP32Linear.cc src/ops/arg_max.cc src/ops/batch_add.cc src/ops/embedding.cc src/ops/linear.cc src/ops/softmax.cc ../kernels/matmul_imp.cc ../kernels/matmul_int4.cc ../kernels/matmul_int8.cc
../kernels/cuda/matmul_ref_fp32.cc ../kernels/cuda/matmul_ref_int8.cc
../kernels/cuda/gemv_cuda.cu ../kernels/cuda/matmul_int4.cu  src/nn_modules/cuda/Int4llamaAttention.cu src/nn_modules/cuda/Int4llamaDecoder.cu src/nn_modules/cuda/Int4llamaDecoderLayer.cu src/nn_modules/cuda/Int4llamaForCausalLM.cu src/nn_modules/cuda/LLaMAGenerate.cu src/nn_modules/cuda/utils.cu src/ops/cuda/BMM_F16T.cu src/ops/cuda/LlamaRMSNorm.cu src/ops/cuda/RotaryPosEmb.cu src/ops/cuda/batch_add.cu src/ops/cuda/embedding.cu src/ops/cuda/linear.cu src/ops/cuda/softmax.cu
make: 'chat' is up to date.
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ ./chat
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: Hi, I'm Jeff!
ASSISTANT:

 #
$  ⸮#

#" ⁇ $
   $!!$
        ⁇ "

"!!" #         !
$
         ! !    #

!⸮
$       !$$
"##!
 ⁇ ⸮ ⁇  $ ⁇

        $"!" ⁇  #

        ⸮#
"

⸮
        $ ⁇

#        $
 "# ⁇  ⁇ ##
⸮#!"!"
$!"!" !"

Inference latency, Total time: 40.5 s, 73.9 ms/token, 13.5 token/s, 548 tokens
USER:

I have an RTX 4060 Windows laptop and ran this under WSL Ubuntu. I modified the Makefile to match my GPU's compute capability (89). Did I do anything wrong, or is this setup still not supported?
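For anyone reproducing this: one quick sanity check is to confirm what compute capability the card actually reports, so the value hard-coded in the Makefile (for an RTX 4060 that would be nvcc's -gencode arch=compute_89,code=sm_89) really matches the device. A minimal standalone sketch using the standard CUDA runtime API, not TinyChatEngine code:

    // check_cc.cu -- print each visible GPU's compute capability.
    // Build: nvcc check_cc.cu -o check_cc
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // An RTX 4060 should report 8.9 (sm_89); a GTX 1070 reports 6.1 (sm_61).
            printf("Device %d: %s, compute capability %d.%d\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }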

dt1729 commented 7 months ago

Same issue here on a GTX 1070.

[Screenshot from 2023-11-10 17-43-26]
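One thing worth checking on both cards: if the binary was compiled only for an architecture the card doesn't have (the GTX 1070 is compute capability 6.1, sm_61), kernel launches fail with cudaErrorNoKernelImageForDevice, and if those errors are never checked the output buffers stay uninitialized, which could look exactly like the garbage above. A minimal standalone sketch of that failure mode (the kernel here is illustrative, not TinyChatEngine code):

    // launch_check.cu -- demonstrate catching a silent kernel-launch failure.
    // Build for one architecture only, e.g.:
    //   nvcc -gencode arch=compute_89,code=sm_89 launch_check.cu -o launch_check
    // Run it on a card of a different architecture and the check below should
    // report "no kernel image is available for execution on the device".
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float *out) { out[threadIdx.x] = 1.0f; }

    int main() {
        float *d = nullptr;
        cudaMalloc((void **)&d, 32 * sizeof(float));
        touch<<<1, 32>>>(d);
        cudaError_t err = cudaGetLastError();  // launch errors are silent unless checked
        if (err != cudaSuccess) {
            fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaDeviceSynchronize();
        cudaFree(d);
        printf("kernel image matches this device\n");
        return 0;
    }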