zhefciad opened 8 months ago
```
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ make chat -j
CUDA is available!
src/Generate.cc src/LLaMATokenizer.cc src/OPTGenerate.cc src/OPTTokenizer.cc src/utils.cc
src/nn_modules/Fp32OPTAttention.cc src/nn_modules/Fp32OPTDecoder.cc src/nn_modules/Fp32OPTDecoderLayer.cc src/nn_modules/Fp32OPTForCausalLM.cc
src/nn_modules/Fp32llamaAttention.cc src/nn_modules/Fp32llamaDecoder.cc src/nn_modules/Fp32llamaDecoderLayer.cc src/nn_modules/Fp32llamaForCausalLM.cc
src/nn_modules/Int4OPTAttention.cc src/nn_modules/Int4OPTDecoder.cc src/nn_modules/Int4OPTDecoderLayer.cc src/nn_modules/Int4OPTForCausalLM.cc
src/nn_modules/Int8OPTAttention.cc src/nn_modules/Int8OPTDecoder.cc src/nn_modules/Int8OPTDecoderLayer.cc src/nn_modules/OPTForCausalLM.cc
src/ops/BMM_F32T.cc src/ops/BMM_S8T_S8N_F32T.cc src/ops/BMM_S8T_S8N_S8T.cc src/ops/LayerNorm.cc src/ops/LayerNormQ.cc src/ops/LlamaRMSNorm.cc
src/ops/RotaryPosEmb.cc src/ops/W8A8B8O8Linear.cc src/ops/W8A8B8O8LinearReLU.cc src/ops/W8A8BFP32OFP32Linear.cc
src/ops/arg_max.cc src/ops/batch_add.cc src/ops/embedding.cc src/ops/linear.cc src/ops/softmax.cc
../kernels/matmul_imp.cc ../kernels/matmul_int4.cc ../kernels/matmul_int8.cc
../kernels/cuda/matmul_ref_fp32.cc ../kernels/cuda/matmul_ref_int8.cc ../kernels/cuda/gemv_cuda.cu ../kernels/cuda/matmul_int4.cu
src/nn_modules/cuda/Int4llamaAttention.cu src/nn_modules/cuda/Int4llamaDecoder.cu src/nn_modules/cuda/Int4llamaDecoderLayer.cu src/nn_modules/cuda/Int4llamaForCausalLM.cu
src/nn_modules/cuda/LLaMAGenerate.cu src/nn_modules/cuda/utils.cu
src/ops/cuda/BMM_F16T.cu src/ops/cuda/LlamaRMSNorm.cu src/ops/cuda/RotaryPosEmb.cu src/ops/cuda/batch_add.cu src/ops/cuda/embedding.cu src/ops/cuda/linear.cu src/ops/cuda/softmax.cu
make: 'chat' is up to date.
(TinyChatEngine) zhef@zhef:~/TinyChatEngine/llm$ ./chat
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: Hi, I'm Jeff!
ASSISTANT: # $ ⸮# #" ⁇ $ $!!$ ⁇ " "!!" # ! $ ! ! # !⸮ $ !$$ "##! ⁇ ⸮ ⁇ $ ⁇ $"!" ⁇ # ⸮# " ⸮ $ ⁇ # $ "# ⁇ ⁇ ## ⸮#!"!" $!"!" !"
Inference latency, Total time: 40.5 s, 73.9 ms/token, 13.5 token/s, 548 tokens
USER:
```
I have an RTX 4060 Windows laptop and ran this under WSL Ubuntu. I modified the Makefile to match my GPU's compute capability (8.9). The model loads and generates at a normal speed, but the output is garbage, as shown above. Did I do anything wrong, or is this setup still not supported?
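For reference, the Makefile change I made was roughly the following. (The variable name `CUDA_ARCH` here is illustrative and may not match the actual variable in TinyChatEngine's Makefile; the point is the arch/code flags passed to `nvcc`. Compute capability 8.9 corresponds to the RTX 4060's Ada Lovelace architecture.)

```makefile
# Illustrative sketch of the edit: build for the GPU's actual
# compute capability instead of the default one in the repo.
# RTX 4060 (Ada Lovelace) reports compute capability 8.9, which
# `nvidia-smi --query-gpu=compute_cap --format=csv` confirms.
CUDA_ARCH = -gencode arch=compute_89,code=sm_89
```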
Same issue here on a GTX 1070.