mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library
https://mit-han-lab.github.io/TinyChatEngine/
MIT License

Support for Tesla P100 GPU inference #58

Open songkq opened 10 months ago

songkq commented 10 months ago

Hi, when I run TinyChatEngine with ./chat LLaMA2_7B_chat int4 on a Tesla P100 GPU, it generates garbled output. Could you please give some advice on this issue?

Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT:

 #
$  #

#" ⁇ $
  $!!$
       ⁇ "

"!!" #         !
$
         ! !    #

!
$   !$$
"##!
 ⁇  ⁇   $ ⁇

        $"!" ⁇  #

        #
"

        $ ⁇

#    $
 "# ⁇  ⁇ ##
#!"!"
$!"!"!"

Inference latency, Total time: 7.9 s, 14.5 ms/token, 69.0 token/s, 548 tokens
atomicrajat commented 10 months ago

Hi @songkq, before running the make script, edit the Makefile and change -arch=sm_86 to the value that corresponds to your GPU's compute capability (the Tesla P100 is 6.0). You can look it up here: https://developer.nvidia.com/cuda-gpus. I encountered the same problem with a GTX 1050 Ti, which has compute capability 6.1, so I replaced the flag with sm_61. With that change, however, the application does not compile: it reports that some operations require a minimum of sm_75.

Note: without changing the compute capability, the application compiles as if those operations could run on your GPU, but the values get corrupted during inference, hence the output above.
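If you're not sure which value to apply, a small standalone CUDA program (my own helper, not part of TinyChatEngine) can print the compute capability of each installed GPU and the matching -arch flag:

// check_cc.cu -- standalone helper, not part of TinyChatEngine.
// Build with: nvcc check_cc.cu -o check_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // A Tesla P100 reports 6.0, so the Makefile flag would be -arch=sm_60.
        std::printf("Device %d: %s, compute capability %d.%d -> -arch=sm_%d%d\n",
                    i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}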

songkq commented 9 months ago

@atomicrajat Thanks. It still does not work with -arch=sm_60.

CUDA is available!
src/Generate.cc src/LLaMATokenizer.cc src/OPTGenerate.cc src/OPTTokenizer.cc src/utils.cc src/nn_modules/Fp32llamaAttention.cc src/nn_modules/Fp32llamaDecoder.cc src/nn_modules/Fp32llamaDecoderLayer.cc src/nn_modules/Fp32llamaForCausalLM.cc src/nn_modules/Fp32OPTAttention.cc src/nn_modules/Fp32OPTDecoder.cc src/nn_modules/Fp32OPTDecoderLayer.cc src/nn_modules/Fp32OPTForCausalLM.cc src/nn_modules/Int4OPTAttention.cc src/nn_modules/Int4OPTDecoder.cc src/nn_modules/Int4OPTDecoderLayer.cc src/nn_modules/Int4OPTForCausalLM.cc src/nn_modules/Int8OPTAttention.cc src/nn_modules/Int8OPTDecoder.cc src/nn_modules/Int8OPTDecoderLayer.cc src/nn_modules/OPTForCausalLM.cc src/ops/arg_max.cc src/ops/batch_add.cc src/ops/BMM_F32T.cc src/ops/BMM_S8T_S8N_F32T.cc src/ops/BMM_S8T_S8N_S8T.cc src/ops/embedding.cc src/ops/LayerNorm.cc src/ops/LayerNormQ.cc src/ops/linear.cc src/ops/LlamaRMSNorm.cc src/ops/RotaryPosEmb.cc src/ops/softmax.cc src/ops/W8A8B8O8Linear.cc src/ops/W8A8B8O8LinearReLU.cc src/ops/W8A8BFP32OFP32Linear.cc ../kernels/matmul_imp.cc ../kernels/matmul_int4.cc ../kernels/matmul_int8.cc
../kernels/cuda/matmul_ref_fp32.cc ../kernels/cuda/matmul_ref_int8.cc
../kernels/cuda/matmul_cuda.cu ../kernels/cuda/matmul_int4.cu  src/nn_modules/cuda/Int4llamaAttention.cu src/nn_modules/cuda/Int4llamaDecoder.cu src/nn_modules/cuda/Int4llamaDecoderLayer.cu src/nn_modules/cuda/Int4llamaForCausalLM.cu src/nn_modules/cuda/LLaMAGenerate.cu src/nn_modules/cuda/utils.cu src/ops/cuda/batch_add.cu src/ops/cuda/BMM_F16T.cu src/ops/cuda/embedding.cu src/ops/cuda/linear.cu src/ops/cuda/LlamaRMSNorm.cu src/ops/cuda/RotaryPosEmb.cu src/ops/cuda/softmax.cu
/usr/local/cuda/bin/nvcc -std=c++17 -Xptxas -O3 -use_fast_math -Xcompiler "-pthread" -DQM_CUDA -arch=sm_60 --forward-unknown-to-host-compiler -Xcompiler "-mavx2" -mfma -ffast-math -fpermissive -DQM_x86 -I../kernels -I./include -I./include/nn_modules -I./json/single_include/ -I./half-2.2.0/include/ -I./include/ops/cuda -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -I/usr/include/x86_64-linux-gnu -o chat application/chat.cc build/transformer/src/Generate.o build/transformer/src/LLaMATokenizer.o build/transformer/src/OPTGenerate.o build/transformer/src/OPTTokenizer.o build/transformer/src/utils.o build/transformer/src/nn_modules/Fp32llamaAttention.o build/transformer/src/nn_modules/Fp32llamaDecoder.o build/transformer/src/nn_modules/Fp32llamaDecoderLayer.o build/transformer/src/nn_modules/Fp32llamaForCausalLM.o build/transformer/src/nn_modules/Fp32OPTAttention.o build/transformer/src/nn_modules/Fp32OPTDecoder.o build/transformer/src/nn_modules/Fp32OPTDecoderLayer.o build/transformer/src/nn_modules/Fp32OPTForCausalLM.o build/transformer/src/nn_modules/Int4OPTAttention.o build/transformer/src/nn_modules/Int4OPTDecoder.o build/transformer/src/nn_modules/Int4OPTDecoderLayer.o build/transformer/src/nn_modules/Int4OPTForCausalLM.o build/transformer/src/nn_modules/Int8OPTAttention.o build/transformer/src/nn_modules/Int8OPTDecoder.o build/transformer/src/nn_modules/Int8OPTDecoderLayer.o build/transformer/src/nn_modules/OPTForCausalLM.o build/transformer/src/ops/arg_max.o build/transformer/src/ops/batch_add.o build/transformer/src/ops/BMM_F32T.o build/transformer/src/ops/BMM_S8T_S8N_F32T.o build/transformer/src/ops/BMM_S8T_S8N_S8T.o build/transformer/src/ops/embedding.o build/transformer/src/ops/LayerNorm.o build/transformer/src/ops/LayerNormQ.o build/transformer/src/ops/linear.o build/transformer/src/ops/LlamaRMSNorm.o build/transformer/src/ops/RotaryPosEmb.o build/transformer/src/ops/softmax.o build/transformer/src/ops/W8A8B8O8Linear.o build/transformer/src/ops/W8A8B8O8LinearReLU.o build/transformer/src/ops/W8A8BFP32OFP32Linear.o build/transformer/../kernels/matmul_imp.o build/transformer/../kernels/matmul_int4.o build/transformer/../kernels/matmul_int8.o build/transformer/../kernels/cuda/matmul_ref_fp32.o build/transformer/../kernels/cuda/matmul_ref_int8.o build/transformer/../kernels/cuda/matmul_cuda.o build/transformer/../kernels/cuda/matmul_int4.o build/transformer/src/nn_modules/cuda/Int4llamaAttention.o build/transformer/src/nn_modules/cuda/Int4llamaDecoder.o build/transformer/src/nn_modules/cuda/Int4llamaDecoderLayer.o build/transformer/src/nn_modules/cuda/Int4llamaForCausalLM.o build/transformer/src/nn_modules/cuda/LLaMAGenerate.o build/transformer/src/nn_modules/cuda/utils.o build/transformer/src/ops/cuda/batch_add.o build/transformer/src/ops/cuda/BMM_F16T.o build/transformer/src/ops/cuda/embedding.o build/transformer/src/ops/cuda/linear.o build/transformer/src/ops/cuda/LlamaRMSNorm.o build/transformer/src/ops/cuda/RotaryPosEmb.o build/transformer/src/ops/cuda/softmax.o  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -lnvrtc -lcuda -lcurand -lcusolver -L/usr/local/cuda/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/x86_64-linux-gnu -Xlinker -rpath=/usr/local/cuda/lib64 -Xlinker -rpath=/usr/local/cuda/targets/x86_64-linux/lib -Xlinker -rpath=/usr/lib/x86_64-linux-gnu
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt

TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT:

 #
$  #

#" ⁇ $
  $!!$
       ⁇ "

"!!" #         !
$
         ! !    #

!
$   !$$
"##!
 ⁇  ⁇   $ ⁇

        $"!" ⁇  #

        #
"

        $ ⁇

#    $
 "# ⁇  ⁇ ##
#!"!"
$!"!"!"

Inference latency, Total time: 7.5 s, 13.7 ms/token, 73.1 token/s, 548 tokens
RaymondWang0 commented 9 months ago

Hi @songkq and @atomicrajat, thank you for your interest in our work. You are both correct: our CUDA backend may not support Nvidia GPUs with compute capability lower than 7.5. We've added clearer instructions about this to the README.

In the meantime, we will soon release a new version that supports Nvidia GPUs with lower compute capability. Please stay tuned!
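For context, here is a generic sketch of the kind of per-architecture gating this involves (illustrative only, not our actual kernel code): device code can branch on __CUDA_ARCH__ so that GPUs below a given compute capability take a plain fallback path instead of instructions they cannot execute.

// arch_gated_dot.cu -- generic illustration of __CUDA_ARCH__ gating, not TinyChatEngine code.
#include <cuda_fp16.h>

// Dot product of two fp16 vectors; assumes n is even and the pointers are 4-byte aligned.
__device__ float dot_fp16(const __half* a, const __half* b, int n) {
    float acc = 0.0f;
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750)
    // Path compiled only for Turing (sm_75) and newer; in a real library this is where
    // newer-architecture intrinsics (e.g. Tensor Core MMA) would live.
    for (int i = 0; i < n; i += 2) {
        __half2 pa = *reinterpret_cast<const __half2*>(a + i);
        __half2 pb = *reinterpret_cast<const __half2*>(b + i);
        __half2 p  = __hmul2(pa, pb);
        acc += __low2float(p) + __high2float(p);
    }
#else
    // Fallback for older GPUs such as sm_60/sm_61: accumulate in plain fp32.
    for (int i = 0; i < n; ++i) {
        acc += __half2float(a[i]) * __half2float(b[i]);
    }
#endif
    return acc;
}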

tuobulatuo commented 9 months ago

Can't wait for this update, thanks!

RaymondWang0 commented 9 months ago

Hi @songkq, @atomicrajat, and @tuobulatuo, thanks for your patience. We just released the new version of our CUDA implementation. We've tested it on various GPUs and verified that it now works on GPUs down to compute capability 6.1, such as the GTX 1080 Ti and TITAN Xp. Our tests on GPUs with higher compute capability, such as the RTX 4090, RTX 3090, RTX 2080 Ti, RTX A6000, and Jetson AGX Orin, have also been successful.

While we anticipate that the Tesla P100 GPU with compute capability 6.0 should be compatible, we can't guarantee it as we haven't been able to test it directly. Please feel free to try it out and give us feedback.

Last but not least, our new CUDA implementation boosts text generation speed by approximately 40%, though this may vary across GPUs. Feel free to give it a try! We'll continue to optimize TinyChatEngine's performance and support more features and models, so please stay tuned for more updates. Thanks!
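If you want to verify what your build actually targets on a P100, a small standalone probe (hypothetical helper, not shipped with TinyChatEngine) can report which architecture variant the driver picks at runtime; building it as a fat binary with several -gencode entries mirrors how a single binary can cover multiple compute capabilities:

// arch_probe.cu -- hypothetical sanity check, not shipped with TinyChatEngine.
// Build a fat binary covering several compute capabilities, for example:
//   nvcc arch_probe.cu -o arch_probe \
//     -gencode arch=compute_60,code=sm_60 \
//     -gencode arch=compute_61,code=sm_61 \
//     -gencode arch=compute_75,code=sm_75 \
//     -gencode arch=compute_86,code=sm_86
#include <cstdio>
#include <cuda_runtime.h>

__global__ void report_arch() {
#ifdef __CUDA_ARCH__
    // Prints the architecture variant the driver selected for this device.
    printf("Running device code compiled for sm_%d\n", __CUDA_ARCH__ / 10);
#endif
}

int main() {
    report_arch<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        // If the binary contains no image the device can run (e.g. built only for
        // sm_86 but launched on a P100), the failure surfaces here.
        std::printf("Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}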