microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table

Cannot compile your llama.cpp with CUDA support on AGX Orin #8

Closed. Zijie-Tian closed this issue 1 month ago.

Zijie-Tian commented 1 month ago

I am trying to reproduce your experimental results, so I am testing with the llama.cpp you provided. However, when I enable CUDA support with the following command, the build fails with the errors shown below.

cmake .. -DLLAMA_TMAC=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_PREFIX_PATH=${TMAC_ROOT_DIR}/install/lib/cmake/t-mac -DCMAKE_BUILD_TYPE=Release -DLLAMA_LLAMAFILE_DEFAULT=OFF -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++

The output is as follows:

-- The C compiler identification is Clang 17.0.6
-- The CXX compiler identification is Clang 17.0.6
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /home/tzj/Code/T-MAC/build/clang+llvm-17.0.6-aarch64-linux-gnu/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/tzj/Code/T-MAC/build/clang+llvm-17.0.6-aarch64-linux-gnu/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda-12.2/include (found version "12.2.140")
-- CUDA found
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda-12.2/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 87
-- TMAC found
-- CUDA host compiler is GNU 11.4.0

-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- Configuring done (4.0s)
-- Generating done (0.2s)
-- Build files have been written to: /home/tzj/Code/T-MAC/3rdparty/llama.cpp/build
[  3%] Generating build details from Git
[  3%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 13%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
[ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
[ 13%] Building C object CMakeFiles/ggml.dir/ggml.c.o
[ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/alibi.cu.o
[ 13%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o
[ 17%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o
[ 20%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o
-- Found Git: /usr/bin/git (found version "2.34.1")
[ 24%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o
[ 24%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o
[ 27%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o
[ 27%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o
[ 31%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o
[ 31%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
[ 34%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.o
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:148: CMakeFiles/ggml.dir/ggml-cuda/alibi.cu.o] Error 1
gmake[3]: *** Waiting for unfinished jobs....
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:178: CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:163: CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:238: CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:193: CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:223: CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o] Error 1
[ 37%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o
[ 41%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o
[ 41%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:133: CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:208: CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:253: CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o] Error 1
[ 41%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:283: CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o] Error 1
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:298: CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.o] Error 1
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:328: CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o] Error 1
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:343: CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o] Error 1
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:268: CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o] Error 1
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:313: CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o] Error 1
cc1plus: error: unknown value ‘armv8.2a+fp16’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8-r native
gmake[3]: *** [CMakeFiles/ggml.dir/build.make:358: CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o] Error 1
[ 41%] Generating build details from Git

Could you please let me know how to modify llama.cpp so that it compiles with CUDA, or provide instructions on how to reproduce your CUDA baseline? If possible, it would be great if you could provide a complete script for the power experiment.

[Screenshot 2024-07-05 20:48:17]
caoshijie0501 commented 1 month ago

Hi @Zijie-Tian, thanks for your interest. T-MAC needs the clang compiler for the CPU kernels, while CUDA needs the nvcc compiler, so it is recommended to set "LLAMA_TMAC=OFF" when compiling with CUDA.
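For reference, a minimal sketch of the suggested CUDA-only configure: it reuses the flags from the failing command above but disables T-MAC, and it also drops the T-MAC prefix path and the clang compiler overrides, on the assumption that those are only needed for the T-MAC CPU build:

# CUDA-only baseline build: with LLAMA_TMAC=OFF, the clang-specific
# -march=armv8.2a+fp16 flag should no longer be forwarded to nvcc's
# GCC host compiler (cc1plus), which only accepts 'armv8.2-a' spellings.
cmake .. -DLLAMA_TMAC=OFF -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release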

Zijie-Tian commented 1 month ago

Hi @caoshijie0501, sorry for the late response; I am very interested in your work. Using your method, I am now able to compile llama-bench and main with CUDA.

However, after reading the paper, I have the following questions for you:

  1. Regarding extremely low-bit precision, I am curious how you set up your baselines. For example, I saw the results for llama (W2) in your profile_data.md. How are these special precisions executed on the CPU/GPU side?
  2. How can Hugging Face models be exported to these special precisions?

Thank you!!!

kaleid-liner commented 1 month ago

> Hi @caoshijie0501, sorry for the late response; I am very interested in your work. Using your method, I am now able to compile llama-bench and main with CUDA.
>
> However, after reading the paper, I have the following questions for you:
>
> 1. Regarding extremely low-bit precision, I am curious how you set up your baselines. For example, I saw the results for llama (W2) in your profile_data.md. How are these special precisions executed on the CPU/GPU side?
> 2. How can Hugging Face models be exported to these special precisions?
>
> Thank you!!!

  1. As stated in README.md,

    We evaluate BitNet-3B and Llama-2-7B (W2) with T-MAC 2-bit and llama.cpp Q2_K, and evaluate Llama-2-7B (W4) with T-MAC 4-bit and llama.cpp Q4_0.

    You can refer to llama.cpp for how to obtain the quantized Q2_K/Q4_0 models or convert them from Hugging Face models, and for how to run the models with llama.cpp; see the first sketch after this list.

  2. Currently we provide support for BitNet/EfficientQAT/GPTQ. You can start from the pretrained quantized models provided by BitNet/EfficientQAT, following the doc. If you want to evaluate an arbitrary unquantized Hugging Face model with T-MAC, you can refer to GPTQModel or BitDistiller for how to quantize a model, then follow our README to evaluate it; see the second sketch after this list.
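For item 1, a minimal sketch of producing the Q2_K/Q4_0 baseline models with llama.cpp's own tools (script and binary names as in llama.cpp of that period; the model paths are illustrative):

# Convert a Hugging Face checkpoint to an FP16 GGUF file, then quantize it
# to the two baseline precisions compared against T-MAC.
python convert-hf-to-gguf.py /path/to/llama-2-7b --outfile llama-2-7b-f16.gguf
./quantize llama-2-7b-f16.gguf llama-2-7b-q2_k.gguf Q2_K
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_0.gguf Q4_0

For item 2, a rough sketch of the pre-quantized route (the Hugging Face repo ID below is a placeholder, and the pipeline invocation is an assumption patterned on the T-MAC README; check the docs for the exact model IDs and flags):

# Download a pre-quantized 2-bit EfficientQAT checkpoint (placeholder repo ID),
# then run T-MAC's end-to-end conversion-and-evaluation pipeline on it.
huggingface-cli download <efficientqat-w2-repo> --local-dir ./models/llama-2-7b-w2
python tools/run_pipeline.py -o ./models/llama-2-7b-w2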

kaleid-liner commented 1 month ago

@Zijie-Tian As the issue described in the title is already solved, I will close it. For more questions related to this repo, please open a new issue, and for questions related more to the paper than to the code, please open a discussion.