mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Invalid Compute Capability when building Docker pytorch:23.12 #198

Closed · razpa closed this issue 3 months ago

razpa commented 3 months ago

Hello, I'm running this repo on an RTX A6000 inside the Docker image pytorch:23.12. I've checked that my environment is set up correctly:

  1. print(torch.cuda.get_arch_list()) >> ['sm_52', 'sm_60', 'sm_61', 'sm_70', 'sm_72', 'sm_75', 'sm_80', 'sm_86', 'sm_87', 'sm_90', 'compute_90']
  2. nvcc --version >> Build cuda_12.3.r12.3/compiler.33567101_0
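(Note: torch.cuda.get_arch_list() reports the architectures the PyTorch build itself was compiled for, not the capability of the installed GPU. If it helps, the device's own compute capability can be queried directly, e.g.:)

python -c "import torch; print(torch.cuda.get_device_capability(0))"
# prints a (major, minor) tuple, e.g. (8, 6) means sm_86; recent drivers also expose it via:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader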

I've followed the installation guide, but ran into an error when building:

cd awq/kernels
python setup.py install

It seems the code tries to build for compute_75 (per the ptxas error below), while my machine supports a much higher one (sm_90).

Could you please help me configure the right compute capability (sm) for my machine when building inside Docker?
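(For diagnosis: the -gencode flags in the build log are generated by torch.utils.cpp_extension, which honors the TORCH_CUDA_ARCH_LIST environment variable and otherwise auto-detects the GPUs it can see. A quick way to preview the flags it will emit, using a private helper that may change between PyTorch versions:)

python -c "from torch.utils.cpp_extension import _get_cuda_arch_flags; print(_get_cuda_arch_flags())"
# prints e.g. ['-gencode=arch=compute_75,code=sm_75', ...]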

Error:

ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 711; error : Feature '.m16n8k16' requires .target sm_80 or higher

ptxas fatal : Ptx assembly aborted due to errors

Thanks

Full console log from running python setup.py install:

running install
/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing awq_inference_engine.egg-info/PKG-INFO
writing dependency_links to awq_inference_engine.egg-info/dependency_links.txt
writing requirements to awq_inference_engine.egg-info/requires.txt
writing top-level names to awq_inference_engine.egg-info/top_level.txt
reading manifest file 'awq_inference_engine.egg-info/SOURCES.txt'
writing manifest file 'awq_inference_engine.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
building 'awq_inference_engine' extension
creating /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new
creating /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm
creating /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemv
Emitting ninja build file /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
FAILED: /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "ls" is undefined
  {ls -l /dev/nvidia*
   ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "l" is undefined
  {ls -l /dev/nvidia*
       ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "dev" is undefined
  {ls -l /dev/nvidia*
          ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "nvidia" is undefined
  {ls -l /dev/nvidia*
              ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(92): error: expected an expression
    __asm__ __volatile__(
    ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(96): warning #12-D: parsing restarts here after previous syntax error
        : "r"(addr));
                    ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
5 errors detected in the compilation of "/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu".
[2/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
FAILED: /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(34): warning #177-D: variable "ZERO" was declared but never referenced
    static constexpr uint32_t ZERO = 0x0;
                              ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(44): warning #177-D: variable "blockIdx_x" was declared but never referenced
    int blockIdx_x = 0;
        ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(65): warning #177-D: variable "ld_zero_flag" was declared but never referenced
    bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < 64;
         ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(21): warning #177-D: function "__pack_half2" was declared but never referenced
  __pack_half2(const half x, const half y) {
  ^
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 711; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 715; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 719; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 723; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 727; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 731; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 735; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 739; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 743; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 747; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 751; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 755; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 759; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 763; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 767; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 771; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 823; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 827; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 831; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 835; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 839; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 843; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 847; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 851; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 855; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 859; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 863; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 867; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 871; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 875; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 879; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 883; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2187; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2191; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2195; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2199; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2203; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2207; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2211; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2215; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2219; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2223; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2227; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2231; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2235; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2239; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2243; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2247; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2299; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2303; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2307; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2311; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2315; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2319; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2323; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2327; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2331; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2335; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2339; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2343; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2347; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2351; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2355; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2359; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas fatal   : Ptx assembly aborted due to errors
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(34): warning #177-D: variable "ZERO" was declared but never referenced
    static constexpr uint32_t ZERO = 0x0;
                              ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(44): warning #177-D: variable "blockIdx_x" was declared but never referenced
    int blockIdx_x = 0;
        ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(65): warning #177-D: variable "ld_zero_flag" was declared but never referenced
    bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < 64;
         ^
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(21): warning #177-D: function "__pack_half2" was declared but never referenced
  __pack_half2(const half x, const half y) {
  ^
[3/9] c++ -MMD -MF /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/ft_attention.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/ft_attention.o -g -O3 -fopenmp -lgomp -std=c++17 -DENABLE_BF16 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp: In instantiation of ‘void set_params(Masked_multihead_attention_params<T>&, size_t, size_t, size_t, size_t, size_t, int, int, float, float, bool, int, T*, T*, T*, T*, T*, int*, float*, T*) [with T = short unsigned int; Masked_multihead_attention_params<T> = Multihead_attention_params<short unsigned int, false>; size_t = long unsigned int]’:
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:166:5:   required from here
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:73:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘Masked_multihead_attention_params<short unsigned int>’ {aka ‘struct Multihead_attention_params<short unsigned int, false>’}; use assignment or value-initialization instead [-Wclass-memaccess]
   73 |     memset(&params, 0, sizeof(params));
      |     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:8:
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.h:122:8: note: ‘Masked_multihead_attention_params<short unsigned int>’ {aka ‘struct Multihead_attention_params<short unsigned int, false>’} declared here
  122 | struct Multihead_attention_params: public Multihead_attention_params_base<T> {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp: In instantiation of ‘void set_params(Masked_multihead_attention_params<T>&, size_t, size_t, size_t, size_t, size_t, int, int, float, float, bool, int, T*, T*, T*, T*, T*, int*, float*, T*) [with T = __nv_bfloat16; Masked_multihead_attention_params<T> = Multihead_attention_params<__nv_bfloat16, false>; size_t = long unsigned int]’:
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:166:5:   required from here
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:73:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘Masked_multihead_attention_params<__nv_bfloat16>’ {aka ‘struct Multihead_attention_params<__nv_bfloat16, false>’}; use assignment or value-initialization instead [-Wclass-memaccess]
   73 |     memset(&params, 0, sizeof(params));
      |     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:8:
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.h:122:8: note: ‘Masked_multihead_attention_params<__nv_bfloat16>’ {aka ‘struct Multihead_attention_params<__nv_bfloat16, false>’} declared here
  122 | struct Multihead_attention_params: public Multihead_attention_params_base<T> {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp: In instantiation of ‘void set_params(Masked_multihead_attention_params<T>&, size_t, size_t, size_t, size_t, size_t, int, int, float, float, bool, int, T*, T*, T*, T*, T*, int*, float*, T*) [with T = float; Masked_multihead_attention_params<T> = Multihead_attention_params<float, false>; size_t = long unsigned int]’:
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:166:5:   required from here
/tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:73:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘Masked_multihead_attention_params<float>’ {aka ‘struct Multihead_attention_params<float, false>’}; use assignment or value-initialization instead [-Wclass-memaccess]
   73 |     memset(&params, 0, sizeof(params));
      |     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:8:
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.h:122:8: note: ‘Masked_multihead_attention_params<float>’ {aka ‘struct Multihead_attention_params<float, false>’} declared here
  122 | struct Multihead_attention_params: public Multihead_attention_params_base<T> {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~
[4/9] c++ -MMD -MF /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/pybind.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/pybind.cpp -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/pybind.o -g -O3 -fopenmp -lgomp -std=c++17 -DENABLE_BF16 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1
[5/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/position_embedding/pos_encoding_kernels.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/position_embedding/pos_encoding_kernels.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/position_embedding/pos_encoding_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
[6/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemv_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemv_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemv_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemv_cuda.cu(224): warning #177-D: variable "blockDim_z" was declared but never referenced
      int blockDim_z = num_out_feats;
          ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemv_cuda.cu(224): warning #177-D: variable "blockDim_z" was declared but never referenced
      int blockDim_z = num_out_feats;
          ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
[7/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/layernorm/layernorm.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/layernorm/layernorm.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/layernorm/layernorm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
[8/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemv/gemv_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemv/gemv_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu(83): warning #177-D: variable "kShuffleSize" was declared but never referenced
      static constexpr int kShuffleSize = 32;
                           ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu(83): warning #177-D: variable "kShuffleSize" was declared but never referenced
      static constexpr int kShuffleSize = 32;
                           ^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
[9/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/decoder_masked_multihead_attention.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/decoder_masked_multihead_attention.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(989): warning #177-D: variable "v_offset" was declared but never referenced
      int v_offset = k_offset;
          ^
          detected during:
            instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
            instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(995): warning #177-D: variable "v_bias_offset" was declared but never referenced
      int v_bias_offset = k_bias_offset;
          ^
          detected during:
            instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
            instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(989): warning #177-D: variable "v_offset" was declared but never referenced
      int v_offset = k_offset;
          ^
          detected during:
            instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
            instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(995): warning #177-D: variable "v_bias_offset" was declared but never referenced
      int v_bias_offset = k_bias_offset;
          ^
          detected during:
            instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
            instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/usr/lib/python3.10/distutils/command/install_lib.py", line 109, in build
    self.run_command('build_ext')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 876, in build_extensions
    build_ext.build_extensions(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
    super(build_ext, self).build_extension(ext)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
    objects = self.compiler.compile(sources,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 689, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1777, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
python-BaseException
razpa commented 3 months ago

Following #93, I was able to force sm_90 by running: TORCH_CUDA_ARCH_LIST="9.0" python setup.py install.
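(For anyone landing here later: TORCH_CUDA_ARCH_LIST accepts semicolon-separated "major.minor" values, and a "+PTX" suffix additionally embeds PTX for forward compatibility, so a multi-target build might look like the sketch below. An RTX A6000 reports compute capability 8.6, so "8.6" is the value that matches that card.)

cd awq/kernels
TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0+PTX" python setup.py install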