Closed · razpa closed this issue 3 months ago
Hello, I'm running this repo on an RTX A6000 inside the pytorch:23.12 Docker image. I've verified that my environment is set up correctly:
print(torch.cuda.get_arch_list())
['sm_52', 'sm_60', 'sm_61', 'sm_70', 'sm_72', 'sm_75', 'sm_80', 'sm_86', 'sm_87', 'sm_90', 'compute_90']
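One note on what that list means: torch.cuda.get_arch_list() reports the architectures this PyTorch build ships kernels for, not the installed card's own capability; torch.cuda.get_device_capability() gives the latter. A minimal helper to format it (the sm_tag name is mine):

```python
def sm_tag(capability):
    """Format a (major, minor) compute capability as an sm_* tag, e.g. (8, 6) -> 'sm_86'."""
    major, minor = capability
    return f"sm_{major}{minor}"

# With torch installed and a visible GPU (not run here), one would do:
# import torch
# print(sm_tag(torch.cuda.get_device_capability(0)))
```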
nvcc --version
Build cuda_12.3.r12.3/compiler.33567101_0
I've followed the installation guide, but run into an error when building
cd awq/kernels
python setup.py install
It seems the build targets compute_75, while my GPU's compute capability is higher (the RTX A6000 is sm_86).
Can you please help me configure the right compute capability (sm) for my machine when building inside Docker?
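For what it's worth, extensions built with torch.utils.cpp_extension choose their -gencode targets from the TORCH_CUDA_ARCH_LIST environment variable when it is set, so pinning it before the build should avoid the compute_75 path entirely. A sketch under that assumption (run from awq/kernels):

```python
import os
import subprocess

# RTX A6000 is compute capability 8.6; build only for that arch.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.6"

# Equivalent to the shell form: TORCH_CUDA_ARCH_LIST="8.6" python setup.py install
# (not executed here; it needs the repo checkout and a CUDA toolchain)
# subprocess.run(["python", "setup.py", "install"], check=True)
```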
Error:
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 711; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to errors
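For context, the ptxas message is about the PTX target, not the installed GPU: the mma shape m16n8k16 used by these kernels requires sm_80 or newer, so any compute_75 -gencode target will fail to assemble regardless of the card. A tiny check over the build's targets (helper names are mine):

```python
MIN_MMA_M16N8K16_CAP = (8, 0)  # per the ptxas message: needs sm_80 or higher

def target_is_ok(capability):
    """True if a -gencode target of this (major, minor) can assemble m16n8k16 mma."""
    return capability >= MIN_MMA_M16N8K16_CAP

# Targets taken from the failing nvcc command in the log below:
build_targets = [(7, 5), (8, 6)]
bad = [t for t in build_targets if not target_is_ok(t)]
# the compute_75 target is the one ptxas rejects
```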
Thanks
Full console log for running python setup.py install:
python setup.py install
running install
/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing awq_inference_engine.egg-info/PKG-INFO
writing dependency_links to awq_inference_engine.egg-info/dependency_links.txt
writing requirements to awq_inference_engine.egg-info/requires.txt
writing top-level names to awq_inference_engine.egg-info/top_level.txt
reading manifest file 'awq_inference_engine.egg-info/SOURCES.txt'
writing manifest file 'awq_inference_engine.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
building 'awq_inference_engine' extension
creating /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new
creating /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm
creating /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemv
Emitting ninja build file /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers...
(overridable by setting the environment variable MAX_JOBS=N) [1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 FAILED: /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include 
-I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemm/gemm_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "ls" is undefined {ls -l /dev/nvidia* ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "l" is undefined {ls -l /dev/nvidia* ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "dev" is undefined {ls -l /dev/nvidia* ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(91): error: identifier "nvidia" is undefined {ls -l /dev/nvidia* ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(92): error: expected an expression __asm__ __volatile__( ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu(96): warning #12-D: parsing restarts here after 
previous syntax error : "r"(addr)); ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" 5 errors detected in the compilation of "/tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemm/gemm_cuda.cu". [2/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 FAILED: /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output 
/tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(34): warning #177-D: variable "ZERO" was declared but never referenced static constexpr uint32_t ZERO = 0x0; ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(44): warning #177-D: variable "blockIdx_x" was declared but never referenced int blockIdx_x = 0; ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(65): warning #177-D: variable "ld_zero_flag" was declared but never referenced bool ld_zero_flag 
= (threadIdx.y * 32 + threadIdx.x) * 8 < 64; ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(21): warning #177-D: function "__pack_half2" was declared but never referenced __pack_half2(const half x, const half y) { ^ ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 711; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 715; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 719; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 723; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 727; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 731; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 735; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 739; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 743; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 747; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 751; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 755; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 759; error : 
Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 763; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 767; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 771; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 823; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 827; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 831; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 835; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 839; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 843; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 847; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 851; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 855; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 859; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 863; error : Feature '.m16n8k16' requires .target sm_80 or higher 
ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 867; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 871; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 875; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 879; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 883; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2187; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2191; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2195; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2199; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2203; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2207; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2211; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2215; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2219; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas 
/tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2223; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2227; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2231; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2235; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2239; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2243; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2247; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2299; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2303; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2307; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2311; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2315; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2319; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2323; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas 
/tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2327; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2331; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2335; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2339; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2343; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2347; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2351; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2355; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas /tmp/tmpxft_0000240b_00000000-7_gemm_cuda_gen.compute_75.ptx, line 2359; error : Feature '.m16n8k16' requires .target sm_80 or higher ptxas fatal : Ptx assembly aborted due to errors /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(34): warning #177-D: variable "ZERO" was declared but never referenced static constexpr uint32_t ZERO = 0x0; ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(44): warning #177-D: variable "blockIdx_x" was declared but never referenced int blockIdx_x = 0; ^ /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(65): warning #177-D: variable "ld_zero_flag" was declared but never referenced bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < 64; ^ 
/tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemm_cuda_gen.cu(21): warning #177-D: function "__pack_half2" was declared but never referenced __pack_half2(const half x, const half y) { ^ [3/9] c++ -MMD -MF /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/ft_attention.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/ft_attention.o -g -O3 -fopenmp -lgomp -std=c++17 -DENABLE_BF16 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp: In instantiation of ‘void set_params(Masked_multihead_attention_params<T>&, size_t, size_t, size_t, size_t, size_t, int, int, float, float, bool, int, T*, T*, T*, T*, T*, int*, float*, T*) [with T = short unsigned int; Masked_multihead_attention_params<T> = Multihead_attention_params<short unsigned int, false>; size_t = long unsigned int]’: /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:166:5: required from here /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:73:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘Masked_multihead_attention_params<short 
unsigned int>’ {aka ‘struct Multihead_attention_params<short unsigned int, false>’}; use assignment or value-initialization instead [-Wclass-memaccess] 73 | memset(¶ms, 0, sizeof(params)); | ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:8: /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.h:122:8: note: ‘Masked_multihead_attention_params<short unsigned int>’ {aka ‘struct Multihead_attention_params<short unsigned int, false>’} declared here 122 | struct Multihead_attention_params: public Multihead_attention_params_base<T> { | ^~~~~~~~~~~~~~~~~~~~~~~~~~ /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp: In instantiation of ‘void set_params(Masked_multihead_attention_params<T>&, size_t, size_t, size_t, size_t, size_t, int, int, float, float, bool, int, T*, T*, T*, T*, T*, int*, float*, T*) [with T = __nv_bfloat16; Masked_multihead_attention_params<T> = Multihead_attention_params<__nv_bfloat16, false>; size_t = long unsigned int]’: /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:166:5: required from here /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:73:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘Masked_multihead_attention_params<__nv_bfloat16>’ {aka ‘struct Multihead_attention_params<__nv_bfloat16, false>’}; use assignment or value-initialization instead [-Wclass-memaccess] 73 | memset(¶ms, 0, sizeof(params)); | ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:8: /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.h:122:8: note: ‘Masked_multihead_attention_params<__nv_bfloat16>’ {aka ‘struct Multihead_attention_params<__nv_bfloat16, false>’} declared here 122 | struct Multihead_attention_params: public Multihead_attention_params_base<T> { 
| ^~~~~~~~~~~~~~~~~~~~~~~~~~ /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp: In instantiation of ‘void set_params(Masked_multihead_attention_params<T>&, size_t, size_t, size_t, size_t, size_t, int, int, float, float, bool, int, T*, T*, T*, T*, T*, int*, float*, T*) [with T = float; Masked_multihead_attention_params<T> = Multihead_attention_params<float, false>; size_t = long unsigned int]’: /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:166:5: required from here /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:73:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘Masked_multihead_attention_params<float>’ {aka ‘struct Multihead_attention_params<float, false>’}; use assignment or value-initialization instead [-Wclass-memaccess] 73 | memset(¶ms, 0, sizeof(params)); | ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /tmp/pycharm_project_505/awq/kernels/csrc/attention/ft_attention.cpp:8: /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.h:122:8: note: ‘Masked_multihead_attention_params<float>’ {aka ‘struct Multihead_attention_params<float, false>’} declared here 122 | struct Multihead_attention_params: public Multihead_attention_params_base<T> { | ^~~~~~~~~~~~~~~~~~~~~~~~~~ [4/9] c++ -MMD -MF /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/pybind.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c 
/tmp/pycharm_project_505/awq/kernels/csrc/pybind.cpp -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/pybind.o -g -O3 -fopenmp -lgomp -std=c++17 -DENABLE_BF16 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 [5/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/position_embedding/pos_encoding_kernels.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/position_embedding/pos_encoding_kernels.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/position_embedding/pos_encoding_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 [6/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile 
--dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemv_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemv_cuda.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization/gemv_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemv_cuda.cu(224): warning #177-D: variable "blockDim_z" was declared but never referenced int blockDim_z = num_out_feats; ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" /tmp/pycharm_project_505/awq/kernels/csrc/quantization/gemv_cuda.cu(224): warning #177-D: variable "blockDim_z" was declared but never referenced int blockDim_z = num_out_feats; ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" [7/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile 
--dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/layernorm/layernorm.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/layernorm/layernorm.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/layernorm/layernorm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 [8/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemv/gemv_cuda.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu -o 
/tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/quantization_new/gemv/gemv_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu(83): warning #177-D: variable "kShuffleSize" was declared but never referenced static constexpr int kShuffleSize = 32; ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" /tmp/pycharm_project_505/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu(83): warning #177-D: variable "kShuffleSize" was declared but never referenced static constexpr int kShuffleSize = 32; ^ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" [9/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/decoder_masked_multihead_attention.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c 
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu -o /tmp/pycharm_project_505/awq/kernels/build/temp.linux-x86_64-3.10/csrc/attention/decoder_masked_multihead_attention.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(989): warning #177-D: variable "v_offset" was declared but never referenced int v_offset = k_offset; ^ detected during: instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" 
/tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(995): warning #177-D: variable "v_bias_offset" was declared but never referenced int v_bias_offset = k_bias_offset; ^ detected during: instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(989): warning #177-D: variable "v_offset" was declared but never referenced int v_offset = k_offset; ^ detected during: instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu Remark: The warnings can be suppressed with "-diag-suppress <warning-number>" /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention_template.hpp(995): warning #177-D: variable "v_bias_offset" was declared but never referenced int v_bias_offset = k_bias_offset; ^ 
detected during:
instantiation of "void mmha_launch_kernel<T,Dh,Dh_MAX,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, Dh=32, Dh_MAX=32, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 70 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
instantiation of "void multihead_attention_<T,KERNEL_PARAMS_TYPE>(const KERNEL_PARAMS_TYPE &, const cudaStream_t &) [with T=float, KERNEL_PARAMS_TYPE=Multihead_attention_params<float, false>]" at line 111 of /tmp/pycharm_project_505/awq/kernels/csrc/attention/decoder_masked_multihead_attention.cu
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/usr/lib/python3.10/distutils/command/install_lib.py", line 109, in build
    self.run_command('build_ext')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 876, in build_extensions
    build_ext.build_extensions(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
    super(build_ext, self).build_extension(ext)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
    objects = self.compiler.compile(sources,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 689, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1777, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
python-BaseException
Following #93 I was able to force sm_90 by running:
TORCH_CUDA_ARCH_LIST="9.0" python setup.py install
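For reference, PyTorch's extension builder derives its -gencode flags from the TORCH_CUDA_ARCH_LIST environment variable, so the value just has to match your card's compute capability. A minimal sketch (the arch_entry helper and the hard-coded (8, 6) tuple are illustrative, not from this repo; on a GPU machine torch.cuda.get_device_capability() reports the real value). Note that the RTX A6000 is Ampere, compute capability 8.6, so "8.6" is the entry that actually matches that card:

```python
import os

# Sketch (hypothetical helper, not from this repo): turn the (major, minor)
# tuple that torch.cuda.get_device_capability() returns into the string
# TORCH_CUDA_ARCH_LIST expects. (8, 6) is the RTX A6000; (9, 0) is Hopper.
def arch_entry(capability):
    major, minor = capability
    return f"{major}.{minor}"

# On the GPU machine you would use:
#   import torch
#   cap = torch.cuda.get_device_capability(0)   # (8, 6) on an RTX A6000
cap = (8, 6)  # assumed here so the snippet runs without a GPU
os.environ["TORCH_CUDA_ARCH_LIST"] = arch_entry(cap)
print(os.environ["TORCH_CUDA_ARCH_LIST"])  # → 8.6
```

With the variable set, python setup.py install targets only the listed architecture, which sidesteps the failing compute_75 pass (the .m16n8k16 tensor-core instruction in the ptxas error requires sm_80 or newer). Targeting the capability the GPU actually reports also avoids building code the card cannot load at runtime.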