turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

error: identifier "__hfma2" is undefined #380

Open timefliesfang opened 3 months ago

timefliesfang commented 3 months ago

When I tried to run the test_inference.py example, I hit this error, which shows that the half-precision CUDA APIs (such as __hfma2) could not be found. But my CUDA version is recent (12.3) and I am using a V100 GPU.

I checked the compile commands that ninja generated, for example:

[15/20] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output h_gemm.cuda.o.d -DTORCH_EXTENSION_NAME=exllamav2_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/group/ossdphi_algo_scratch_09/chaomfan/exllamav2/exllamav2/exllamav2_ext -isystem /group/ossdphi_algo_scratch_09/chaomfan/exllamav2/exllma_cuda/lib/python3.10/site-packages/torch/include -isystem /group/ossdphi_algo_scratch_09/chaomfan/exllamav2/exllma_cuda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /group/ossdphi_algo_scratch_09/chaomfan/exllamav2/exllma_cuda/lib/python3.10/site-packages/torch/include/TH -isystem /group/ossdphi_algo_scratch_09/chaomfan/exllamav2/exllma_cuda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_72,code=sm_72 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_87,code=sm_87 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -lineinfo -O3 -std=c++17 -c /group/ossdphi_algo_scratch_09/chaomfan/exllamav2/exllamav2/exllamav2_ext/cuda/h_gemm.cu -o h_gemm.cuda.o

FAILED: h_gemm.cuda.o

There are some compile options for older CUDA architectures involved (e.g. -gencode=arch=compute_52,code=sm_52). When I removed these, this single file compiled successfully.

I wonder how to solve this issue. Thank you.

turboderp commented 3 months ago

Not sure why that's happening. It shouldn't be trying to compile for sm_52. Do you by any chance have an older NVIDIA GPU in the same system also?
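
The half-precision intrinsics, __hfma2 included, only exist for compute capability 5.3 and up, which is why a compute_52 target fails to find them. You can list the cards torch sees, and their compute capabilities, with something like:

>>> import torch
>>> for i in range(torch.cuda.device_count()):
...     print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))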

RaRasputinRGLM commented 2 months ago

I have a similar error. I am building from the current TensorRT-LLM Docker image (Ubuntu) with CUDA_ARCHS="89-real". I am using a 4070 Ti and currently trying to debug; __hfma2 keeps showing up as not found.

EDIT: also, nvcc version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

g++ version:

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2nd edit: dropped the current TensorRT build and went back to nvcr.io/nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

When it builds against that TensorRT-LLM Docker image it does something strange; changing the build order seems to work, but I am still running tests.

vancoykendall commented 2 months ago

I ran into the same error: identifier "__hfma2" is undefined when trying to install exllamav2 (pip install .). @timefliesfang I'm not sure how you removed the lower arch versions, but I was able to install it by setting TORCH_CUDA_ARCH_LIST.

Specifically, I installed with the command:

TORCH_CUDA_ARCH_LIST="8.0 8.6 9.0" pip install -e .

I'm installing inside NVIDIA's NGC Docker image 23.12.
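
If you only need the extension for the card actually in the machine, I believe restricting the list to that single architecture works as well, e.g. TORCH_CUDA_ARCH_LIST="7.0" for a V100 or "8.9" for a 4070 Ti.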

@turboderp I'm guessing those lower archs got added to the generated compile flags because the torch version I'm using was compiled for many different archs. You can check for your PyTorch version with torch.cuda.get_arch_list():

>>> torch.cuda.get_arch_list()
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']

Although, I don't know anything about how ninja works, so I don't know for certain.
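
For what it's worth, you can also print the exact -gencode flags torch's extension builder would hand to nvcc. This goes through a private helper, so it may change between torch versions, but it shows whether TORCH_CUDA_ARCH_LIST is being picked up:

>>> from torch.utils import cpp_extension
>>> cpp_extension._get_cuda_arch_flags()

If the variable isn't set, I believe it falls back to the archs of the visible GPUs, or to a broad default list when none are detected at build time, which would pull the old architectures back in.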