triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

CUDA architecture ignored when passed to CMake #101

Open hillct opened 1 year ago

hillct commented 1 year ago

Description

Branch: Main
Base Docker Image: nvcr.io/nvidia/tritonserver:23.01-py3 (the image is likely irrelevant here)
System: AGX Orin w/jetpack 5.1

Reproduced Steps

/workspace/fastertransformer_backend/build# cmake     -D SM=87  -D CMAKE_EXPORT_COMPILE_COMMANDS=1       -D CMAKE_BUILD_TYPE=Release       -D CMAKE_INSTALL_PREFIX=/opt/tritonserver       -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}"       -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}"       -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}"       ..
-- Enable USE_TRITONSERVER_DATATYPE
-- Enable BUILD_MULTI_GPU.
-- Determining NCCL version from /usr/include/nccl.h...
-- Found NCCL (include: /usr/include, library: /usr/lib/aarch64-linux-gnu/libnccl.so)
-- Found MPI (include: , library: /usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi_cxx.so;/usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi.so)
-- Found NCCL (include: /usr/include, library: /usr/lib/aarch64-linux-gnu/libnccl.so)
-- Add DCUDA11_MODE
CUDA_VERSION 11 is greater or equal than 11, enable -DENABLE_BF16 flag
-- RapidJSON found. Headers: /usr/include
-- RapidJSON found. Headers: /usr/include
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Using CUDA 11.4
-- Found CUDA: /usr/local/cuda (found suitable version "11.4", minimum required is "10.2") 
CUDA_VERSION 11.4 is greater or equal than 11.0, enable -DENABLE_BF16 flag
-- Add DBUILD_CUTLASS_MOE, requires CUTLASS. Increases compilation time
-- Add DBUILD_CUTLASS_MIXED_GEMM, requires CUTLASS. Increases compilation time
-- Running submodule update to fetch cutlass
-- Add DBUILD_MULTI_GPU, requires MPI and NCCL
CMake Warning (dev) at build/_deps/repo-ft-src/CMakeLists.txt:84 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  CMake variable NCCL_ROOT is set to:

    /usr/local/cuda

  For compatibility, CMake is ignoring the variable.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Determining NCCL version from /usr/include/nccl.h...
-- Found NCCL (include: /usr/include, library: /usr/lib/aarch64-linux-gnu/libnccl.so)
-- USE_TRITONSERVER_DATATYPE
-- NVTX is enabled.
-- Assign GPU architecture (sm=70,75,80,86)
CMAKE_CUDA_FLAGS_RELEASE: -O3 -DNDEBUG -Xcompiler -O3 -DCUDA_PTX_FP8_F2FP_ENABLED --use_fast_math
-- COMMON_HEADER_DIRS: /workspace/fastertransformer_backend/build/_deps/repo-ft-src;/usr/local/cuda/include;/workspace/fastertransformer_backend/build/_deps/repo-ft-src/3rdparty/cutlass/include;/workspace/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/cutlass_extensions/include;/workspace/fastertransformer_backend/build/_deps/repo-ft-src/3rdparty/trt_fp8_fmha/src;/workspace/fastertransformer_backend/build/_deps/repo-ft-src/3rdparty/trt_fp8_fmha/generated
-- Found CUDA: /usr/local/cuda (found suitable version "11.4", minimum required is "10.1") 
-- Add DCUDA11_MODE
-- Configuring done
CMake Warning (dev) in build/_deps/repo-backend-src/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "kernel-library-new".
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /workspace/fastertransformer_backend/build

Please note that the CUDA architectures are still the default four:
-- Assign GPU architecture (sm=70,75,80,86)
despite the target being explicitly set with -D SM=87.
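For what it's worth, the CMP0104 warning in the log suggests one possible direction (an untested sketch; whether FasterTransformer's build honors these settings over its internal SM list is exactly what this issue is about):

```cmake
# Hypothetical override, following the advice in the CMP0104 warning above.
# Setting the policy and CMAKE_CUDA_ARCHITECTURES is the modern CMake way
# to pin target architectures for NVCC.
cmake_policy(SET CMP0104 NEW)
set(CMAKE_CUDA_ARCHITECTURES 87)
```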
hillct commented 1 year ago

The result is identical when specifying the architecture with either -D CUDA_ARCHITECTURES=87 or -D CMAKE_CUDA_ARCHITECTURES=87.
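Since the configure line passes `-D CMAKE_EXPORT_COMPILE_COMMANDS=1`, one cheap way to confirm which architectures actually made it into the build is to scan `compile_commands.json` for the `compute_XX` tokens in the nvcc flags. A minimal sketch (the helper name and sample string are mine, not from the repo):

```python
# Hypothetical check: list the compute capabilities recorded in the nvcc
# flags of compile_commands.json (written because the configure line passes
# -D CMAKE_EXPORT_COMPILE_COMMANDS=1).
import re

def extract_archs(commands_text):
    """Return the set of compute_XX capabilities named in the flags."""
    return set(re.findall(r"compute_(\d+)", commands_text))

# In the real build you would read the generated file, e.g.:
#   text = open("build/compile_commands.json").read()
# Demonstrated here on a flag string shaped like this build's output:
sample = "-gencode arch=compute_70,code=sm_70 -gencode arch=compute_86,code=sm_86"
print(sorted(extract_archs(sample)))
```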

byshiue commented 1 year ago

FasterTransformer does not support Orin now.

hillct commented 1 year ago

When can we expect this to be addressed? Is there a roadmap we should be referring to in this regard?

Aside from the build system behavior (which should at least report an unsupported architecture at build time, rather than ignoring the input, allowing a full, seemingly successful build, and relying on the JIT compiler to report the problem at runtime):

[FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain.

It seems the Triton Server component has a build for AGX Orin, or at least for Jetpack 5.1: https://github.com/triton-inference-server/server/releases/tag/v2.31.0. However, the corresponding tritonserver Docker image includes a CUDA runtime version and other components that are specifically not compatible with Jetpack 5.1, where the compatibility drag appears to be CUDA driver 520, which is clearly due for an update.
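For context on the runtime error above: nvcc embeds SASS for each listed sm_XX plus, typically, PTX for the newest listed architecture; a GPU outside the list, such as sm_87 Orin, then depends on the driver JIT-compiling that PTX, which fails when the PTX was produced by a newer toolchain than the driver understands. A rough sketch of how a fixed architecture list expands into `-gencode` flags (illustrative only, not FasterTransformer's actual build code):

```python
# Illustrative only: expand a fixed list of compute capabilities into the
# nvcc -gencode flags that decide which GPUs get native SASS code.
def gencode_flags(sms):
    """One SASS entry per listed capability, plus a PTX entry for the
    newest one so unlisted, newer GPUs can fall back to driver JIT."""
    flags = [f"-gencode arch=compute_{sm},code=sm_{sm}" for sm in sms]
    flags.append(f"-gencode arch=compute_{sms[-1]},code=compute_{sms[-1]}")
    return " ".join(flags)

# The default list from the log: sm_87 (Orin) is absent, so an Orin GPU
# can only run such a binary via JIT of the compute_86 PTX.
print(gencode_flags([70, 75, 80, 86]))
```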

hillct commented 1 year ago

OK. So 8.6 (Ampere) is the highest compute capability we can compile against. What is the lowest compute capability we can compile against? I had a potential vendor offer me an environment with GPU compute capability 3.7. Is it even worth trying, or is this a waste of my time?

byshiue commented 1 year ago

The lowest compute capability we have tested is 6.0. We cannot guarantee it works on lower compute capability.