mozilla / sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. It can cache compilation results in remote storage, including various cloud storage services, or on local disk.
Apache License 2.0

Supporting `nvcc` in `sccache-dist` #2238

Closed: trxcllnt closed this issue 1 week ago

trxcllnt commented 3 months ago

Opening this issue to describe and track tasks related to implementing nvcc support in sccache-dist.

tl;dr: sccache should add cicc and ptxas as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work.

Background

sccache-dist mode relies on a compiler's ability to compile preprocessed input. The source file is preprocessed on the client, looked up in the cache, and on cache miss the toolchain + preprocessed file is sent to the sccache-dist scheduler for compilation.
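For a plain C++ compile, that model looks roughly like this (a minimal sketch of the flow, not the exact commands sccache runs):

$ gcc -E x.cpp -o x.ii    # 1. preprocess on the client; hash the args + x.ii and check the cache
$ gcc -c x.ii -o x.o      # 2. on a cache miss, this compile step can be shipped to a sccache-dist worker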

This model is not supported by NVIDIA's CUDA compiler nvcc, because nvcc lacks support for compiling preprocessed input. This is not a deficiency in nvcc; rather, compiling preprocessed input simply doesn't map onto what nvcc actually does under the hood.

A CUDA C++ file contains standard C++ (CPU-executed) code and CUDA device code side-by-side. Internally nvcc runs a number of preprocessor steps to separate this code into host and device code that are each compiled by different compilers. nvcc can also be instructed to compile the same CUDA device code for different architectures and bundle them into a "fat binary".

The preprocessor output for each device architecture is potentially different, so there is no single preprocessed input file nvcc could produce that could later be fed back into the compiler. (A rough analogy would be if gcc could compile and assemble objects for both x86 and ARM that could be executed on either platform.)
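For example, the per-arch preprocessor passes differ in (among other things) the value of __CUDA_ARCH__ they define, which is enough for their outputs to legitimately diverge (simplified commands, assuming a default /usr/local/cuda install):

$ gcc -E -x c++ -D__CUDACC__ -D__CUDA_ARCH__=600 -I/usr/local/cuda/include -include cuda_runtime.h x.cu -o x.sm60.ii
$ gcc -E -x c++ -D__CUDACC__ -D__CUDA_ARCH__=700 -I/usr/local/cuda/include -include cuda_runtime.h x.cu -o x.sm70.ii
$ diff -q x.sm60.ii x.sm70.ii    # generally differ wherever headers or user code branch on __CUDA_ARCH__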

Rather than attempt to trick nvcc into compiling preprocessed input, sccache can decompose and distribute nvcc's constituent sub-compiler invocations.

Proposal

sccache should add cicc and ptxas as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work. sccache should change its nvcc compiler to run the underlying host and device compiler steps individually, with each step re-entering the standard hash + cache + (distributed) compilation pipeline.

sccache can do this by utilizing the nvcc --dryrun flag, which outputs the underlying calls executed by nvcc:

nvcc --dryrun output:
$ nvcc -c x.cu -o x.cu.o -gencode=arch=compute_60,code=[sm_60] -gencode=arch=compute_70,code=[compute_70,sm_70] --dryrun
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ CICC_PATH=/usr/local/cuda/bin/../nvvm/bin
#$ CICC_NEXT_PATH=/usr/local/cuda/bin/../nvvm-next/bin
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/home/ptaylor/.nvm/versions/node/v22.4.0/bin:/home/ptaylor/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ptaylor/.fzf/bin:/usr/local/cuda/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" --stub_file_name "tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx" -o "/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin"
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" -o "/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin"
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" "--image3=kind=ptx,sm=70,file=/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" --embedded-fatbin="/tmp/tmpxft_00003437_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_00003437_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" -o "x.cu.o"

This output represents a sequence of preprocessing steps that must run on the sccache client, followed by compilation steps on the preprocessed result that can be distributed to sccache-dist workers.
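The sub-commands can be recovered mechanically from that output, for example (a rough illustration; a real implementation would parse the --dryrun output inside sccache):

$ nvcc -c x.cu -o x.cu.o -gencode=arch=compute_60,code=[sm_60] --dryrun 2>&1 | sed -n 's/^#\$ //p' > steps.sh
# Lines of the form FOO=bar set up the environment (CICC_PATH, PATH, ...); the remaining lines are the
# gcc/cudafe++/cicc/ptxas/fatbinary invocations that sccache would hash, cache, and (where possible) distribute.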

Explanation

Here's a rough breakdown of the command stages above:

#$ gcc -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii" 
#$ cudafe++ --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed  --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" --stub_file_name "tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii" 

These two lines run the host preprocessor to resolve host-side macros and inline #includes, then run the CUDA front-end to separate the source into host and device source files. The sccache client should run both these steps before requesting any compilation jobs.

#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" 
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed   -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.gpu"  "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64  "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"  -o "/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" 

In this phase, nvcc:

  1. runs the host compiler to preprocess the input file (x.cu)
  2. runs cicc on the output of step 1 to generate a .ptx file
  3. runs ptxas on the output of step 2 to assemble the PTX into a .cubin

All these steps must run sequentially. Step 1 must run on the sccache client, but steps 2 and 3 can be executed by sccache-dist workers.
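In cache terms, each distributable step gets its own key derived from its own, much smaller, input. For illustration only (sha256sum merely stands in for sccache's internal hashing, which would also cover the command-line arguments and toolchain):

$ sha256sum /tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii    # -> key for the cicc output (.ptx)
$ sha256sum /tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx        # -> key for the ptxas output (.cubin)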

#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" 
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed   -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.gpu"  "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64  "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"  -o "/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" 

This is similar to the prior commands, except that it targets a different GPU arch, sm_70. These commands must still run sequentially with respect to each other, but they can run in parallel with the commands from the prior stage.
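Put differently, the two per-arch chains form independent pipelines that could be driven concurrently, roughly like this (a sketch only; $CICC_FLAGS_60 and $CICC_FLAGS_70 are placeholders for the full cicc argument lists shown above, and the temp-file paths are shortened):

( cicc $CICC_FLAGS_60 x.compute_60.cpp1.ii -o x.compute_60.ptx &&
  ptxas -arch=sm_60 -m64 x.compute_60.ptx -o x.compute_60.cubin ) &
( cicc $CICC_FLAGS_70 x.compute_70.cpp1.ii -o x.compute_70.ptx &&
  ptxas -arch=sm_70 -m64 x.compute_70.ptx -o x.compute_70.cubin ) &
wait    # both .cubin files are then ready for the fatbinary step below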

#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" "--image3=kind=ptx,sm=70,file=/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" --embedded-fatbin="/tmp/tmpxft_00003437_00000000-3_x.fatbin.c" 
#$ rm /tmp/tmpxft_00003437_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"   -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" -o "x.cu.o" 

In this stage, the outputs from the prior two stages are bundled into a .fatbin by the fatbinary invocation, then the host compiler combines the preprocessed host code with the embedded fatbin and compiles it into the final .o. These steps must run sequentially, but they can be executed by sccache-dist workers (the final host compiler call can use the existing sccache-dist logic for preprocessing and distributing the work).

Additional Benefits

In addition to supporting sccache-dist for nvcc, this new behavior also benefits sccache clients that aren't configured for distributed compilation, because sccache can now cache, and avoid recompiling, the underlying .ptx and .cubin device artifacts that are assembled into the final .o.

For example, a CI job could compile code for all supported device architectures:

$ nvcc ... \
   -gencode=arch=compute_60,code=[sm_60] \
   -gencode=arch=compute_70,code=[sm_70] \
   -gencode=arch=compute_80,code=[sm_80] \
   -gencode=arch=compute_90,code=[compute_90,sm_90]

The above produces an object file with a certain hash; call it hash_all.

A developer may want to compile the same code with the same options, but for a smaller subset of architectures that match the GPU on their machine:

$ nvcc ... -gencode=arch=compute_90,code=[compute_90,sm_90]

Since the above produces an object file with a different hash (hash_subset), today sccache yields a cache miss on this .o and re-runs nvcc (which itself runs cicc and ptxas), because the arguments + input don't match the ones that produced hash_all in CI.

However, with the proposed changes, sccache would still yield a cache miss for the .o produced by the nvcc command, but it would yield cache hits on the underlying .ptx and .cubin files produced by cicc and ptxas respectively, skipping the lion's share of the actual compilation done by nvcc.
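Concretely, the developer's build would then look something like this (an illustrative trace of the proposed behavior, not actual sccache output):

$ nvcc ... -gencode=arch=compute_90,code=[compute_90,sm_90]
#   gcc -E (host + device preprocess)   -> runs on the client
#   cicc -arch compute_90               -> cache hit (same device input as the CI build)
#   ptxas -arch=sm_90                   -> cache hit
#   fatbinary + final host compile      -> re-run, since the bundled arch set differs from CI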

Tasks

Work is ongoing in this branch.

  1. [x] Add cicc and ptxas as first-class compilers supported by sccache
  2. [x] Support bundling cicc and ptxas toolchains from client's CUDA toolkit
  3. [x] Refactor nvcc compiler to call nvcc --dryrun, run each sub-command through sccache as appropriate