sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. It can cache results in local storage or in remote storage, including various cloud storage options.
Opening this issue to describe and track tasks related to implementing `nvcc` support in `sccache-dist`.

tl;dr: `sccache` should add `cicc` and `ptxas` as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work.

Background
`sccache-dist` mode relies on a compiler's ability to compile preprocessed input. The source file is preprocessed on the client and looked up in the cache; on a cache miss, the toolchain and the preprocessed file are sent to the `sccache-dist` scheduler for compilation.

This model is not supported by NVIDIA's CUDA compiler `nvcc`, because `nvcc` lacks support for compiling preprocessed input. This is not a deficiency in `nvcc`; rather, the feature does not map onto what `nvcc` actually does under the hood.

A CUDA C++ file contains standard C++ (CPU-executed) code and CUDA device code side by side. Internally, `nvcc` runs a number of preprocessor steps to separate this code into host and device code, each of which is compiled by a different compiler. `nvcc` can also be instructed to compile the same CUDA device code for different architectures and bundle the results into a "fat binary".

The preprocessor output for each device architecture is potentially different, so there is no single preprocessed input file nvcc can produce that could be fed back into the compiler later. (A rough analogy: imagine `gcc` compiling and assembling a single object for both x86 and ARM that could be executed on either platform.)

Rather than attempt to trick `nvcc` into compiling preprocessed input, `sccache` can decompose and distribute `nvcc`'s constituent sub-compiler invocations.

Proposal
`sccache` should add `cicc` and `ptxas` as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work. `sccache` should change its `nvcc` compiler to run the underlying host and device compiler steps individually, with each step re-entering the standard hash + cache + (distributed) compilation pipeline.

`sccache` can do this by using the `nvcc --dryrun` flag, which prints the underlying commands that `nvcc` would execute:

[`nvcc --dryrun` output omitted]
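As a rough illustration of how the dry-run output might be consumed, here is a minimal sketch in Python (for illustration only; sccache itself is written in Rust). It assumes that each relevant line is prefixed with `#$ ` and that environment assignments have the form `NAME=value`. `parse_dryrun` and `tool_name` are hypothetical helpers, not part of sccache or nvcc.

```python
import os
import re
import shlex

# Assumption: dry-run lines look like "#$ NAME=value" or "#$ program args...".
_PREFIX = "#$ "
_ENV_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_dryrun(text):
    """Split dry-run text into (env, commands).

    env maps variable names to their raw values; commands is a list of
    argv lists, in the order they appear in the output.
    """
    env, commands = {}, []
    for line in text.splitlines():
        if not line.startswith(_PREFIX):
            continue  # skip anything that is not a dry-run line
        body = line[len(_PREFIX):]
        match = _ENV_RE.match(body)
        if match:
            env[match.group(1)] = match.group(2)
        else:
            argv = shlex.split(body)
            if argv:
                commands.append(argv)
    return env, commands


def tool_name(argv):
    """Name of the program a command invokes, e.g. 'cicc' or 'ptxas'."""
    return os.path.basename(argv[0])
```

With the commands split out this way, the ones whose `tool_name` is `cicc` or `ptxas` would be the candidates for the hash + cache + distribute pipeline, while the preprocessing commands stay on the client.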
This output represents a sequence of preprocessing steps that must run on the sccache client, followed by compilation steps on the preprocessed result that can be distributed to `sccache-dist` workers.

Explanation
Here's a rough breakdown of the command stages above:
These two lines run the host preprocessor to resolve host-side macros and inline `#include`s, then run the CUDA front-end to separate the source into host and device source files. The sccache client should run both of these steps before requesting any compilation jobs.

In this phase, `nvcc`:

1. Preprocesses the source (`x.cu`)
2. Runs `cicc` on the output of step 1 to generate a `.ptx` file
3. Runs `ptxas` on the output of step 2 to assemble the PTX into a `.cubin`

All these steps must run sequentially. Step 1 must run on the sccache client, but steps 2 and 3 can be executed by `sccache-dist` workers.

This is similar to the prior commands, except for a different GPU arch, `sm_70`. These commands must still run sequentially with respect to each other, but they can run in parallel with the commands from the prior stage.

In this stage, the outputs from the prior two stages are assembled into a `.fatbin` via the `fatbinary` invocation, then the original preprocessed host code is combined with the `.fatbin` and assembled into the final `.o` by the host compiler. These stages must run sequentially, but can be executed by `sccache-dist` workers (the final host compiler call can use the existing `sccache-dist` logic for preprocessing and distributing the work).

Additional Benefits
In addition to supporting `sccache-dist` in `nvcc`, this new behavior also benefits `sccache` clients that aren't configured to use distributed compilation, because `sccache` can now avoid recompiling the underlying `.ptx` and `.cubin` device compilation artifacts that are assembled into the final `.o`.

For example, a CI job could compile code for all supported device architectures. This produces an object file with a certain hash; let's call it `hash_all`.

A developer may want to compile the same code with the same options, but for a smaller subset of architectures that match the GPU on their machine. Since this produces an object file with a different hash (`hash_subset`), today `sccache` yields a cache miss on this `.o` file and re-runs `nvcc` (which itself runs `cicc` and `ptxas`) because the arguments + input don't match `hash_all` produced in CI.

However, with the proposed changes, while `sccache` would still yield a cache miss for the `.o` produced by the `nvcc` command, it would yield a cache hit on the underlying `.ptx` and `.cubin` files produced by `cicc` and `ptxas` respectively, skipping the lion's share of the actual compilation done by `nvcc`.
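To make the cache behavior concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions: cache keys are derived from the source text and architecture names alone (real keys would also cover the compiler version, arguments, and preprocessed input), the architectures other than `sm_70` are arbitrary examples, and `key` and `build` are hypothetical names.

```python
import hashlib


def key(*parts):
    """Stable hash over the parts that identify one compilation step."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode())
        h.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def build(source, archs, cache):
    """Simulate one nvcc invocation; return a list of (step, 'hit' | 'miss')."""
    log = []

    def lookup(step, k):
        status = "hit" if k in cache else "miss"
        cache.add(k)  # a miss is "compiled" and stored for later builds
        log.append((step, status))
        return status

    # The object-level key depends on the full set of architectures.
    if lookup("nvcc .o", key("nvcc", source, *sorted(archs))) == "hit":
        return log
    for arch in sorted(archs):
        # Device steps are keyed per architecture, independent of the others.
        lookup(f"cicc {arch}", key("cicc", source, arch))
        lookup(f"ptxas {arch}", key("ptxas", source, arch))
    return log


cache = set()
build("x.cu contents", ["sm_60", "sm_70", "sm_80"], cache)  # CI: all misses
dev = build("x.cu contents", ["sm_70"], cache)              # developer subset
# dev == [("nvcc .o", "miss"), ("cicc sm_70", "hit"), ("ptxas sm_70", "hit")]
```

The developer build still misses on the object, but both device steps for `sm_70` are served from the entries the CI build stored.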
Tasks
Work is ongoing in this branch.
- Add `cicc` and `ptxas` as first-class compilers supported by sccache
- `cicc` and `ptxas` toolchains from client's CUDA toolkit
- Change the `nvcc` compiler to call `nvcc --dryrun`, run each sub-command through `sccache` as appropriate
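As a rough sketch of the last task, the stage ordering described under Explanation could be driven along these lines. This is Python for illustration only, not the planned implementation. The `run(where, step, *inputs)` callback is a hypothetical stand-in for "execute this sub-command on the client or on a worker", and the step names are placeholders rather than real command lines.

```python
from concurrent.futures import ThreadPoolExecutor


def device_chain(arch, run):
    """Run the per-architecture steps in order; return the ptxas result."""
    pre = run("client", f"preprocess {arch}")   # step 1 stays on the client
    ptx = run("worker", f"cicc {arch}", pre)    # step 2 may be distributed
    return run("worker", f"ptxas {arch}", ptx)  # step 3 may be distributed


def compile_cuda(archs, run):
    """Drive one nvcc-style compilation through the stages in order."""
    # Host preprocessing and the CUDA front-end run first, on the client.
    host = run("client", "host preprocess + front-end")
    # Chains for different architectures are independent, so they can
    # run concurrently; each chain itself stays sequential.
    with ThreadPoolExecutor() as pool:
        cubins = list(pool.map(lambda arch: device_chain(arch, run), archs))
    # The final two steps are sequential and depend on every chain.
    fatbin = run("worker", "fatbinary", *cubins)
    return run("worker", "host compile", host, fatbin)
```

In this sketch, each `run` call is where a sub-command would re-enter the hash + cache + (distributed) compilation pipeline, so a cached `.ptx` or `.cubin` would short-circuit only its own step.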