Closed stoklund closed 5 years ago
One responsibility of the `Backend` class is to provide a common hardware abstraction layer interface. Each supported target architecture provides a subclass of `Backend` that implements virtual functions which specialize compilation and execution to the particular target.
Old-school re-targetable compilers used `#ifdef`s sprinkled all over their source code to specialize the compiler for a target architecture. The compiler could only be built for one architecture at a time. This meant that unit tests would only test the architecture you built, and you couldn't even detect syntax errors if they happened to be invisible in your current configuration. This made continuous integration very painful.
LLVM can be re-targeted at runtime. It uses a `Target` class with virtual functions that provide further information about a target architecture. LLVM's target configuration system is more complex than Glow needs.
The Cretonne code generator uses a `TargetIsa` trait which encapsulates both the target architecture and any compiler flags and settings that can affect code generation. This means that the generated code is a pure function of the input IR and the `TargetIsa` instance: no secret command-line flags can affect the code generation and cause hard-to-reproduce bugs.
Common to the LLVM and Cretonne designs is that their `Target` and `TargetIsa` objects are constant once they have been created. All the virtual methods are declared `const`. This means that one target object can be reused for multiple (concurrent) compilations, and the state related to target configuration is clearly separated.
To inform the design, it's useful to look at a couple of use cases for Glow. These all assume that Glow is used as a JIT compiler, i.e. we're not concerned with cross compilation or saving compiled models to disk.
- Running parallel inferences with a single fixed graph on a multicore CPU.
- Same as above, but running on a multi-socket NUMA server: to save memory, we want to avoid multiple copies of the weights. Instead, we partition the graph and distribute it among the sockets. That way, each socket holds part of the weights.
- Say the weights of our graph are too big to fit in one GPU's high-bandwidth memory, but we have two GPUs. We want to partition the graph into two parts that each fit on one GPU.
- We want to run inference on multiple different graphs with low latency. The weights for all the graphs fit in the HBM of a single GPU.
We want the `Backend` design to be compatible with these kinds of use cases. This doesn't necessarily mean that all the backends can do all these things, but the design shouldn't prevent them.
This all suggests that maybe variables should not belong to the `Module` along with the compiler IR. Such a change is not in scope for this issue, but it is worth keeping in mind when designing the `Backend` interface.
As a first step, we can split the `Backend` state into three parts:
This first step does not address the issues with multithreading that the use cases bring up. A second refactoring step can handle this by distinguishing between a compiled function and a bound function which has fixed input and output locations.
The initial incarnation of `CompiledFunction` above contains both the result of compilation and the state needed during execution, such as the input/output variables in the module. This means that a compiled function can't be run in two threads concurrently, for example.
We can separate these two kinds of data such that a single compilation can be reused for concurrent execution:

- `CompiledFunction` owns the compiled code and possibly some constant data.
- `BoundFunction` has bindings to specific input/output variable instances, and it owns memory buffers that are mutated during execution, such as internal activations.

There can be multiple `BoundFunction` instances associated with a single `CompiledFunction` instance. This enables concurrent execution, whether on multiple threads or multiple hardware accelerators.
We can add a `CompiledFunction::bind(...)` method which returns a `std::unique_ptr<BoundFunction>`. Compare to the `onnxSetGraphIO` function; cc @rdzhabarov. Then move `CompiledFunction::execute()` to `BoundFunction::execute()`.
An unresolved issue is how we handle multiple hardware devices. It seems that a `BoundFunction` should also be associated with a specific device.
Question: should we consider dynamic backend registration to be in scope for the `Backend` interface refactoring?
Currently, we need to statically enumerate all the backends in `Backend.h`, in the `BackendKind` enum, and we have a `createBackend` method that creates new backend instances based on this information. This approach results in tight coupling and means the whole project must be rebuilt whenever a new backend is added; it also complicates the integration of new backends. Maybe we should switch to a registry model, where backends register themselves (e.g. their name and the object responsible for managing new instances), and can then be created by calling something like `createBackend(backendName)`?
Good idea, `registerBackend` in a global constructor?
Is this completed? Are we leaving it for the larger runtime project?
I think the remaining refactoring can be part of the runtime project. They'll understand the requirements better.
The `glow::Backend` interface is currently storing a lot of state, and it needs to be refactored. We want to clearly separate the hardware abstraction layer from the different kinds of state we're storing. Over in #1176 we want to compile multiple functions to run on different CPUs. In the future, we also want to support multiple GPUs and other accelerators, so we need a backend interface that separates state related to different functions and different execution units.
I'll elaborate on a design in the comments below.