pytorch / glow

Compiler for Neural Network hardware accelerators
Apache License 2.0

Support ahead-of-time compilation of partitioned graphs #1350

Open bertmaher opened 6 years ago

bertmaher commented 6 years ago

As discussed briefly in #1325 we need a way to perform AOT compilation of partitioned graphs. In its simplest form (serial execution), this process looks like:

for each function F:
  emit code for F
create main function G
for each function F:
  add call(F) to G

Even this fairly trivial implementation requires some refactoring to get there. In particular, the interface to BundleSaver is too simple; it simply consumes an IRFunction and emits a .o and .weights in an output directory.

We could simply punt this problem to the build tool: slurp in all the generated .o files, generate C code to call each of them in turn, then compile that. That solution seems too hacky. The expectation should be to produce a single .o for a single network; partitioning should be transparent.

Some refactoring shall be needed ;-).

I don't quite know enough LLVM yet to recommend the One True Way to do this, but I suspect we'll want an LLVMIRGen (and the llvm::Module contained therein) to last through the emission of several IRFunctions. Incrementally perform codegen and add the result to the module, recording enough information along the way to emit calls. Finally, emit calls for the "umbrella" function, and write the module to an object file.
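A minimal sketch of that shape, using a toy text emitter rather than LLVM so the control flow is easy to see. Everything here is hypothetical: `ModuleBuilder`, `emitFunction`, and `finalize` are illustrative names, not Glow or LLVM APIs, and the "module" is just a string of C source standing in for an llvm::Module.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a code generator whose module outlives any single
// function. These names are assumptions for illustration, not Glow/LLVM APIs.
class ModuleBuilder {
public:
  // "Emit" one partition into the shared module and record its symbol name
  // so the umbrella function can call it later.
  void emitFunction(const std::string &name, const std::string &body) {
    assert(!finalized_ && "cannot add functions after finalize()");
    text_ << "void " << name << "(void) { " << body << " }\n";
    callees_.push_back(name);
  }

  // Emit the umbrella function, which calls every recorded partition in
  // emission order, and return the whole module as one unit.
  std::string finalize(const std::string &mainName) {
    assert(!finalized_ && "finalize() may only be called once");
    text_ << "void " << mainName << "(void) {\n";
    for (const auto &callee : callees_)
      text_ << "  " << callee << "();\n";
    text_ << "}\n";
    finalized_ = true;
    return text_.str();
  }

  const std::vector<std::string> &callees() const { return callees_; }

private:
  std::ostringstream text_;          // accumulated module contents
  std::vector<std::string> callees_; // symbols recorded for the umbrella
  bool finalized_ = false;
};
```

The point of the sketch is only the lifetime: one builder object survives across several `emitFunction` calls, and the umbrella function is generated last from the recorded names, so a single output is produced for the whole network.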

opti-mix commented 6 years ago

Yeah, it would require some refactoring ;-) I'd be happy to help with this.

stoklund commented 6 years ago

It makes sense to stick all of the functions into a single LLVM module along with a main driver function. The result will most likely be that LLVM inlines all of the functions into the main driver.

Makes you wonder what we're trying to achieve by AOT-compiling graphs this way?

opti-mix commented 6 years ago

> It makes sense to stick all of the functions into a single LLVM module along with a main driver function. The result will most likely be that LLVM inlines all of the functions into the main driver.

First, we can prevent those functions from inlining.

Second, I think that the driver does not necessarily need to know what the callees are. I'd imagine it could be table-driven. The tables would contain function pointers for the functions representing the partitions, plus other meta-information, and the driver could use any policy (e.g. serial, parallel, etc.), maybe even specified by a command-line option, to run the graph.

This way we could reuse the same driver to run any partitioned graph.
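A rough sketch of what such a table-driven driver could look like, assuming every partition is compiled to the same signature. The names (`PartitionFn`, `PartitionEntry`, `runSerial`) and the two toy partitions are hypothetical, not existing Glow code; only the serial policy is shown, and other policies would walk the same table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed common signature for a compiled partition: base pointers to the
// shared weights and activations buffers. This is an illustrative choice.
using PartitionFn = void (*)(uint8_t *weights, uint8_t *activations);

// One table row per partition. Further meta-information (dependencies,
// preferred device, ...) would be added as extra fields.
struct PartitionEntry {
  const char *name;
  PartitionFn fn;
};

// Two toy partitions standing in for generated code.
static void partition0(uint8_t *w, uint8_t *a) {
  a[0] = static_cast<uint8_t>(w[0] + 1);
}
static void partition1(uint8_t *, uint8_t *a) {
  a[1] = static_cast<uint8_t>(a[0] * 2);
}

// The table the compiler would emit next to the partitions.
static const PartitionEntry kPartitionTable[] = {
    {"partition0", partition0},
    {"partition1", partition1},
};

// Generic serial policy: the driver knows nothing about the callees beyond
// what the table tells it. Returns the number of partitions executed.
size_t runSerial(const PartitionEntry *table, size_t count, uint8_t *weights,
                 uint8_t *activations) {
  for (size_t i = 0; i < count; ++i) {
    assert(table[i].fn != nullptr && "table entry has no function");
    table[i].fn(weights, activations);
  }
  return count;
}
```

Because the driver only reads the table, the same `runSerial` could be linked against any partitioned graph. Calling through function pointers stored in a table also makes it less likely that the optimizer folds every partition into the driver.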

bertmaher commented 6 years ago

@stoklund Right. AOT compiling the graph serially doesn't really achieve anything by itself. I'd consider it a stepping stone to implementing a more useful executor (e.g., multithreaded).