iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Add generic IR symbol deduplication (vm.rodata, flow.executable, etc) #1144

Open benvanik opened 4 years ago

benvanik commented 4 years ago

During all of our compilation phases, symbols are generated that may have duplicates. On the simpler end, for example, several vm.rodata symbols may have the exact same byte contents (coming from the same UTF8-encoded string, etc). Moving higher in the stack, there is duplication that is trickier to detect and that would be nice to handle in a generic way. Several flow.executables may contain the exact same IR snippets (which is likely if the input IR was generated by library calls/etc).

A DuplicateSymbolFolding pass that ran on a region (module, etc) that did IR diffing could handle all of these cases.
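To make the idea concrete, here is a minimal sketch in Python rather than as an actual MLIR pass. It assumes each symbol body has already been reduced to a hashable value (raw bytes, or canonical IR text), and the function name `fold_duplicate_symbols` is hypothetical, not an existing IREE API.

```python
from typing import Dict, Hashable, Tuple


def fold_duplicate_symbols(
    symbols: Dict[str, Hashable],
) -> Tuple[Dict[str, Hashable], Dict[str, str]]:
    """Keep the first symbol seen for each distinct body.

    Returns the surviving symbols and a map from each folded symbol name
    to the surviving name that its uses should be rewritten to.
    """
    canonical_for_body: Dict[Hashable, str] = {}
    kept: Dict[str, Hashable] = {}
    replacements: Dict[str, str] = {}
    for name, body in symbols.items():
        # setdefault returns the earlier name if this body was already seen.
        canonical = canonical_for_body.setdefault(body, name)
        if canonical == name:
            kept[name] = body
        else:
            replacements[name] = canonical
    return kept, replacements
```

A real pass would also need to walk the region and rewrite every symbol reference using the replacement map; that step is omitted here.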

stellaraccident commented 4 years ago

I'm not sure that a "generic" pass for all of this stuff is feasible/advisable, but passes that dedup flow.executable ops and vm.rodata ops individually may help in some cases. I don't, however, see a lot of big resnet low hanging fruit here:

It seems that if optimizing on this axis, some level of deduping will bear fruit, although I am dubious as to how much it helps resnet specifically. Surely, though, if we don't have anything to do this deduping, we'll lack the motivation to do other optimizations (mostly de-specialization) that ultimately yield benefits.

Given that, there are three levels where we could do this:

  1. flow.executable
  2. hal.executable
  3. vm.rodata

Number 1 is annoying, because at this level, the executable still contains arbitrary IR, requiring a full recursive equivalence check. Possible, but annoying. Number 2 can be easier because all of the non-trivial parts have been compiled at this point, and all of that just becomes an attribute compare. In both cases, there is some internal symbol variability that, if we made it canonical, we could just directly compare the IR. Right now, there would be several special cases due to non-load-bearing differences.
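As a rough illustration of making internal symbol names canonical so that IR can be compared directly, the sketch below renames `@symbols` in textual IR by order of first appearance. The IR strings are made up for the example and are not real IREE output, and the regex assumes symbol names consist only of letters, digits, and underscores.

```python
import re

# Assumption: symbols look like @name with [A-Za-z0-9_] characters only.
_SYMBOL = re.compile(r"@[A-Za-z_][A-Za-z0-9_]*")


def canonicalize_symbols(ir_text: str) -> str:
    """Rename each distinct @symbol to @sym0, @sym1, ... in first-use order."""
    renames = {}

    def rename(match):
        # len(renames) is evaluated before insertion, so new names get the
        # next free index and repeated names reuse their earlier index.
        return renames.setdefault(match.group(0), f"@sym{len(renames)}")

    return _SYMBOL.sub(rename, ir_text)


# Hypothetical executables that differ only in their generated names.
a = "executable @ex_dispatch_3 { entry @ex_dispatch_3_fn }"
b = "executable @ex_dispatch_7 { entry @ex_dispatch_7_fn }"
same = canonicalize_symbols(a) == canonicalize_symbols(b)
```

With names normalized like this, the two hypothetical executables compare equal as plain text, which is the "just directly compare the IR" case described above.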

Number 3 is probably too late, but valuable as a final fallback (likely useful for any global constants that should be deduped): code has already been emitted to initialize each executable, so simply deduping the symbols might save some binary size but would not help with initialization time/overhead because they would just alias the backing data, not undo the initialization code.
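A sketch of what the rodata-level fallback amounts to, assuming the rodata segments are available as a name-to-bytes mapping (the `dedupe_rodata` helper and its inputs are hypothetical). It only computes which symbols could alias another and how many bytes that would save; it does nothing about the initialization code mentioned above.

```python
import hashlib
from typing import Dict, Tuple


def dedupe_rodata(segments: Dict[str, bytes]) -> Tuple[Dict[str, str], int]:
    """Return (alias map, bytes saved) for byte-identical rodata segments."""
    first_by_digest: Dict[bytes, str] = {}
    aliases: Dict[str, str] = {}
    bytes_saved = 0
    for name, data in segments.items():
        digest = hashlib.sha256(data).digest()
        canonical = first_by_digest.setdefault(digest, name)
        # Compare the actual bytes too, so a hash collision cannot alias
        # segments with different contents.
        if canonical != name and segments[canonical] == data:
            aliases[name] = canonical
            bytes_saved += len(data)
    return aliases, bytes_saved
```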

So leaving out priority of doing this, I would probably implement:

  1. Some canonicalization cleanups to executable symbols so that they do not trivially diff, making matching easier.
  2. A hal.executable or flow.executable deduper (tbd which one is better).
  3. An rodata deduper (which is pretty trivial) to catch any final things that are easily aliased together.

stellaraccident commented 4 years ago

Also, since size/startup overhead is not the primary thing we are losing on right now, I'd be tempted to back burner this until later.

benvanik commented 4 years ago

yeah rodata is pretty much classic link-time deduping -- too late to change things but helps with binary size.

flow.executable is the easiest to modify/despecialize as we have a good deal of flexibility in shapes and constants (it's trivial to pull them out). A fancy cost function would be nice, but in a lot of cases we can make some simple decisions at the HLO level that we know are safe (making outer dimensions used by convs dynamic, etc). Basically, anything that would end up just changing the invocation count (workgroup count) of the dispatch should be free. If our convs are all of different shapes but similar kernel sizes that should gain much more deduplication.
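One way to picture "pulling out" constants is to turn literal values in a body into extra operands, so that bodies differing only in those values become textually identical. The sketch below does this on made-up textual bodies. It assumes only standalone integer literals should be hoisted, and `hoist_constants` is a hypothetical helper, not IREE code.

```python
import re
from typing import List, Tuple

# Standalone integer literals only: skip digits inside identifiers such as
# %arg0 and leave floating point literals such as 1.5 untouched.
_INT_LITERAL = re.compile(r"(?<![\w%.])\d+\b(?!\.\d)")


def hoist_constants(body: str) -> Tuple[str, List[int]]:
    """Replace integer literals with %constN operands; return template and values."""
    constants: List[int] = []

    def to_operand(match):
        constants.append(int(match.group(0)))
        return f"%const{len(constants) - 1}"

    return _INT_LITERAL.sub(to_operand, body), constants


# Two hypothetical bodies that differ only in one constant.
template_a, values_a = hoist_constants("scale %arg0 by 4")
template_b, values_b = hoist_constants("scale %arg0 by 16")
```

Both bodies reduce to the same template, so a single executable could serve both dispatch sites with the differing value passed in as an operand.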

One nice thing about running deduplication at the various stages is that in some cases what isn't easily identifiable as a duplicate early on may become a duplicate later - for example, if one has a shape of x4 and another x16 but the backend always pads to x16. That avoids the need for super involved higher level despecialization as just ensuring good canonicalization while lowering will help.
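The x4 versus x16 case can be sketched as follows, assuming a backend that always rounds each dimension up to a multiple of 16. The `dedup_key` function is a hypothetical stand-in for whatever comparison a later-stage dedupe would use.

```python
from typing import Tuple


def pad_to_multiple(extent: int, multiple: int) -> int:
    """Round extent up to the next multiple (ceiling division, then scale)."""
    return -(-extent // multiple) * multiple


def dedup_key(op_name: str, shape: Tuple[int, ...], multiple: int = 16):
    """Key that treats shapes as equal once they pad to the same extents."""
    return op_name, tuple(pad_to_multiple(d, multiple) for d in shape)


# Different before lowering, identical after the backend pads to x16.
early_match = ("add", (4,)) == ("add", (16,))
late_match = dedup_key("add", (4,)) == dedup_key("add", (16,))
```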

Something you can check is ensuring the rodata doesn't contain names. For example, I know that the function name is based on the original outlined dispatch region name, which then gets exported as the entry point. That'll cause the rodatas to differ as they then contain the unique name of the source function. Making every flow.executable entry point name "main" or something would be a good sanity check on that.

benvanik commented 4 years ago

Another thing to try is disabling RematerializeDispatchConstantsPass - that is what currently takes a dispatch region with no constant values (only ops acting on input args) and inlines some of the constants - deduping prior to that may yield a better hit rate.

benvanik commented 4 years ago

The startup time is most pronounced on Vulkan (where we need to compile each shader with the driver) - 120ms on desktop for me for resnet and probably much more on mobile. We'll want to multithread that and add caching, but caching isn't guaranteed and it's still a cold start issue. #1587 & co that link executables together will help with LLVM AOT/VMLA (as all will end up in a single shared object/VM module) but in SPIR-V land the shaders will stay separate. Not a high priority vs other things as you say, but will eventually be important and could make use of this if we are getting the higher level parts in place anyway.

stellaraccident commented 4 years ago

Ok, I'll go ahead and implement this, along with some utility functions that may prove useful. I think that as we get sophisticated in other ways, having these passes in there will mean that we see the results, which creates the right ramps to optimize further.

benvanik commented 4 years ago

Looking at this dump of resnet: https://gist.githubusercontent.com/benvanik/280d94acaecbfb577984f16b391aedc7/raw/bcccb340afbe4810d15cf6285fa4ea4b22654873/resnet%2520regions

I went through and did a simple textual comparison of regions (select inner contents, ctrl-f). (Three images were attached here in the original issue.)

And so on. This is with the old identify dispatch region code. I'm not sure if you were looking at it with your new identify dispatch region 2 algorithm (that may pessimize things), but there's definitely a lot of duplication that an IR diff dragnet should be able to catch.

Note that this is prior to running the RematerializeDispatchConstantsPass as mentioned above - after that runs there's a much higher chance that the region bodies will differ by small constant values, and a pass that ran after dedupe to find all dispatches and compare common constants (or only do RematerializeDispatchConstantsPass on flow.executables after outlining) would fix that.
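The "find all dispatches and compare common constants" step might look roughly like this, assuming each dispatch site of a deduplicated executable is represented as a tuple of its constant operands and that all sites have the same arity. The representation and the `common_constant_operands` name are illustrative only.

```python
from typing import Dict, Hashable, Sequence, Tuple


def common_constant_operands(
    call_sites: Sequence[Tuple[Hashable, ...]],
) -> Dict[int, Hashable]:
    """Return {operand index: value} for operands identical at every site.

    Such operands could be inlined back into the shared executable, while
    the remaining ones must stay as real arguments.
    """
    if not call_sites:
        return {}
    first = call_sites[0]
    return {
        index: value
        for index, value in enumerate(first)
        if all(site[index] == value for site in call_sites)
    }
```

For example, with sites `(1, 4, 3)` and `(1, 16, 3)`, operands 0 and 2 are shared and only operand 1 needs to remain an argument.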

benvanik commented 3 years ago

Some thoughts on discord as to additional reasons to generalize deduplication (in the short term too): https://discordapp.com/channels/689900678990135345/760577505840463893/777429896270970890 https://discordapp.com/channels/689900678990135345/760577505840463893/777447747091431434