Operate on WebAssembly?

fitzgen commented 6 years ago

I don't see any references to it anymore, but my understanding was that Souper supported not just LLVM IR but also MSVC's IR. With that in mind, I think it would be exciting if Souper also supported WebAssembly as an IR (potentially hinting to binaryen what optimizations it is missing).

Brief intro to WebAssembly for the uninitiated (from http://webassembly.org/ ):

WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications.

WebAssembly also has an extensive specification, which hopefully makes supporting it easier than it would otherwise be.

I'm not proposing anything concrete (yet?), just trying to gauge interest and get feedback.

Thoughts?

cc @kripken @sunfishcode

fitzgen commented 6 years ago

Related: a simple, offline super optimizer for wasm by @kripken: https://github.com/WebAssembly/binaryen/blob/superoptimizer/src/tools/wasm-analyze.cpp

regehr commented 6 years ago

I'll need to think about this, but I'll just note right now that Souper doesn't support MSVC, but rather MSVC IR and Souper IR are similar enough that Gratian at Microsoft wrote a quick MSVC->Souper translator and then manually implemented a bunch of optimizations that it discovered. Also, we recently learned that the Mono compiler people used Souper similarly! But I don't have much detail about that yet. I assume they used their SSA IR but I don't know for sure.

As far as non-SSA IRs go, a while ago I started on a CompCert->Souper translator but got stalled for lack of time. I don't have a good sense for how easy it would be to do a WebAssembly->Souper translator.

regehr commented 6 years ago

Ok, I thought about this a bit more.

The first issue is getting Souper to understand WebAssembly, this is less trivial than other cases due to the stack machine. On the other hand, a very nice thing about conversion to Souper is that you can stop whenever you want, for example because you run into something tricky. I've never written code to symbolically execute stack machine code to make it into register code, but conceptually this doesn't sound that hard (again, as long as we can bail out when things get tricky).

As far as I know, after un-stack-conversion (is there a word for this?) the underlying operations should map nicely between WebAssembly and Souper, and Souper's synthesizer should find optimizations perfectly normally.

The tricky part, then, is translating the synthesized results back into WebAssembly optimizations. I suppose this would be 100% manual at first, and then there are probably things that make sense to automate.

So anyhow my initial impression is that we want an unstackifier that makes Souper IR out of webasm, and then we can proceed from there. I think this makes more sense than trying to shoehorn a stack machine into the Souper implementation.

fitzgen commented 6 years ago

Thanks for the thoughtful reply :)

One benefit I see of destackifying wasm into Souper IR is that experimentation developing that tool can happen out-of-tree and can move fast and loose.

Souper's ability to bail out on tricky constructs is nice, too. This enables incremental development, where we could get an MVP working and then start ratcheting up the percent wasm that is understood by the translator.

On the other hand, I also have a concern with this approach. Code size is super important for wasm because it is delivered over the network, and therefore potential size optimizations are particularly enticing. I fear that translating wasm into Souper IR won't help us discover "silly" wasm-specific size optimizations, like https://kripken.github.io/blog/binaryen/2018/04/13/a-silly-optimization.html

Maybe this fear is unfounded, and size optimizations aren't as specific to the instruction set and its encoding as I'm assuming? Or maybe translation into Souper IR just isn't the right approach for wasm size optimization, and further development of a wasm-specific super optimizer is a more promising avenue of investigation?

regehr commented 6 years ago

I have the same worry about size! Best we can do is give it a try, and maybe it'll turn out that re-stackifying the Souper output gets back the benefits? Synthesis is a hard enough problem that I think it's worth exploring with Souper before doing more one-offs. Random thought: I wonder if perhaps the webassembly typechecker/bytecode verifier/whatever contains the simple symbolic execution engine that we want for unstackifying webasm?

kripken commented 6 years ago

I agree this is definitely worth trying! I could look into creating a Binaryen pass to convert Binaryen IR to Souper IR, if that would be helpful? (Binaryen IR is almost identical to wasm, can be converted to and from it, and preserves size and performance aspects.) It might be simplest for me to convert to the Souper text format maybe, to get started?

About the size concern, yeah, that seems like it could be an issue. For example if the wasm => Souper IR conversion is inefficient, then Souper's proposed optimizations might be to remove those inefficiencies. We'll just have to make sure it's a good conversion :)

regehr commented 6 years ago

Alon, if you could hack something up that generates Souper text, this would be great! I'm happy to read code, play with crappy prototypes, etc. Also I'm happy to do the Souper runs (synthesizer can be finicky).

One thing to keep in mind is that one of Souper's most powerful features is its path conditions-- so if there's a way to generate those easily as part of the conversion, this will be especially helpful in generating optimizations (though applying the optimizations becomes less straightforward).

kripken commented 6 years ago

Great! I'll see what I can hack up.

Is the Souper paper the best place for docs on the IR text format?

About path conditions, naively it seems like that should be easy since we know the conditions of conditional branches etc. But if we need to do stuff like forward conditions through multiple conditional branches etc. (if the condition is still true in them) then I'm not sure offhand.

regehr commented 6 years ago

Argh we don't really have docs for Souper IR -- but it is really simple. I guess maybe ask me lots of questions and look at the test cases?

find souper/test -name '*.opt'

regehr commented 6 years ago

Re. path conditions, extracting the condition from only the most recent branch is probably the way to go.

chenyang78 commented 6 years ago

Sorry for chiming in on this :) Beside *.opt as John says, I think souper/InstRef.rst is a good place in terms of Souper IR. Otherwise, you may have to end up scanning the relevant code.

In terms of path condition (PC in Souper's terminology), I think knowing the conditions of conditional branches should be enough for the first version. You can skip the BlockPC part at the moment.

regehr commented 6 years ago

Ah, regarding docs/InstRef.rst, let me take a look and make sure it's up to date, has not been touched for a quite a while.

chenyang78 commented 6 years ago

@regehr , right, I just realized that we haven't touched it for a while :)

kripken commented 6 years ago

Thanks @regehr and @chenyang78, those look like what I need to get started!

kripken commented 6 years ago

Ok, I wrote up something that seems like it's starting to have the right general shape for simple things. Feedback welcome! Code is in a souper branch in binaryen, the new pass is here.

To try it, you can do something like

wasm-opt input.wasm --flatten --simplify-locals-nonesting --souperify

(flatten flattens the IR; simplify-locals-nonesting simplifies it while keeping it flat; souperify runs the actual conversion). This will print out all the LHSes it can handle. Example input and output are in the test suite. For example, something conceptually like

int func(int x) {
  if (x & 1)
    x += 1;
  else
    x += 2;
  return x & 1;
}

can be written in wasm like this

    (if
      (i32.and
        (get_local $x)
        (i32.const 1)
      )
      (set_local $x
        (i32.add
          (get_local $x)
          (i32.const 1)
        )
      )
      (set_local $x
        (i32.add
          (get_local $x)
          (i32.const 2)
        )
      )
    )
    (return
      (i32.and
        (get_local $x)
        (i32.const 1)
      )
    )
)

and --souperify will emit 5 LHSes for it, the interesting one is for the returned value, which is

%0 = block 2
%1 = var
%2 = and %1, 1:i32
%3 = blockpc %0 0 %2 1:i32
%4 = blockpc %0 1 %2 0:i32
%5 = add %1, 1:i32
%6 = add %1, 2:i32
%7 = phi %0, %5, %6
%8 = and %7, 1:i32
infer %8

which looks like it might be valid Souper IR. I didn't try to load it yet though... :)

Still a lot of TODOs (IR not yet converted) and FIXMEs (no i1 conversions yet). It just stops a particular LHS when it see stuff it can't handle. For control flow so far only ifs are handled. blockpcs for them are implemented too (see example above), but no regular pcs yet. I suspect ifs are enough control flow to get interesting results.

Some thoughts as I was doing this:

One thing I would expect the Souper approach to help with is "backend" inefficiencies. wasm is often code that has been run through the LLVM optimizer (nothing new there for Souper), but then it goes through an LLVM backend, which maybe introduces extra work there.
wasm not run through the LLVM optimizer is interesting too, e.g. generated from Go or AssemblyScript, and maybe optimized by Binaryen. Souper on such wasm could find stuff the Go, AssemblyScript, and Binaryen optimizers could do better. I'm guessing there is quite a lot there!
We can't expect this to handle very low-level wasm things like -(-64) being smaller than 64, or things with tees, block and if values, etc. All those details get lost in the translation to Souper IR. Overall I expect we can benefit both from the Souper approach and from superoptimizing on wasm directly (I recently got that building again for more tests), they are probably complementary.

regehr commented 6 years ago

Awesome. Is there a particular C/C++ code base that is particularly representative of code you're interested in optimizing, that I should start with? Otherwise I'll just build a few random things.

So far I've been creating wasm by passing --target=wasm32 to clang, is that the right approach?

Indeed, Go's equivalent of InstCombine is quite weak so Souper will find a lot of stuff.

There's an interesting piece of this puzzle that I don't think has been attacked seriously yet, that would not only be useful to solve but also would make a good submission to a conference like PLDI. This is taking patterns from Souper, perhaps abstracting them so they're more broadly applicable, and then making a very fast matcher for them. Just mentioning this since it's something I'm very interested in working on.

kripken commented 6 years ago

Awesome. Is there a particular C/C++ code base that is particularly representative of code you're interested in optimizing, that I should start with? Otherwise I'll just build a few random things.

Hmm, some good samples are here: https://github.com/kripken/embenchen/tree/master/asm_v_wasm (that's output from the emscripten benchmark suite). In particular the larger ones like bullet, box2d, and lua, those are important real-world codebases I think.

So far I've been creating wasm by passing --target=wasm32 to clang, is that the right approach?

In general yes, however it's worth running the binaryen optimizer on it too (wasm-opt -Os input.wasm -o output.wasm) as it will shrink it by 10% or so. Probably not the same optimizations as Souper would look for, but while shrinking it makes the code cleaner which is more likely to work in the --souperify pass.

regehr commented 6 years ago

Thanks Alon. I should have a bit of time this week to work on this!

kripken commented 6 years ago

pc implemention (for ifs, the only control flow so far) added to that branch. Output looks like it might be correct, but I'm not sure :)

regehr commented 6 years ago

I've been reading the code looking to fix a couple of easy things, but if you're in there right now it'll be quicker for you to do these if you have time!

var instructions require a width, e.g. %x:i32 = var
we don't support redundant comparison instructions, so we don't have sgt, sge, ugt, uge (I know this is quirky and weird, but canonicalization has zero relevance in Souper IR so we've not bothered to fix this)
pc arguments need to be i1 instead of i32

I'm working on updating our Inst documentation to reflect this kind of thing, thanks for your patience!

rsas commented 6 years ago

Hi Alon, great idea to apply superoptimization on wasm! Particularly, I second the thought of applying Souper on the wasm directly that comes from non-LLVM back-ends. Let's start with getting the Souper IR right first.

kripken commented 6 years ago

@regehr thanks! I fixed those 3 things on the branch.

Output looks better now, but probably still a lot left to do to get it right. Also, to add operations aside from binaries (unaries, etc.)...

regehr commented 6 years ago

Ok, great! I'm now seeing two small issues.

First, while most LHSs have the proper var declarations there are also some degenerate LHSs that are just this, which makes Souper's parser barf:

; start LHS
infer %0

Then second there's a common pattern of comparing the results of comparisons against zero using the wrong width as seen here:

; start LHS
%0:i32 = var
%1 = slt -1000999:i32, %0
%2 = sle 0:i32, %0
%3 = eq %2, 0:i32
pc %3 1:i1
infer %1

Since Souper doesn't have any implicit sext/zext, it should be:

%3 = eq %2, 0:i1

But leaving aside these issues I'm already seeing some perhaps-useful stuff coming out of the synthesizer!

regehr commented 6 years ago

As a random example, here's an optimization that Souper would like to perform. LHS:

%0:i32 = var
%1 = sle 0:i32, %0
%2:i32 = zext %1
%3:i32 = var
%4 = sle 0:i32, %3
%5:i32 = zext %4
%6 = or %2, %5
%7 = eq %6, 0:i32
infer %7

RHS:

%8:i32 = and %0, %3
%9:i1 = slt %8, 0:i32
result %9

kripken commented 6 years ago

Exciting to see results already! And nice optimization it found there.

Those two issues should be fixed on the branch. Also added unary ops (cttz etc.).

regehr commented 6 years ago

Ok! Next two issues should also be easy:

please use lshr/ashr instead of sshr/ushr
blockpc doesn't get a label in Souper IR, so instead of:
```
%3 = blockpc %0 0 %2 1:i1
```
we want:
```
blockpc %0 0 %2 1:i1
```

regehr commented 6 years ago

And here are a few more results: https://gist.github.com/regehr/a428f164ae60ff783e5872d5efc5416e

To get these, I took wasm_lua_scimark.c.wasm and ran it through binaryen -O3, then ran the output through Souper in "constant synthesis" mode, where the only thing it does is try to prove that the LHS can be replaced by a constant.

I haven't gone and tracked these back to the wasm, so some of them could be artifacts of the translation, but it looks like at least some of them represent real opportunities for shaving code size.

To facilitate backtracking, would it make sense for --souperify to also print the wasm in a comment? Or print a line number for the root of the LHS maybe?

regehr commented 6 years ago

You'll notice that sometimes the LHS contains far more context than is required to justify the optimization, making it harder to understand what's going on. I have a reducer for LHSs that I can run in the future.

kripken commented 6 years ago

Thanks! Fixed those two minor issues on the branch.

About backtracking, I need to think about that - we have a DebugInfo mechanism that might work somehow, but it's not trivial to connect it. Looks like we definitely need a way to do that for debugging, though, as several of those results do look like my translation is wrong, for example the optimizer definitely knows that multiplying an integer by zero is zero, and can add two constants, etc.

kripken commented 6 years ago

To get something simple I added support for setting BINARYEN_DEBUG_SOUPERIFY=1 in the env, then it prints each wasm expression before the souper one. However, it's not in a comment (no trivial way do that atm, I'll look into it) but worse it's printing the "flattened" IR, so it may not be useful enough for real debugging.

However, when trying to use this to debug those weird muls and adds, I think I might be misreading your gist:

; RHS inferred successfully
%0:i32 = add 0:i32, 1:i32
cand %0 1:i32

; RHS inferred successfully
%0:i32 = var
%1:i32 = mul 0:i32, %0
cand %1 0:i32

Does that mean I should be able to search in the output from processing that lua file (after -O3) and find the strings = add 0 or = mul 0 in the LHSes? Neither seems to show up. All the adds and muls have a non-constant left side (starting with %) as I'd expect, so maybe I'm confused? Or was this another file than https://github.com/kripken/embenchen/blob/master/asm_v_wasm/wasm_lua_scimark.c.wasm ?

regehr commented 6 years ago

I'm getting an abort on the latest binaren/souperify:

~/embenchen/asm_v_wasm$ ~/binaryen/build/bin/wasm-opt -O3 wasm_lua_scimark.c.wasm -o souper-input.wasm
~/embenchen/asm_v_wasm$ ~/binaryen/build/bin/wasm-opt souper-input.wasm --flatten --simplify-locals-nonesting --souperify > lhs.opt
Aborted

regehr commented 6 years ago

You're not misreading the gist, Souper legit thinks that there's a mul x, 0 in there. I'll get more details, but it looks like the abort is interfering by making binaryen stop before it gets to the thing I want to show you.

kripken commented 6 years ago

Sorry about the abort - stability has decreased as I added more features (select etc.) - it's converting a lot more code now so it's hitting more corner cases. I pushed some bugfixes now, which fix wasm_lua_scimark.c.wasm and a bunch of other large codebases.

regehr commented 6 years ago

OK! Still working with a wasm_lua_scimark.c.wasm that has been passed through binaryen -03, I now see only one species of souper error, a width mismatch on code like below. Since the result of a Souper comparison is i1, %4 should be extended to i32 before passing it into the phi node.

; start LHS
%0 = block 2
%1:i32 = var
; (i32.or(get_local $15)(i32.const 32))
%2 = or %1, 32:i32
; (i32.add(get_local $16)(i32.const -97))
%3 = add %2, -97:i32
; (i32.lt_u(get_local $17)(i32.const 26))
%4 = ult %3, 26:i32
%5 = phi %0, %1, %4
infer %5

regehr commented 6 years ago

Interesting-- most of the constant optimizations have now gone away, hopefully the missing ones were just the bogus ones! See the remaining optimizations here: https://gist.github.com/regehr/cd0d7a4c1f99226f4a78571bb72df0f4 Synthesizing instructions is slower, I'll start that going now.

kripken commented 6 years ago

Ok, fixed that phi/i1 bug.

Looking at those latest optimizations,

%1 = phi %0, 7:i32, 7:i32

That seems like it might be a bug in my phi generation, I'll look into it.

kripken commented 6 years ago

Souper found something important there! :) It wasn't a bug in my phi generation, it was a real missing optimization. Fixed in https://github.com/WebAssembly/binaryen/pull/1512 (more details there) and also included in the souper branch.

After that, I don't see any more trivial phis with identical constant inputs (although, it's possible more exist due to a different cause somehow).

regehr commented 6 years ago

Nice! I'll followup with results from synthesis, but probably after a delay since I have a trip later this week.

kripken commented 6 years ago

@regehr when you get a chance, a question: I noticed I was halting LHS generation early on anything bad, like a function call, so I switched that to emit a var instead for such values. Now it's generating really huge LHSes for really huge functions, e.g. in qt.wasm some are thousands of lines long (!) Is there a practical size limit for LHSes that I should stop at?

regehr commented 6 years ago

OK, the question "how big to make the LHS" gets into some thorny details.

Answer 1 is that you want the LHS as large as possible since this gives the solver the most information. In the simple cases -- constant synthesis and "nop synthesis" where the output is already available somewhere in the input graph -- this is always the right answer, unless it overloads the solver to the point where timeouts/memoryouts become an issue.

Answer 2 is not as fun because of how Souper synthesis currently works. Synthesis has to make an entire RHS from scratch, it cannot reuse parts of the LHS. SInce synthesis scales to maybe 5-7 instructions, a larger LHS is unlikely to have any equivalent RHS that is within practical reach. We've known about this issue for a while and unfortunately solving it requires some deep changes to Souper. We (or more to the point @rsas) have implemented many of the necessary pieces, but they're not all there yet. In the meantime, instruction synthesis may succeed if you extract only up to some smaller depth, for example graph height in the range 2-4.

So, the long-term answer is: put as much as you can into Souper. The short term answer is that we'd like a configurable depth-bound in the converter from X to Souper. We have such a thing for LLVM IR, though it only lives in a branch. Anyhow this should be an easy feature to add -- if you do that, I'll play around with various depth limits.

Sorry to not have a nicer answer!

regehr commented 6 years ago

Overall, of course, turning a Souper-stopper into a var, as opposed to bailing out entirely, is absolutely the correct answer.

regehr commented 6 years ago

Randomly, if you happen to have linkage info, there's no reason not to peek into called functions and get more instructions from there. It's one of many things I've wanted to try but haven't had time for. Perhaps it doesn't matter much once we've done a good job inlining, though. Certainly a bit of restraint would be called for, or else LHSs would become really big.

regehr commented 6 years ago

Ok, keeping on going with that same .wasm file, here are the constants and nops: https://gist.github.com/regehr/c86af851d60a75e04ce19e270bcfc554 I think some of these are actionable, but definitely not all.

At some point we should talk about undefined behavior. Souper is going by LLVM's rules, which are a bit looser than webassembly's, particularly for shift instructions. This will eventually give us some spurious optimizations. We don't have any kind of pluggability for instruction semantics in Souper, so will have to figure out how to best deal with this issue. Possibly just a one-off hack. Anyhow I'm going to wait until the problem shows itself before thinking more seriously.

regehr commented 6 years ago

One other thing: I'd like to compare the optimizations coming out of the .wasm files with the optimizations coming out of the corresponding .bc files. I don't see those in the embenchen repo, can you let me know where I might find them?

regehr commented 6 years ago

Here are a few synthesis results: https://gist.github.com/regehr/b7b990d7c49cec51e14c646660cfa240

And a few more things we should talk about:

Some of these optimizations are probably legit, others will likely be spoiled by uses of intermediate values that aren't apparent in the Souper code that you generated. This is an issue that we have in LLVM as well. The solution we're working on is to make these "external uses" visible in Souper IR at which point the synthesizer will take this into account. (This is the simple solution; it has the unfortunate property of allowing collections of interdependent optimizations to block each other. We know how to fix this as well but it'll be a bit more work.)
Souper has a cost model that it uses to determine whether a candidate RHS is better than its LHS. This cost model is somewhat guesswork since costs are not totally clear at the IR level. Moreover, the cost model that we've come up with is probably not right for WebAssembly. This is easy to tweak. If you want to look at our current (extremely simple) cost model, it is here: https://github.com/google/souper/blob/master/lib/Inst/Inst.cpp#L640

regehr commented 6 years ago

One more thing: as I mentioned earlier, I have a minimizer for LHSs that pares them down to the smallest code that justifies the given optimization. Let me know if you'd like me to run this before giving you results. It will have the unfortunate effect of ruining the backtracking information that you put into the generated Souper, so perhaps it would make sense to present both minimized and non-minimized versions.

Making Souper optimizations understandable to compiler developers is something I've put a lot of thought into, but I'd be happy to hear more thoughts about this. In the long run it is a non-issue since Souper will be turned into table-driven optimizers that humans won't have to look at :)

kripken commented 6 years ago

Makes sense about the depth parameter, thanks, I'll look into adding that.

Very interesting point about inlining, it seems not too hard to look into called functions (up to a depth limit or such), if we keep around the DataFlow IR for them. Worth looking into, I agree.

I updated all the builds in https://github.com/kripken/embenchen/tree/master/asm_v_wasm , and now they also have the bitcode files there. There are also separate wasm versions for asm2wasm and the LLVM wasm backend - aside from LLVM version differences, those should differ on only the backend conversion of LLVM IR to wasm, so any differences in the optimizations Souper finds would be interesting.

Some of these optimizations are probably legit, others will likely be spoiled by uses of intermediate values that aren't apparent in the Souper code that you generated.

I don't understand this, sorry?

In the long run it is a non-issue since Souper will be turned into table-driven optimizers that humans won't have to look at :)

That's the future I hope for :) I'd like to import large amounts of superoptimizer-generated optimizations in an automatic matter. I guess this ties into the fast matcher idea you mentioned before - I'm also very interested in that, interesting to think about.

Meanwhile I looked into more of the suspiciously trivial stuff we were emitting on lua, and after the two Binaryen PRs all that was left was a serious bug in phi generation, that turned out to need some large refactoring of the code. But now that that's done (pushed to the main branch), I don't see us emitting anything suspiciously trival on lua, and I think our phi handling is now correct.

Looking at your latest results, one optimization is 24 + 36 = 60, which also looks like a trivial thing that indicates either a bug or a missing optimization, I'll investigate...

kripken commented 6 years ago

Hmm, I don't see that 24 + 36 = 60 thing on lua, so perhaps it was only there because of the phi issue that was just fixed?

regehr commented 6 years ago

I'll rerun soon (but perhaps not until early next week, I'm leaving town early tomorrow).

Thanks for adding bitcode! It'll be really interesting comparing results across bitcode and wasm. I assume I can turn off inlining when I optimize wasm, to make it easier to map between the two?

Re. table-driven optimizers let's talk sometime. I think this is an important step forward in how we build these things. I'm happy to come out and spend half a day with y'all sometime.

Regarding external uses, see the bad diagram below. It shows a LHS with two add instructions that Souper would like to turn into a single add, for a win of 1 instruction. However, x+7 has an "outside use" that Souper doesn't capture, so if we implement the optimization we end up with a win of zero. We have some work in progress to capture these in Souper IR, in order to use the presence of outside uses to constrain the synthesizer. The analogous constraints in LLVM IR are the ubiquitous hasOneUse() checks that you see in InstCombine. Hope that made more sense this time.

img_20180426_153121

kripken commented 6 years ago

I assume I can turn off inlining when I optimize wasm, to make it easier to map between the two?

Those bitcode files are after LLVM inlining, but Binaryen may inline. To make it easy to map to the IR, I added a subdir with_names over there that has builds with function names in the binaries (which then show up in the --souperify output).

Re. table-driven optimizers let's talk sometime. I think this is an important step forward in how we build these things. I'm happy to come out and spend half a day with y'all sometime.

Sounds good!

Regarding external uses, see the bad diagram below.

I see now, thanks. Tricky, and, there are probably more such issues with wasm than with LLVM IR, as there are all kinds of structural effects from the surrounding code.

google / souper

Operate on WebAssembly? #323