vnmakarov / mir

A lightweight JIT compiler based on MIR (Medium Internal Representation) and C11 JIT compiler and interpreter based on MIR
MIT License

First use of MIR JIT in Ravi #11

Open dibyendumajumdar opened 5 years ago

dibyendumajumdar commented 5 years ago

Hi @vnmakarov

I implemented a PoC using MIR JIT as the backend for Ravi. You can see this here: https://github.com/dibyendumajumdar/ravi/commit/4e445f46bfe239db6a85d4b338dc343cd5103229

So far I have only tested basics ... looks promising.

Regards

dibyendumajumdar commented 5 years ago

So far it seems to be working fine with all my tests, which is pretty impressive. I can give you performance figures if that is of interest. Right now it is behind LLVM and OMRJIT but better than NanoJIT: compiled code runs 2 to 3 times faster than interpreted code (*). NanoJIT-compiled code ran at the same speed as the interpreter, so this is a much better result.

(*) This is when type annotations are used.

vnmakarov commented 5 years ago

So far it seems to be working fine with all my tests, which is pretty impressive.

C2MIR in the repository still cannot pass the bootstrap test, but I finally have a fix. At the end of the week I will push changes to c2mir that pass the bootstrap test.

I can give you performance figures if that is of interest. Right now it is behind LLVM and OMRJIT but better than NanoJIT: compiled code runs 2 to 3 times faster than interpreted code (*). NanoJIT-compiled code ran at the same speed as the interpreter, so this is a much better result.

(*) This is when type annotations are used.

As I understand it, you use the C-to-MIR compiler to generate binary code through MIR. I should say there are still a lot of things to improve in the compiler for generated-code performance.

I am going to fix these issues in the near future, so the performance of code generated by the C-to-MIR compiler should improve.

Also, all calls are done through small thunks requiring two x86-64 insns, which might slow down the code too. Thunks allow some applications to change a called function's code without changing the caller's code. But maybe that is not necessary for some applications, and I could add an option to use direct calls. I will think about it.
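To illustrate the idea (a minimal sketch in portable C with invented names, not MIR's actual thunk code): every call goes through one patchable indirection, so a function can be recompiled and swapped in without rewriting any call site.

```c
#include <stdio.h>

static void foo_v1(void) { puts("old code"); }
static void foo_v2(void) { puts("recompiled code"); }

/* The "thunk": callers always call through this slot.  The generated
   x86-64 equivalent is a two-insn stub, something like:
       movabs r11, <target-address>
       jmp    r11                                                    */
static void (*foo_thunk)(void) = foo_v1;

int main(void) {
    foo_thunk();          /* runs the old code                       */
    foo_thunk = foo_v2;   /* the JIT patches only the thunk target   */
    foo_thunk();          /* every caller now reaches the new code   */
    return 0;
}
```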

Although by my evaluation C2MIR is about 10 times faster than GCC -O2, I should say that my major goal for the C-to-MIR compiler was a simple implementation, not the fastest one. I planned to use C2MIR only while building Ruby and to generate only MIR while Ruby runs. But you are the second person proposing to generate C code, so I should reconsider what I should do to better support this approach.

dibyendumajumdar commented 5 years ago

Hi, hand-coding MIR generation is possible but it is a lot of effort. I have a hand-coded version for LLVM and it is a nightmare to maintain.

I can do it, but I am not sure the issue is in the C front end. Interpreter code tends to be branchy and uses the heap as its stack, so the problem might be elsewhere. I think I may need to implement stack copies of the Lua stack to help the code gen; LLVM doesn't seem to need this, as it works out the required optimisations by itself.
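For context, here is a rough sketch of what I mean by branchy (names and layout invented, not Ravi's actual structures): every dynamic arithmetic op guards on runtime type tags before doing the real work, which limits what the optimizer can prove.

```c
typedef struct { double n; int tag; } TValue;  /* hypothetical tagged value */
#define TAG_NUM 3

/* Add two VM registers; fall back to the slow path unless both hold numbers. */
int vm_add(TValue *ra, const TValue *rb, const TValue *rc) {
    if (rb->tag == TAG_NUM && rc->tag == TAG_NUM) {  /* type guards */
        ra->n = rb->n + rc->n;
        ra->tag = TAG_NUM;
        return 0;
    }
    return 1;  /* slow path: coercions, metamethods, errors */
}
```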

dibyendumajumdar commented 5 years ago

C2MIR in the repository still cannot pass the bootstrap test, but I finally have a fix. At the end of the week I will push changes to c2mir that pass the bootstrap test.

Wow that's great progress.

dibyendumajumdar commented 5 years ago

BTW I have two LLVM implementations: a hand-coded one with aliasing annotations (LLVM type-based aliasing info), and a second one that uses the dmr_C front-end.

The hand-coded version is fastest. The C-based version is slower, and comparable to OMRJIT, which also works off the C front-end. This leads me to believe that dmr_C is generating poor output from the C code; unfortunately the Linux Sparse C front-end generates bad code when its 'optimizations' are enabled, so I have to switch all of that off. So I emit the verbose IR and rely upon LLVM / OMRJIT to simplify it. However, as mentioned, this is probably not very effective.

I think your C front-end, at least, is being specifically designed for MIR, and therefore I hope it can generate better MIR code than would be possible otherwise.

dibyendumajumdar commented 4 years ago

@vnmakarov Looks like the latest commits have improved performance for Ravi - possibly due to the loop optimisation?

vnmakarov commented 4 years ago

@vnmakarov Looks like the latest commits have improved performance for Ravi - possibly due to the loop optimisation?

I think there are a lot of factors involved.

I added some simple benchmarks to compare MIR-generator performance against GCC and Clang with -O2. You can see the current state with make c2mir-bench, and thoughts on what could further improve MIR-generator performance are in c2mir/README.md.

I might implement more optimizations, but my major goal is still to keep the MIR generator small and fast.

dibyendumajumdar commented 4 years ago

@vnmakarov Thank you for the information. I thought at this stage it might be useful to share some performance figures with you. The tests are micro-benchmarks - essentially Ravi versions of some of the tests in the language shootout benchmark.

First, LLVM 8 with hand-coded IR generation (including type-based aliasing info and some use of optimizer hints on branches):

a) Simple for loop (1000000000 iterations): 0 secs
b) Fannkuchen(11): 3.36 secs
c) matrix multiplication (1000): 0.98 secs
d) mandelbrot(4000): 1.08 secs
e) sieve: 5.65 secs

Next, LLVM 8 with dmr_C-based C code converted to IR:

a) Simple loop: 0.87 secs
b) Fannkuchen(11): 3.28 secs
c) matrix multiplication (1000): 2.42 secs
d) mandelbrot: 1.58 secs
e) sieve: 4.91 secs

Next, OMRJIT with dmr_C-based C code converted to IR:

a) Simple loop: 0.88 secs
b) Fannkuchen(11): 4.6 secs
c) matrix multiplication (1000): 2.28 secs
d) mandelbrot: 2.02 secs
e) sieve: 6.86 secs

Finally, the c2mir backend:

a) Simple loop: 0.87 secs
b) Fannkuchen(11): 5.87 secs
c) matrix multiplication (1000): 2.4 secs
d) mandelbrot: 2.5 secs
e) sieve: 7.18 secs

Lastly, the Ravi interpreter:

a) Simple loop: 3.0 secs
b) Fannkuchen(11): 14.76 secs
c) matrix multiplication (1000): 8.26 secs
d) mandelbrot: 7.38 secs
e) sieve: 18.32 secs

All tests use type annotations in Ravi. All tests were done on RHEL 7.7 running on an x86-64 Xeon.

I think this is a fantastic result for MIR, which is small (800k compiled) and yet achieves performance approaching much, much bigger compiler tools. Kudos!

vnmakarov commented 4 years ago

Thank you very much for the interesting data. I really appreciate that.

It is interesting for me to see the result for the simple loop with LLVM 8. I suspect it is because of a successful scalar evolution optimization, which basically evaluates the loop during compilation. In practice, there are very few opportunities for this optimization in real-world programs.
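A tiny example (mine, not from Ravi) of what scalar evolution does with such a loop: the accumulator has a closed form in n, so the compiler can delete the loop entirely.

```c
/* With clang -O2 this typically becomes straight-line arithmetic on n
   (roughly n * (n - 1) / 2) with no loop executed at run time.       */
long sum_to(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}
```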

I see the Ravi interpreter is really fast. In my experience, interpreters for statically typed languages are usually about 5-6 times slower than simple generated native code.

dibyendumajumdar commented 4 years ago

The for-loop result for LLVM is good because it decides there is no need to loop at all! This would be trivial for plain C code, but interpreter code is harder to figure out.

The only anomaly is the sieve result with hand-coded LLVM - the C intermediate version does better because the hand-coded version is missing some optimizations to do with conditional statements.

Re Ravi interpreter performance, please note this is with type annotations, so without type info, performance will drop. Even the JIT is largely ineffective without type info.

dibyendumajumdar commented 4 years ago

Just for reference, here are the test programs:

https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/fornum_test1.lua
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/fannkuchen.ravi
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/matmul1_ravi.lua
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/mandel1.ravi
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/sieve.lua

dibyendumajumdar commented 4 years ago

I added some simple benchmarks to compare MIR-generator performance against GCC and Clang with -O2. You can see the current state with make c2mir-bench

Nice. I got one failure:

+++++ funnkuch-reduce 11 +++++
gcc -O2:                                 1.63         1.00x
./c2m -Ic-benchmarks -I. c-benchmarks/funnkuch-reduce.c -eg 11: FAILED
vnmakarov commented 4 years ago

Nice. I got one failure:

+++++ funnkuch-reduce 11 +++++
gcc -O2:                                 1.63         1.00x
./c2m -Ic-benchmarks -I. c-benchmarks/funnkuch-reduce.c -eg 11: FAILED

Thank you for reporting this. On my computer I see no failure, but when I used valgrind I found a bug in the new combiner pass. In any case, I have fixed it; the patch is in the repository now.

dibyendumajumdar commented 4 years ago

Nice. I got one failure:

+++++ funnkuch-reduce 11 +++++
gcc -O2:                                 1.63         1.00x
./c2m -Ic-benchmarks -I. c-benchmarks/funnkuch-reduce.c -eg 11: FAILED

Thank you for reporting this. On my computer I see no failure, but when I used valgrind I found a bug in the new combiner pass. In any case, I have fixed it; the patch is in the repository now.

Okay thank you. I won't have a chance to try this out until the weekend - I will report back after testing.

pmatos commented 4 years ago

@dibyendumajumdar Why use C2MIR in Ravi and not the MIR API?

dibyendumajumdar commented 4 years ago

Hi, a large part of the issue is correctly interfacing with C data structures. With the MIR API I would have to deal with this myself; the C interface handles this work for me. It is also more effort to change a lower-level implementation, which makes changes more difficult.
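A small illustration of the point (struct names invented): with the C interface, field layout is the C compiler's problem, whereas with a low-level IR API every field access becomes an explicit displacement that I would have to keep in sync with the runtime headers by hand.

```c
#include <stddef.h>

typedef struct { double n; int tag; } TValue;  /* hypothetical runtime value */

/* Via c2mir: just write the field access; the front end works out layout. */
int is_number(const TValue *tv) { return tv->tag == 3; }

/* Via a raw IR API I would instead emit a load at an explicit displacement,
   i.e. something equivalent to this hand-computed access:                  */
int is_number_by_offset(const void *tv) {
    return *(const int *)((const char *)tv + offsetof(TValue, tag)) == 3;
}
```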

dibyendumajumdar commented 4 years ago

Hi,

I re-ran some perf tests today - the same ones I reported back in Nov 2019.

MIR (-O2)

a) Simple loop: 0.87 secs
b) Fannkuchen(11): 5.21 secs
c) matrix multiplication (1000): 2.46 secs
d) mandelbrot: 2.53 secs
e) sieve: 7.74 secs

MIR (-O3)

a) Simple loop: 0.88 secs
b) Fannkuchen(11): 5.29 secs
c) matrix multiplication (1000): 2.47 secs
d) mandelbrot: 2.51 secs
e) sieve: 7.88 secs

LLVM 10 (-O2)

a) Simple for loop (1000000000 iterations): 0 secs
b) Fannkuchen(11): 3.55 secs
c) matrix multiplication (1000): 0.92 secs
d) mandelbrot(4000): 1.04 secs
e) sieve: 5.75 secs

Observations:

1) No material difference between -O2 and -O3 in MIR.
2) Performance figures are largely similar to those reported in Nov 2019.

vnmakarov commented 4 years ago

Hi,

I re-ran some perf tests today - the same ones I reported back in Nov 2019.

Thank you, Dibyendu. I don't like the current MIR compilation speed and generated-code performance. I'll probably rewrite most optimizations to work on SSA and remove/add some of the optimizations that work in -O3 mode.

I've created an ssa branch and re-implemented SCCP on SSA, which improved compilation time. I don't know how fast I'll move forward in this direction; the work will probably continue through all of 2020. My goal is to keep the same code size but improve compilation speed and generated-code quality.

dibyendumajumdar commented 4 years ago

I don't like the current MIR compilation speed and generated-code performance. I'll probably rewrite most optimizations to work on SSA and remove/add some of the optimizations that work in -O3 mode.

I thought MIR already generated much better code than anything else of this size. Also, interpreter code poses many challenges because of its heavy use of branching for type checks. If I could write a MIR backend by hand for Ravi then maybe I would get better performance too, but I don't fancy trying to implement all the logic for accessing C structures.

I've created an ssa branch and re-implemented SCCP on SSA, which improved compilation time. I don't know how fast I'll move forward in this direction; the work will probably continue through all of 2020. My goal is to keep the same code size but improve compilation speed and generated-code quality.

I noticed the ssa branch. I hope it pays off. I do have a suggestion: I would recommend using MIR as-is for Ruby first, to see whether you need specific optimizations. Apologies if you are already doing this. There is no point optimizing without hard evidence, in particular from the original use case.

Thanks and Regards Dibyendu

vnmakarov commented 4 years ago

I noticed the ssa branch. I hope it pays off. I do have a suggestion: I would recommend using MIR as-is for Ruby first, to see whether you need specific optimizations. Apologies if you are already doing this. There is no point optimizing without hard evidence, in particular from the original use case.

Currently I am investigating which optimizations besides RA and code selection are particularly valuable. I spent a lot of time implementing register-pressure-sensitive invariant code motion (to improve matrix multiplication) but it did not result in better code. Knowing the importance of RA, I've implemented register renaming (it can be considered a cost-free form of live-range splitting for the RA), with the same result. So these might go away. But that is the usual way of working on optimizations (a lot of failed tries).
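As a concrete (invented) example of the trade-off: hoisting the invariant a + b below saves an add per iteration, but the hoisted value then occupies a register for the loop's entire duration, and under high register pressure the resulting spill can cost more than the recomputation it saved; hence the pressure-sensitive variant.

```c
long dot_scale(const long *v, int n, long a, long b) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += v[i] * (a + b);  /* loop-invariant: hoist only if a register
                                 can be kept free for the whole loop     */
    return s;
}
```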

You are right. I feel the same: I should start work on the actual Ruby JIT implementation to get more real feedback. I'll probably start working on it in a week or two. But it will be a private repository, which will be opened only after I get an initial implementation and some encouraging results.

dibyendumajumdar commented 4 years ago

You are right. I feel the same: I should start work on the actual Ruby JIT implementation to get more real feedback. I'll probably start working on it in a week or two. But it will be a private repository, which will be opened only after I get an initial implementation and some encouraging results.

It would be interesting to hear what you find.

For Ravi my conclusion is that I need to generate more JIT-friendly code - for example, where possible using the C stack instead of the heap. Right now the Ravi stack is on the heap, except for loop index variables, which I optimize to the stack.
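A sketch of the difference (shapes invented, not Ravi's real stack layout): when the accumulator and index live in the heap-allocated Lua stack, every access is a load/store that the optimizer must treat conservatively; as C locals they become MIR regs.

```c
/* Old scheme: values live in the heap-allocated interpreter stack. */
double sum_on_heap_stack(double *lua_stack, int n) {
    lua_stack[0] = 0.0;                              /* accumulator slot */
    for (lua_stack[1] = 0; lua_stack[1] < n; lua_stack[1] += 1)
        lua_stack[0] += lua_stack[1];                /* memory traffic   */
    return lua_stack[0];
}

/* New scheme: the same non-GC-able scalars become C locals (MIR regs). */
double sum_in_locals(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += i;
    return acc;
}
```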

dibyendumajumdar commented 4 years ago

Hi @vnmakarov

I applied the latest MIR code to Ravi today. Here are updated numbers from my tests

a) Simple loop: 1.08 secs
b) Fannkuchen(11): 5.47 secs
c) matrix multiplication (1000): 2.56 secs
d) mandelbrot: 2.54 secs
e) sieve: 8.93 secs

My impression is that performance has regressed somewhat.

Regards

vnmakarov commented 4 years ago

a) Simple loop: 1.08 secs
b) Fannkuchen(11): 5.47 secs
c) matrix multiplication (1000): 2.56 secs
d) mandelbrot: 2.54 secs
e) sieve: 8.93 secs

My impression is that performance has regressed somewhat.

Thank you for the data, and sorry for the delay in answering (I was on vacation). In my tests I found only a 1% performance regression. The change at -O2 (the default) was switching from CSE to GVN. Maybe this is the problem, or the SSA variants of the optimizations do not work as well as the previous non-SSA ones. I'll investigate what is going on.
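For reference, a tiny example (mine) of the difference: purely local CSE only sees redundancy inside one basic block, while GVN numbers values across the whole function and so also catches the second a + b, which sits in another block.

```c
long f(long a, long b, int p) {
    long x = a + b;    /* first computation                          */
    long y = 0;
    if (p)
        y = a + b;     /* same value number as x, but in another
                          basic block: GVN finds it, local CSE doesn't */
    return x + y;
}
```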

dibyendumajumdar commented 4 years ago

Thank you @logzero for the work getting MIR to compile on Windows with MSVC. I am able to compile and use MIR now in Ravi for Windows.

@vnmakarov I do get some Lua test failures - I haven't analyzed them yet, but they seem to be to do with numeric values (perhaps overflow issues?). I will report back when I have analyzed a bit more.

dibyendumajumdar commented 4 years ago

Thank you for the data, and sorry for the delay in answering (I was on vacation). In my tests I found only a 1% performance regression. The change at -O2 (the default) was switching from CSE to GVN. Maybe this is the problem, or the SSA variants of the optimizations do not work as well as the previous non-SSA ones. I'll investigate what is going on.

Hi @vnmakarov
Is it at all possible to maintain the non-ssa version in a branch? Otherwise it seems we have lost the ability to compare the performance of the two approaches.

Regards

vnmakarov commented 4 years ago

Is it at all possible to maintain the non-ssa version in a branch? Otherwise it seems we have lost the ability to compare the performance of the two approaches.

I've created a non-ssa branch, which is basically the current trunk with the generator as it was before merging the ssa branch.

dibyendumajumdar commented 4 years ago

@vnmakarov Thank you. I will do a comparison of some of my tests over the weekend, and share the inputs (C) / outputs (MIR).

dibyendumajumdar commented 4 years ago

I am attaching the intermediate C file and the output from the ssa/non-ssa branches. I don't know whether any obvious pattern is present. Anyway, I would welcome any feedback. sieve.nossa.txt sieve.ravi.c.txt sieve.ssa.txt

dibyendumajumdar commented 4 years ago

Hi @vnmakarov, I have been working on a new code generator in Ravi to help the JIT. The idea is to use the C stack / C variables where possible, e.g. when a value holds a primitive type and is not GC-able. With this generator I expect to get better results from MIR. I am attaching the output of this new generator for the sieve program written in Ravi. Even in this case I see better results from the non-ssa branch compared to ssa - the difference is about 6.7 secs vs 7.5 secs. nsieve_c.txt

vnmakarov commented 4 years ago

Thank you for sending these files.

So far I have compared the optimizations before RA. My impression is that the SSA pipeline works better than the non-SSA one (generating ~760 insns vs ~800 insns).

So I guess the problem is in RA or code selection. Maybe I made some changes to them on the SSA branch which affect the code, because the final result for SSA is worse (553 insns vs 539).

Or it may be the additional copy propagation in SSA, which removes natural live-range splitting points so that the final RA is worse. I'll investigate it more. It will take some time.

It would be great to have a stand-alone program (with main) in MIR or C to experiment with and run, but I guess it is very hard to produce such a stand-alone program.

dibyendumajumdar commented 4 years ago

Hi @vnmakarov

Thank you for having a look at this.

It would be great to have a stand-alone program (with main) in MIR or C to experiment with and run, but I guess it is very hard to produce such a stand-alone program.

Yes, that is a bit hard because the resulting code has dependencies on the runtime.

I have been working on the C intermediate code - and I found that if I further improve the C code (reducing branching on the loop condition) then the ssa version performs better ...

Also, while results from the Ravi code have generally been better on the non-ssa branch, some of the slower plain Lua examples did better on the ssa version. So it seems the ssa branch is not always worse.

vnmakarov commented 4 years ago

I've improved the generated code a bit. The final number of insns for the sieve test on SSA is now lower than for non-SSA. I don't know how it will affect the code's performance.

dibyendumajumdar commented 4 years ago

@vnmakarov Thank you

I have updated Ravi and here are the latest numbers (best out of 3 runs):

SSA

a) Simple loop: (Previous 1.08) Now 0.87 secs
b) Fannkuchen(11): (Previous 5.47) Now 5.37 secs
c) matrix multiplication (1000): (Previous 2.56) Now 2.59 secs
d) mandelbrot: (Previous 2.54) Now 2.55 secs
e) sieve: (Previous 8.93) Now 7.75 secs
f) New Ravi code gen sieve: 5.35 secs

Non-SSA

a) Simple loop: 1.09 secs (this has degraded?)
b) Fannkuchen(11): 5.25 secs
c) matrix multiplication (1000): 2.46 secs
d) mandelbrot: 2.58 secs
e) sieve: 7.42 secs
f) New Ravi code gen sieve: 7.35 secs

My observation from the above: the new Ravi code generator, which increases the use of C stack variables, benefits more from the SSA pipeline.

The SSA pipeline has improved after your changes, but for the older code gen (more branching, less use of C vars) - in these tests at least - the non-SSA branch is still doing better than the new SSA, though it is now close.

dibyendumajumdar commented 4 years ago

@vnmakarov BTW I have not noticed any change in code that uses floating-point ops, such as matrix multiplication.

dibyendumajumdar commented 4 years ago

I have some unexpected perf results with the attached files. Ex2 takes 6.55 secs versus 5.23 secs for Ex1 - yet Ex2 is the 'better' input, in the sense that it uses a C stack value in place of a heap value in one condition.

Any idea why this is happening?

ex1.txt ex2.txt

dibyendumajumdar commented 4 years ago

With my new C code generator, the matrix multiplication test is showing better results. Both tests with the SSA pipeline:

* Old code gen: 2.59 secs

* New code gen: 0.99 secs

vnmakarov commented 3 years ago

I have some unexpected perf results with the attached files. Ex2 takes 6.55 secs versus 5.23 secs for Ex1 - yet Ex2 is the 'better' input, in the sense that it uses a C stack value in place of a heap value in one condition.

Any idea why this is happening?

I have seen unusual performance changes that are hard to explain many times. Basically, modern x86_64 CPUs are black boxes; a lot of things happen hidden inside. It is very hard to predict or explain their behaviour.

Sometimes code and data alignment can have a big effect. Sometimes the state of the out-of-order pipeline can be the culprit. Reading memory in advance can also affect performance.

We just follow the Intel optimization guidelines to generate better code, and sometimes it does not work.

vnmakarov commented 3 years ago

With my new C code generator, the matrix multiplication test is showing better results. Both tests with the SSA pipeline:

* Old code gen: 2.59 secs

* New code gen: 0.99 secs

Wow. Almost 3 times is a big change.

Using local variables (which are translated into MIR regs) is important. But it works only for scalars right now; for structures, memory will be used. Values in dynamic language implementations (e.g., in MRuby) are sometimes small structures. Therefore I am thinking about implementing structure scalarization. But I can start this work only next year.
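A minimal before/after sketch (my own example) of what structure scalarization would do: split a small value struct into independent scalars so its fields become register-allocatable.

```c
typedef struct { double num; int tag; } Value;  /* hypothetical small value */

/* Today: the struct stays in memory, so each field access is a load/store. */
double sum_struct(int n) {
    Value v = { 0.0, 1 };
    for (int i = 0; i < n; i++)
        v.num += i;
    return v.num;
}

/* After scalarization, conceptually: each field is its own scalar variable,
   which the generator can keep in registers (the dead tag then disappears). */
double sum_scalarized(int n) {
    double v_num = 0.0;
    int v_tag = 1;
    (void)v_tag;
    for (int i = 0; i < n; i++)
        v_num += i;
    return v_num;
}
```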