Luke Gorrie's blog

The curse of "high-impact medium-generality" optimizations #21

Open lukego opened 6 years ago

lukego commented 6 years ago

We can rate software optimizations as low/medium/high in terms of their impact and their generality.

Impact is how much difference the optimization makes when it works; generality is how broadly applicable the optimization is across different situations. Thinking about these factors explicitly can be helpful for predicting how beneficial (or harmful!) an optimization will be in practice.

That's putting it mildly, though: this blog entry is really a rant. I hate high-impact medium-generality optimizations. Let me show you three examples to explain why. The first is drawn from CPU design for purposes of illustration; the other two are specific problems that I want to fix in RaptorJIT.

CPUs and memory alignment

CPU hardware tends to operate on fixed-size data at fixed alignments in memory. Memory is accessed in fixed-size blocks (cache lines) with fixed alignment, 64-bit values are stored at 64-bit aligned addresses whenever possible, and so on. CPU hardware optimizations are easier to implement when they can make assumptions about alignment. If the alignment is not known then there are a whole bunch of additional corner cases to worry about.
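To make "unaligned" concrete, here is a minimal sketch using the LuaJIT FFI (my illustration, not part of the hardware discussion above): the second store targets an address that is not a multiple of the access size.

local ffi = require("ffi")

local buf = ffi.new("uint8_t[?]", 72)            -- raw byte buffer; cdata allocations are aligned
local aligned   = ffi.cast("uint64_t*", buf)     -- offset 0: a multiple of 8, the easy case
local unaligned = ffi.cast("uint64_t*", buf + 1) -- offset 1: not a multiple of 8

aligned[0]   = 0x0102030405060708ULL
unaligned[0] = 0x0102030405060708ULL             -- legal on x86, but may straddle boundaries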

So what approaches do CPUs take when it comes to accessing "unaligned" data? There are a few:

  1. Make unaligned access illegal.
  2. Make unaligned access slow.
  3. Make unaligned access fast.

The "low generality" approach is to completely outlaw unaligned memory access. This punts the problem to somebody else: the compiler or the programmer. ARM was famous for this.

The "medium generality" approach is to support unaligned access but to make it slow. This half-punts the problem: the compiler and programmer are free to write straightforward code, but they may be surprised to find that their software runs much faster or slower depending on the addresses of certain data structures, and that may motivate them to write a bunch of tricky special-case code. Intel SIMD instructions have worked this way in earlier silicon implmentations.

The "high generality" approach is to support unaligned access with the same high performance as aligned access. This takes ownership of the problem: memory access is fast, period. Compilers and programmers don't need to worry about the low-level special cases: they can go back to thinking about the problem they actually care about. Intel SIMD instructions mostly work this way in later silicon.

I hope it is obvious that the high-impact high-generality approach is the best by far. The programmer writes straightforward source, the compiler assembles straightforward machine code, and the benchmarks show the performance straightforwardly.

Less obvious is that the high-impact medium-generality approach really sucks. It provides a false sense of security. You can write your nice straightforward code but this punts the complexity to your benchmarks: simple testing will probably exercise the sweet spots, making everything look great, but more thorough testing will reveal uncertainty and unpredictability. (Sad face.) If you care about worst-case performance, and you don't have control over the layout of your data in memory, then this can defeat the whole purpose of having a fast path in the first place: you can't depend on that optimization and you need to constantly worry about it screwing up your performance tests.

(The high-impact low-generality approach is probably fine most of the time, since you won't be tricked into thinking it works when it doesn't, though for this example of simply accessing data in memory it's not much fun.)

Tracing JIT and loop optimization

RaptorJIT (via LuaJIT) has a feature called loop optimization. This is a very powerful optimization. However, it is one of those awful "high-impact medium-generality" features. It is completely awesome when it works, but it is really hard to predict whether it will be there when you need it.

The idea of loop optimization is simple: on the first iteration of a loop you simply execute the loop body, but for the later iterations you skip anything that you already did in the first iteration. This cuts out a whole bunch of work: assertions that don't need to be rechecked, calculation results that are already in registers, side-effects that have already been effected, and so on.

The gotcha is that it is only applicable when the later iterations execute exactly the same way as the first one. Your control flow has to be the same, so you need to always be taking the same branch of your if-then-else statements, and the types of your local variables have to be the same too. On any iteration where these conditions don't hold you will get an exit to a slow path that runs without the optimizations.
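For instance, a loop like this one (a sketch of mine, not from the benchmark below) can fall off the fast path purely because of its data:

local flags = {}
for i = 1, 100 do flags[i] = (i % 3 == 0) end

local count = 0
for i = 1, 100 do
   if flags[i] then        -- if this branch flips from one iteration to the
      count = count + 1    -- next, the compiled trace exits to a slower path
   end
end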

Let us look at a simple example of loop optimization. Here is a simple Lua program with a loop that repeatedly stores the current loop counter into a global variable:

for i = 1, 100 do
   myglobal = i
end

Here is the machine code for the first iteration of the loop:

12a0eff65  mov dword [0x00021410], 0x1
12a0eff70  cvttsd2si ebp, [rdx]
12a0eff74  mov ebx, [rdx-0x8]
12a0eff77  mov ecx, [rbx+0x8]
12a0eff7a  cmp dword [rcx+0x1c], +0x3f
12a0eff7e  jnz 0x12a0e0010  ->0
12a0eff84  mov edx, [rcx+0x14]
12a0eff87  mov rdi, 0xfffffffb00030620
12a0eff91  cmp rdi, [rdx+0x320]
12a0eff98  jnz 0x12a0e0010  ->0
12a0eff9e  lea eax, [rdx+0x318]
12a0effa4  cmp dword [rcx+0x10], +0x00
12a0effa8  jnz 0x12a0e0010  ->0
12a0effae  xorps xmm0, xmm0
12a0effb1  cvtsi2sd xmm0, ebp
12a0effb5  movsd [rax], xmm0
12a0effb9  test byte [rcx+0x4], 0x4
12a0effbd  jz 0x12a0effd4
12a0effbf  and byte [rcx+0x4], 0xfb
12a0effc3  mov edi, [0x000213f4]
12a0effca  mov [0x000213f4], ecx
12a0effd1  mov [rcx+0xc], edi
12a0effd4  add ebp, +0x01
12a0effd7  cmp ebp, +0x64
12a0effda  jg 0x12a0e0014   ->1

That is quite a lot, eh! That's because we are finding the globals hashtable and locating the myglobal slot. This requires quite a few instructions to do correctly: since this is a dynamic language, the table could have been updated or replaced after the machine code was compiled, and we need to detect that to know whether the hashtable slot we want to look in is still valid, and so on.

Here is what the subsequent iterations look like thanks to loop optimization:

->LOOP:
12a0effe0  xorps xmm7, xmm7
12a0effe3  cvtsi2sd xmm7, ebp
12a0effe7  movsd [rax], xmm7
12a0effeb  add ebp, +0x01
12a0effee  cmp ebp, +0x64
12a0efff1  jle 0x12a0effe0  ->LOOP
12a0efff3  jmp 0x12a0e001c  ->3

That's a bit better, right? We take our loop index in ebp, convert that into a Lua number (double float) in xmm7, store that in the hashtable slot whose address we already loaded into rax on the first iteration, and then we bump our index and check for loop termination.

Great stuff, right? Wrong! The trouble is that loop optimization is only medium-generality. There are myriad ways that we can tweak this simple example to throw a spanner in the works. All it takes is to have some variation in either the types used in the loop or the branches that are taken.

Let me show you a fun example of screwing it up:

-- Fast version
for i = 1, 1e9 do
   -- Store loop index in global
   myglobal = i         
end

-- Slow version
for i = 1, 1e9 do
   -- Store "is loop index even?" flag
   myglobal = i%2 == 0
end

You might reasonably expect that both of these loops would run at about the same speed, with the second one paying a modest price for doing some extra arithmetic. But you would be wrong, wrong, wrong, wrong, wrong. On both counts.

The first version averages 1.5 cycles (6 instructions) per iteration while the second version averages 9.1 cycles (22.5 instructions) per iteration. That's a six times slowdown.

The reason is not extra arithmetic: it's types. Let me explain in two halves. First, the JIT uses a statically typed intermediate representation for generating machine code: it always knows the exact type of the value in each register. Second, the JIT does not have the concept of a boolean type: a value can be typed as true or as false, but not as true-or-false. This means that the JIT needs to generate two copies of the loop body, one for even iterations and one for odd iterations, and the constant branching between them wipes out the benefit of loop optimization.

Clear as mud?
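In this particular case the fix is easy once you know what's going on: keep the stored value at a single type. Here is a sketch of the kind of rewrite that avoids the true/false split (mine, and I haven't benchmarked it here):

for i = 1, 1e9 do
   -- i % 2 is always a number (0 or 1), so the stored value has a single
   -- type and the trace does not have to split on true vs false
   myglobal = i % 2
end

The point of the rant, of course, is that you shouldn't need to know this trick in the first place.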

I see this as a massive problem that I want to fix in RaptorJIT. I don't want to have high-impact medium-generality optimizations. It's hostile to users and it violates the principle of least astonishment. I am thinking very hard about how to solve this.

One approach would be to upgrade the feature to high-impact high-generality. This would require making the optimization effective even when the branches and types are diverse. This seems superficially like a hard problem but may have a pragmatic solution.

Another approach would be to downgrade the feature to medium-impact medium-generality. This would be done by making separate optimizations to speed up the bad case so that the relative importance of loop optimization is less. Then application developers don't have to worry so much about whether loop optimization "kicks in" because they will get pretty fast code either way. This could be done by doing more of the work during the JIT phase before the program runs: perhaps we insert the hashtable slot address directly into the machine code, without any safety checks, and separately ensure that resizing the hashtable causes the machine code to be flushed and replaced.

Life is not boring when you are maintaining a tracing JIT, let me tell you.

Tracing JIT and allocation sinking

"Allocation sinking" is another problematic high-impact medium-generality optimization. It allows you to create complex objects and have them stored purely in registers, without ever allocating on the heap or running the garbage collector. Except when it doesn't... which can be very hard to predict in practice.

I have a plan to downgrade allocation sinking to medium-impact medium-generality by optimizing the slow path. Specifically, I want to switch the heap from using 64-bit words to using 96-bit words so that typed pointers can be stored without a heap allocation (boxing). However, this blog entry has become rather long, so if you want to know about that you'd better look at https://github.com/raptorjit/raptorjit/pull/93.

Summary

There is no permanent place in the world for high-impact medium-generality optimizations. They are simply too hostile to users.

In the long run you need to either make them very general, so that users can depend on them working, or separately optimize the edge cases to soften the impact when the fast path does not apply.

This is a major topic in RaptorJIT development! You are welcome to come and help with that project if you like :-)

P.S. Please leave a comment with your most hated examples of high-impact medium-generality optimizations in other domains!

charlie-ht commented 6 years ago

I feel like high-impact medium-generality describes C++. ;)

Definitely more examples in loop optimizations than just loop-invariant code motion. I've had a lot of pain in the past trying to get my compiler to generate what I felt it should; often that's because I'm wrong, but sometimes it's just because of medium generality.

The more general problem in production compilers, I feel, is the liberal use of heuristics to guide the optimizations, and those things are essentially just random numbers someone hopefully got right from detailed micro-benchmarking. That feels like a necessary evil, but when you sit down and try to form a neat and reusable (for your future self) conception of how it works, you can't, which is definitely uncomfortable.

jimblandy commented 6 years ago

I wholeheartedly agree that the performance cliffs these optimizations create are extremely difficult for developers to understand and avoid reliably. Almost all the optimizations that make modern JavaScript usable fall into this category, and JavaScript perf is notoriously brittle.

But if you're targeting machine code, and you want the developer to be able to predict how fast their code will run, to me that sounds a lot like saying that a) there is a straightforward mapping between source language and machine language features, and b) the compiler can generally be trusted to follow that mapping. In other words, it sounds like C (before it got quite so drunk on undefined behavior). If your source language is highly dynamic, there's no obvious fit onto machine features, and it's not clear what sort of promises you can craft for developers to learn and internalize. (And I guess machine code can be highly dynamic, but it has a rather different character, no?)

tl; dr: Sure, we all like ponies. But if you do figure something out, it'll be a big hit!

jimblandy commented 6 years ago

I'm not familiar with Lua, so I'm making some assumptions here, but:

I'd argue some of the problem lies with the language, in that it gives you a global hash table instead of global variables. Developers naturally simplify their mental model to treat them as global variables, because that's what they want and need, and because it's horrible style to actually use the global scope as a hash table. And thus arises a gap between the developer's expectations ("this assignment is just a store at the machine level") and the compiler's obligations ("this is a hash table lookup, whose bucket we can probably predict with high impact and medium generality").

Any time the language specifies something very flexible, which is then used heavily in a restricted way (objects in JavaScript are another example), it invites disappointment when the developer internalizes the restricted case and assumes it will always be handled correctly. And it stems from the language design. The JIT is just trying to salvage a difficult situation.

catwell commented 6 years ago

In practice, an experienced Lua developer would probably not use global variables like this anyway (there is tooling to avoid it).

In recent Lua versions (since 5.2) global variables don't really exist anymore, and the global scope is just the default environment. However LuaJIT is still stuck with Lua 5.1 semantics regarding this.

The second issue (true and false being different types in LuaJIT, and the use of different types breaking traces) is a lot more annoying.

The Lua community has been aware of this LuaJIT "high-impact medium-generality" issue for a while. The LuaJIT approach is both to try to reduce the impact by having a fast interpreter mode, and to increase the generality by trying to have fewer things break traces.

However, LuaJIT performance is still harder to predict than that of a static, compiled language, and among other things this has led to LabLua starting work on the Titan language (see the Workshop talk).

lukego commented 6 years ago

@catwell Thanks for the links and the "real" Lua perspective!

My plan with RaptorJIT is to start from LuaJIT, take away the sharp edges, add good tooling, and add a few key optimizations e.g. make globals as fast as locals and make 64-bit values/pointers unboxed. Then I have the language that I want :). Meanwhile LuaJIT is already very close.
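For readers outside the Lua world: "make globals as fast as locals" would remove the need for the usual caching idiom, where you hoist globals into locals before a hot loop (my illustration, not a RaptorJIT feature):

local sin = math.sin           -- hoist the global lookup out of the loop
local total = 0
for i = 1, 1e6 do
   total = total + sin(i)      -- the loop body only touches local slots
end
print(total)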

If anybody else is interested in taking the LuaJIT approach forward they are welcome to get in touch with me and join forces :). I have come quite far already and will make a presentation at some point.

lukego commented 6 years ago

@jimblandy Sounds like a balanced take :). I do think that I have a couple of new things figured out, which can make a big improvement on LuaJIT in my domain (low-level system programming), and I hope those will indeed hit it big. We will see, watch this space :).

lukego commented 6 years ago

@catwell The Titan paper talks about LuaJIT having a 10x difference between JITed and interpreted code. Interesting! I would ballpark it more like 100x. I see around a 10x difference between good and bad cases within the JIT.

I see the interpreter as something that is used during startup and for handling exceptions. Even the LuaJIT one is simply way too slow for our applications. The real performance impact of the interpreter is in maintainability e.g. how much time does maintaining the interpreter distract me from optimizing the JIT, how often does the complexity of the interpreter implementation prevent me from trying out a new idea, etc. So I want to replace the complex asm interpreter with a simple one in C... which I believe will make it easier for us to speed up real applications.

lukego commented 6 years ago

NB: Just a couple of weeks after this blog entry I chased a performance problem here: https://github.com/raptorjit/raptorjit/pull/150. This was an optimization that had introduced a separate fast-path and slow-path, which was invisible and unpredictable to the application programmer, and where the "fast-path" code had bit-rotten over ~ 7 years to actually run slower than the "slow-path."