snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0
2.96k stars 298 forks source link

Mastering LuaJIT & reducing sensitivity to side-traces #1100

Open lukego opened 7 years ago

lukego commented 7 years ago

Modest proposal: Let's become true LuaJIT experts!

Specifically let's understand all the relevant details well enough to write a really simple HOWTO for Snabb hackers to write predictably fast code. I bet that we can come up with a very simple answer but that formulating it will be a hard intellectual challenge. "Hard fun!"

This issue is to explore one specific aspect of JIT: how to arrange our code so that we are not sensitive to side-trace behaviour. I will also link a couple of relevant LuaJIT issues where I have been looking into this in the background.

LuaJIT deep dive warning!

Example

Consider this humble piece of code:

for i = 1, numpackets do
   packet.free(link.receive(input))
   ...
end

The first line of the loop looks innocent but the dynamic nature of Lua potentially implies quite a lot of work:

So in principle we need to execute dozens of extra instructions beyond doing our actual work of moving a packet from a link to a freelist.

Present situation

How does this work out in practice? There are really three different times when that work can be done:

  1. On each iteration of the loop. This is expensive. For example it may cost 20 CPU cycles per iteration.
  2. On the first iteration of the loop only. This is cheap because the cost is quickly amortized over the subsequent iterations. If the work costs 20 cycles and you iterate 50 times then average cost is only 0.4 cycles per iteration.
  3. Just once at JIT-compilation time. This is essentially free.

So how does this behave in real Snabb applications? I would say that usually we get the nice second case due to LuaJIT's "loop optimization" that compiles separate code for iteration 1 (slow) and iteration n>1 (fast - reusing work from first iteration.) However, we also often get the nasty first case that repeats work on each iteration due to if branches in our loops that break out of the loop optimization. One example of the fast case is lukego/blog#6 and of the slow case is lukego/blog#8. This uncertainty causes much gnashing of teeth about "side traces" (see e.g. https://github.com/LuaJIT/LuaJIT/issues/218).

Let me illustrate the situation with some data. Here is a performance comparison on snabbnfv showing the impact of disabling the loop optimization entirely (-O-loop) so that the worst case is always triggered:

noloop

This causes around a 50% performance drop. So this is a direct measurement of how dependent that application is to the LuaJIT loop optimization eliminating work.

Possible way forward

So - new idea - how about if we could trigger the third case where the JIT does all of the work at compile time and inserts the results directly into the machine code as constants? This makes intuitive sense because it actually is always the same module definitions that we are looking up in practice and there is no need to repeat this work all the time. If we could eliminate this busy-work entirely then it would make the code run faster even when loop optimization is not applicable and make us less sensitive to the whole side-trace issue.

Sounds good eh? Initial experiments on microbenchmarks are promising (see https://github.com/LuaJIT/LuaJIT/issues/248#issuecomment-273841796) but more investigation is needed. I think the first major step would be to optimize the benchmark shown above i.e. make a Snabb application run fast even when loop optimization is disabled (and without putting lots of boring annotations in the source like local link_transmit = link.transmit etc that I find unacceptably ugly.)

fonghou commented 7 years ago

Hey @lukego, I learned a lot from your write-up like this one!

Have you looked into http://terralang.org ? It uses LuaJIT as compile-time macro language, as well as hosting language at runtime (LuaJIT + llvm).

Cheers!

lukego commented 7 years ago

Thanks for the kind feedback @FongHou!

Terra looks interesting. Lately I am happy simply programming though and I don't feel the need for a metaprogramming language. Snabb-wise I think a great strength of LuaJIT is that you can learn the language so quickly by reading Programming in Lua. If you already know Python/Ruby/Scheme/Smalltalk/etc then you can get up and running very quickly. I am not sure the same is true of "high brow" languages like Terra, Rust, C++, Haskell, etc but maybe that's my own background speaking.

Just need to write an excellent HOWTO for making the tracing JIT do what you want it to & keep fleshing out the associated profiling and benchmarking infrastructure :).