nim-lang / RFCs

A repository for your Nim proposals.

[RFC] [performance] {.inline.} guarantees, `--stacktrace:inline`, and `--ignoreInline` #198

Open timotheecour opened 4 years ago

timotheecour commented 4 years ago

Some of these (especially P1) will result in a large speedup for builds with `--stacktrace:off` (e.g. debug builds)

links

RFC proposals

We should investigate to make sure it works as intended, and whether `__attribute__((always_inline))` is more appropriate (cf. hints vs guarantees)

// in inline_definitions.h
inline double dabs(double x) { return x < 0.0 ? -x : x; }

// in exactly one compilation unit, inline_definitions.c
extern inline double dabs(double x);

Nowadays, where the C99 standard still isn’t completely implemented in many compilers, many of them still get it wrong. Gcc corrected its first approach and now (since version 4.2.1) works with the method above. Others that mimic gcc, like opencc or icc, are still stuck on the old model that gcc had once invented. The include file p99_compiler.h of P99 helps you make sure that your code will work even with these compilers.

Clyybber commented 4 years ago

So we make debugging harder and introduce a bunch of new arguments/pragmas that make everything unnecessarily more complex, instead of just using the already existing `{.push stackTrace: off.}` when you really need it.
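For reference, a minimal sketch of what that existing mechanism looks like in source (the proc name is illustrative):

```nim
# disable stack-trace instrumentation only around a known hot spot
{.push stackTrace: off.}
proc hotInner(x: int): int {.inline.} =
  x * 2  # no frame push/pop is generated for this proc
{.pop.}

echo hotInner(21)
```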

timotheecour commented 4 years ago

So your solution, when a program (e.g. Nim itself) compiled with `--stacktrace:on` is too slow, is to:

You can't just permanently add `{.push stackTrace: off.}` in each suspected location where there could be a bottleneck; that would hurt the debuggability of those locations, with no easy way to get traces back (unless, again, you edit the sources).

I don't see how that's practical, and it's obviously much more complex for users than what I've proposed and implemented: simply compile with `--stacktrace:on/off/inline` depending on what stacktraces you want. This should cover the vast majority of cases, because inline procs happen to be the performance-sensitive ones (e.g. very basic things like math operations). `{.push stackTrace: on.}` is always an option beyond that (profile-based tuning using PGO could also be an option, but that's out of scope for this discussion).

So we make debugging harder

No, there's a mode that preserves the pre-existing semantics; it's just a question of whether the old or the new mode should be the default.

Araq commented 4 years ago

All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline.

timotheecour commented 4 years ago

Well, I've updated https://github.com/nim-lang/Nim/pull/13582 so that existing semantics are unchanged; instead, the new `--stacktrace:noinline` can give you large speedups while typically producing identical stacktraces, i.e. a good compromise between speed and debuggability. See https://github.com/nim-lang/Nim/pull/13582#issuecomment-594617950 for details.

timotheecour commented 4 years ago

I'd be happier with templates instead of inline procs as inline procs are currently not inlined in debug builds (from https://github.com/nim-lang/Nim/pull/13536#issuecomment-593911621)

All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline.

I did some further investigation; here are the conclusions, using clang on macOS (they should mostly generalize to gcc on Linux):

* change `N_INLINE` from `inline` to `__attribute__((__always_inline__))` (a per-proc alternative via `codegenDecl` is sketched after this list):
  `./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on -f compiler/nim.nim`
  `bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim`
  => 113.9 seconds

* however, in combination with `--passC:-Og` (roughly equivalent to `--passC:-O1`), `__attribute__((__always_inline__))` does do inlining whereas `inline` does not.

* passing `--passC:-O` or `--opt:speed` greatly improves the speed of debug builds (builds without `-d:release` or `-d:danger`):
  `./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on --opt:speed -f compiler/nim.nim`
  `bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim`
  => 19 seconds
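As a narrower experiment than patching `N_INLINE` in nimbase.h globally, one could presumably force the attribute on a single proc via the `codegenDecl` pragma. This is only a sketch: the attribute spelling is gcc/clang specific, `dabs` is just an illustrative name, and I haven't verified how this interacts with the `{.inline.}` calling convention or other backends:

```nim
# apply the C attribute to this one proc's generated declaration,
# instead of editing N_INLINE in nimbase.h for everything
proc dabs(x: float): float {.codegenDecl: "__attribute__((always_inline)) $# $#$#".} =
  result = (if x < 0.0: -x else: x)

echo dabs(-3.5)  # the generated C declaration of `dabs` carries the attribute
```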

here's a minimal but illustrative example:
```nim
when defined(case5):
  import times
  when defined(case_normal):
    proc fun[T](x, i: T): T = x + i
  when defined(case_inline):
    proc fun[T](x, i: T): T {.inline.} = x + i
  when defined(case_noinline):
    proc fun[T](x, i: T): T {.noinline.} = x + i
  when defined(case_alwaysinline):
    # modify `N_INLINE` in nimbase.h and replace `inline` by `__attribute__((__always_inline__))`
    proc fun[T](x, i: T): T {.inline.} = x + i
  when defined(case_template):
    template fun[T](x, i: T): T = x + i

  proc main() =
    var x: uint = 1
    let n: uint = 1_000_000_000
    var i = uint(0)
    while true:
      x = fun(x,i)
      if i == n: break
      i = i + 1
    doAssert x != 1234
  let t = cpuTime()
  main()
  echo cpuTime() - t

when defined(case5_runner):
  import os,strformat
  proc fun(opt: string) =
    const nim = getCurrentCompilerExe()
    let input = currentSourcePath
    let cmd = fmt"{nim} c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f {opt} {input}"
    echo cmd
    doAssert execShellCmd(cmd) == 0

  for opt2 in ["", "--passC:-Og", "--passC:-O", "-d:danger"]:
    echo "\nopt2: ", opt2
    fun fmt"-d:case_normal {opt2}"
    fun fmt"-d:case_inline {opt2}"
    fun fmt"-d:case_noinline {opt2}"
    fun fmt"-d:case_template {opt2}"
    # fun fmt"-d:case_alwaysinline {opt2}"

```

performance numbers for debug builds

as you can see:

each timing below is the output of `nim c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:<case> <opt2> t10303.nim` (times in seconds):

| opt2 | -d:case_normal | -d:case_inline | -d:case_noinline | -d:case_template |
| --- | --- | --- | --- | --- |
| `--passC:-Og` | 0.979641 | 1.172885 | 0.964913 | 9.999999999992654e-07 |
| `--passC:-O` | 9.999999999992654e-07 | 1.000000000000133e-06 | 0.978355 | 2.000000000000265e-06 |
| `-d:danger` | 1.000000000000133e-06 | 1.000000000000133e-06 | 0.940134 | 2.000000000000265e-06 |



* none of this affects the validity of https://github.com/nim-lang/Nim/pull/13582, since that PR is about improving the speed of code with stacktraces even for optimized builds (it improves speed in all optimization modes)

## links
* https://indico.cern.ch/event/386232/sessions/159923/attachments/771039/1057534/always_inline_performance.pdf
* https://aras-p.info/blog/2017/10/09/Forced-Inlining-Might-Be-Slow/
Araq commented 4 years ago

It's not about speed really, .inline should inline. If it results in slower code then we misapplied the .inline pragma. It's about being more predictable.

timotheecour commented 4 years ago

It's not about speed really, .inline should inline. If it results in slower code then we misapplied the .inline pragma. It's about being more predictable.

This would be bad for performance. It's not as simple as that: at the proc declaration site there is very limited context to decide whether to inline, unlike the C compiler, which has much more relevant context (compile flags, params resolved to concrete types, the candidate's insertion inside an inlining context); hence the articles I linked in the previous post, which show that force_inline may be slower than inline.

In fact, the same inline proc can end up being inlined or not in different translation units (perhaps even within 1 translation unit) depending on context, and that's a good thing for performance.

I actually started with a branch (which I decided not to submit as a PR) where I introduced another pragma, {.alwaysInline.}, with the semantics:

but I realized it's really not the main thing at play here; what made all the difference was passing `--opt:speed` (or `--passC:-O`, or even `--passC:-Og`).

Speed should be the main criterion here; correctness is already guaranteed by the calling convention regardless of whether the proc ends up being inlined (e.g. there won't be link errors or multiple-definition errors either way).

Araq commented 4 years ago

I have nothing to add really. inline should inline, period. It's a way to override the compiler's optimizer by saying "I do know better, trust me". There are no other reasons for inline's existence.

timotheecour commented 4 years ago

There are no other reasons for inline's existence

Yes, there is a very important one: it tells the compiler that, when crossing module boundaries, the foreign function may be inlined. With optimizations turned on (e.g. -d:danger):

here's a minimal test case showing this:
`git clone https://github.com/timotheecour/vitanim && cd vitanim`
`nim c -r testcases/t10334.nim` # runs all tests, see https://github.com/timotheecour/vitanim/blob/master/testcases/t10334.nim

-d:case_normal_same_module: 0.618448226
-d:case_normal: 12.580990924
-d:case_inline: 0.586698406
-d:case_generic: 13.3243305
-d:case_generic_empty: 13.505951988
-d:case_template: 0.605310353

This shows that `-d:case_normal` (no `{.inline.}`) was not inlined despite the `-d:danger` flag, unlike `-d:case_inline`, resulting in a 20x slowdown.
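For readers who don't want to clone the repo, here is a stripped-down sketch of the cross-module setup (file names are hypothetical; the loop mirrors the earlier benchmark):

```nim
# --- mylib.nim (hypothetical helper module) ---
proc funPlain*(x, i: uint): uint = x + i              # plain proc: a cross-module call can only be inlined via LTO
proc funInline*(x, i: uint): uint {.inline.} = x + i  # {.inline.}: body is emitted into each calling module's C file

# --- main.nim, compiled with e.g. `nim c -d:danger -r main.nim` ---
import times
import mylib

proc main() =
  var x: uint = 1
  let n: uint = 1_000_000_000
  var i: uint = 0
  while true:
    x = funInline(x, i)  # swap in funPlain to compare timings
    if i == n: break
    i = i + 1
  doAssert x != 1234

let t = cpuTime()
main()
echo cpuTime() - t
```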

So `{.inline.}` is very useful to allow the C compiler to optimize across module boundaries, but mapping `{.inline.}` to always_inline is not a good idea, as it can have negative performance consequences depending on the context in which the inlined function is used.

Otherwise you'll end up in a situation where there are only bad choices:

Araq commented 4 years ago

Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue.

Araq commented 4 years ago

As an alternative we could have hot, cold and inline, so today's inline would become hot and inline should always inline.

timotheecour commented 4 years ago

Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue.

`--passc:-flto` (I had originally tried `--passl:-flto`, but that one did not help) improves cross-module inlining of procs not marked as `{.inline.}` a lot, but performance is still measurably worse than with `{.inline.}` procs: it's about 1.35x slower:

-d:case_normal_same_module: 0.5746762430000001
-d:case_normal: 0.7599413070000001
-d:case_inline: 0.5576130500000001
-d:case_generic: 0.7723141130000001
-d:case_generic_empty: 0.7951715500000001
-d:case_template: 0.5691230700000001

furthermore, `--passc:-flto` slows compile times drastically, by a factor of 3:

nim c -o:/tmp/D20200310T123701 --skipParentCfg --skipUserCfg --hints:off -d:danger -f compiler/nim.nim
13s
nim c -o:/tmp/D20200310T123701 --skipParentCfg --skipUserCfg --hints:off -d:danger --passc:-flto -f compiler/nim.nim
40s

conclusion:

As an alternative we could have hot, cold and inline, so today's inline would become hot and inline should always inline.

I want it the other way around: the common case is to let the C compiler do what it's good at (deciding what should be inlined, based on the whole context), and provide an option to force inlining for the very rare case where we want to override the compiler's smarts. That's pretty much exactly what my unsubmitted PR is about: see https://github.com/timotheecour/Nim/pull/57 (you can see a clean diff by selecting a range of the last few commits only)

proc fun() {.inline.} = discard # same semantics as before, common case
proc fun() {.alwaysinline.} = discard # forces inlining, mapping to compiler-specific attributes

Maybe I can rename it to `{.forceinline.}`; it's a trivial change. Note that `alwaysinline` as I introduced it does not add a new calling convention: it still uses the `ccInline` calling convention, but maps to `__attribute__((__always_inline__))` / `__forceinline`.

Araq commented 4 years ago

No, I cannot provide such an example (for now!), so the status quo seems sufficient. But I personally do not like non-inlined inline functions; I think I know what I'm doing when I use the inline keyword. To fight misapplications of inline I'd rather have better introspection into the asm code and better profiling support, instead of compilers ignoring my annotations.

furthermore, `--passc:-flto` slows compile times drastically, by a factor of 3

Actually a factor of 3 is surprisingly good IMO. And compile times are insignificant for release builds of software that is shipped to thousands of customers.

timotheecour commented 4 years ago

At least we now agree that we need (at least) two inlining annotations, whether it's my recommendation (hint = `{.inline.}`, force = `{.forceinline.}`) or yours (hint = `{.hot.}`, force = `{.inline.}`), for the specific case of allowing cross-module inlining.

I think I know what I'm doing when I use the inline keyword.

pgo for inlining decisions

Nim could learn to map profiler output to force-inline, force-noinline, or hint-inline (or other optimization strategies, e.g. branch prediction / likely / unlikely) on hotspots; that's definitely something that should be done at some point and wouldn't even be too hard to do. All you'd need is:

This can be made robust so you don't have to rerun PGO after source code changes (which lead to different mangled cgen'd names); there are lots of possibilities there.

compile times are insignificant for release builds of software that is shipped to thousands of customers

Of course, but that's very use-case dependent (e.g. during development you may want to balance compile time vs performance). And it doesn't always help despite the 3x compile-time slowdown; e.g. I just tried it, and nim built with `-d:danger --passc:-flto` has the same performance as nim built with `-d:danger`.

links

arnetheduck commented 2 years ago

Nim handles stack traces by explicitly adding redundant code that pushes and pops data, adding code size and memory traffic. In the world of stack traces this is not at all necessary: the CPU already records all the information needed to generate a stack trace as part of its "normal" control flow, and this information can be extracted using https://github.com/status-im/nim-libbacktrace

There's no way around the fact that the current implementation will always be ridiculously slow compared to simply reading the stack with libbacktrace, so the feature should really focus on debuggability rather than performance; as such, complicating things with special rules for inlining etc. seems counterproductive.
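To make the alternative concrete, here is roughly how the libbacktrace approach is wired up, going by the nim-libbacktrace README; the exact flags and the import are stated from memory and should be treated as assumptions:

```nim
# sketch: let nim-libbacktrace provide stack traces from debug info,
# instead of Nim's generated frame push/pop code
# assumed build flags (see the nim-libbacktrace README): --debugger:native -d:nimStackTraceOverride
when defined(nimStackTraceOverride):
  import libbacktrace  # registers itself via Nim's stack-trace override hook

proc boom() =
  raise newException(ValueError, "demo")  # the printed trace comes from the override, not from Nim's frame records

boom()
```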