nim-lang / RFCs

A repository for your Nim proposals.

[RFC] [performance] {.inline.} guarantees, `--stacktrace:inline`, and `--ignoreInline` #198

Open timotheecour opened 4 years ago

timotheecour commented 4 years ago

Some of these (especially P1) will result in a large speedup for builds with `--stacktrace:off` (e.g. debug builds)

links

RFC proposals

We should investigate to make sure it works as intended, and whether `__attribute__((always_inline))` is more appropriate (cf. hints vs guarantees)

// in inline_definitions.h
inline double dabs(double x) { return x < 0.0 ? -x : x; }

// in exactly one compilation unit, inline_definitions.c
extern inline double dabs(double x);

Nowadays, where the C99 standard still isn’t completely implemented in many compilers, many of them still get it wrong. Gcc corrected its first approach and now (since version 4.2.1) works with the method above. Others that mimic gcc, like opencc or icc, are still stuck on the old model that gcc had once invented. The include file p99_compiler.h of P99 helps you make sure that your code will work even with these compilers.

Clyybber commented 4 years ago

So we make debugging harder and introduce a bunch of new arguments/pragmas that make everything unnecessarily more complex, instead of just using the already existing `{.push stackTrace: off.}` when you really need it.
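For reference, a minimal sketch of what that existing mechanism looks like in source (the proc name is illustrative):

```nim
# disable stack-trace instrumentation only around a known hot spot
{.push stackTrace: off.}
proc hotInner(x: int): int {.inline.} =
  x * 2  # no frame push/pop is generated for this proc
{.pop.}

echo hotInner(21)
```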

timotheecour commented 4 years ago

So your solution, when a program (e.g. Nim itself) compiled with `--stacktrace:on` is too slow, is to:

You can't just permanently add `{.push stackTrace: off.}` in each suspected location where there could be a bottleneck; that would hurt the debuggability of those locations, with no easy way to get traces back (unless, again, you edit the sources).

I don't see how that's practical, and it's obviously much more complex for users than what I've proposed and implemented: simply compile with `--stacktrace:on/off/inline` depending on what stacktraces you want. This should cover the vast majority of cases, because inline procs happen to be the performance-sensitive ones (e.g. very basic things like math operations). `{.push stackTrace: on.}` is always an option beyond that (profile-based tuning using PGO could also be an option, but that's out of scope for this discussion).

So we make debugging harder

No, there's a mode that preserves the pre-existing semantics; it's just a question of whether the old or the new mode should be the default.

Araq commented 4 years ago

All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline.

timotheecour commented 4 years ago

Well, I've updated https://github.com/nim-lang/Nim/pull/13582 so that existing semantics are unchanged; instead, the new `--stacktrace:noinline` can give you large speedups while typically producing identical stacktraces, i.e. a good compromise between speed and debuggability. See https://github.com/nim-lang/Nim/pull/13582#issuecomment-594617950 for details.

timotheecour commented 4 years ago

I'd be happier with templates instead of inline procs as inline procs are currently not inlined in debug builds (from https://github.com/nim-lang/Nim/pull/13536#issuecomment-593911621)

All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline.

I did some further investigation; here are the conclusions, using clang on macOS (they should mostly generalize to gcc on Linux):

* change `N_INLINE` from `inline` to `__attribute__((__always_inline__))` (a per-proc alternative via `codegenDecl` is sketched after this list):
  `./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on -f compiler/nim.nim`
  `bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim`
  => 113.9 seconds

* however, in combination with `--passC:-Og` (roughly equivalent to `--passC:-O1`), `__attribute__((__always_inline__))` does do inlining whereas `inline` does not.

* passing `--passC:-O` or `--opt:speed` greatly improves the speed of debug builds (builds without `-d:release` or `-d:danger`):
  `./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on --opt:speed -f compiler/nim.nim`
  `bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim`
  => 19 seconds
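As a narrower experiment than patching `N_INLINE` in nimbase.h globally, one could presumably force the attribute on a single proc via the `codegenDecl` pragma. This is only a sketch: the attribute spelling is gcc/clang specific, `dabs` is just an illustrative name, and I haven't verified how this interacts with the `{.inline.}` calling convention or other backends:

```nim
# apply the C attribute to this one proc's generated declaration,
# instead of editing N_INLINE in nimbase.h for everything
proc dabs(x: float): float {.codegenDecl: "__attribute__((always_inline)) $# $#$#".} =
  result = (if x < 0.0: -x else: x)

echo dabs(-3.5)  # the generated C declaration of `dabs` carries the attribute
```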

here's a minimal but illustrative example:
```nim
when defined(case5):
  import times
  when defined(case_normal):
    proc fun[T](x, i: T): T = x + i
  when defined(case_inline):
    proc fun[T](x, i: T): T {.inline.} = x + i
  when defined(case_noinline):
    proc fun[T](x, i: T): T {.noinline.} = x + i
  when defined(case_alwaysinline):
    # modify `N_INLINE` in nimbase.h and replace `inline` by `__attribute__((__always_inline__))`
    proc fun[T](x, i: T): T {.inline.} = x + i
  when defined(case_template):
    template fun[T](x, i: T): T = x + i

  proc main() =
    var x: uint = 1
    let n: uint = 1_000_000_000
    var i = uint(0)
    while true:
      x = fun(x,i)
      if i == n: break
      i = i + 1
    doAssert x != 1234
  let t = cpuTime()
  main()
  echo cpuTime() - t

when defined(case5_runner):
  import os,strformat
  proc fun(opt: string) =
    const nim = getCurrentCompilerExe()
    let input = currentSourcePath
    let cmd = fmt"{nim} c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f {opt} {input}"
    echo cmd
    doAssert execShellCmd(cmd) == 0

  for opt2 in ["", "--passC:-Og", "--passC:-O", "-d:danger"]:
    echo "\nopt2: ", opt2
    fun fmt"-d:case_normal {opt2}"
    fun fmt"-d:case_inline {opt2}"
    fun fmt"-d:case_noinline {opt2}"
    fun fmt"-d:case_template {opt2}"
    # fun fmt"-d:case_alwaysinline {opt2}"

```

performance numbers for debug builds

as you can see:

each timing below is the output of `nim c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:<case> <opt2> t10303.nim` (times in seconds):

| opt2 | -d:case_normal | -d:case_inline | -d:case_noinline | -d:case_template |
| --- | --- | --- | --- | --- |
| `--passC:-Og` | 0.979641 | 1.172885 | 0.964913 | 9.999999999992654e-07 |
| `--passC:-O` | 9.999999999992654e-07 | 1.000000000000133e-06 | 0.978355 | 2.000000000000265e-06 |
| `-d:danger` | 1.000000000000133e-06 | 1.000000000000133e-06 | 0.940134 | 2.000000000000265e-06 |



* none of this affects the validity of https://github.com/nim-lang/Nim/pull/13582, since that PR is about improving the speed of code with stacktraces even for optimized builds (it improves speed in all optimization modes)

## links
* https://indico.cern.ch/event/386232/sessions/159923/attachments/771039/1057534/always_inline_performance.pdf
* https://aras-p.info/blog/2017/10/09/Forced-Inlining-Might-Be-Slow/
Araq commented 4 years ago

It's not about speed really, .inline should inline. If it results in slower code then we misapplied the .inline pragma. It's about being more predictable.

timotheecour commented 4 years ago

It's not about speed really, .inline should inline. If it results in slower code then we misapplied the .inline pragma. It's about being more predictable.

This would be bad for performance. It's not as simple as that: at the proc declaration site there is very limited context to decide whether to inline, unlike the C compiler, which has much more relevant context (compile flags, params resolved to concrete types, the candidate's insertion inside an inlining context); hence the articles I linked in the previous post, which show that force_inline may be slower than inline.

In fact, the same inline proc can end up being inlined or not in different translation units (perhaps even within 1 translation unit) depending on context, and that's a good thing for performance.

I actually started with a branch (which I decided not to submit as a PR) where I introduced another pragma, {.alwaysInline.}, with the semantics:

but I realized it's really not the main thing at play here; what made all the difference was passing `--opt:speed` (or `--passC:-O`, or even `--passC:-Og`).

Speed should be the main criterion here; correctness is already guaranteed by the calling convention regardless of whether the proc ends up being inlined (e.g. there won't be link errors or multiple-definition errors either way).

Araq commented 4 years ago

I have nothing to add really. inline should inline, period. It's a way to override the compiler's optimizer by saying "I do know better, trust me". There are no other reasons for inline's existence.

timotheecour commented 4 years ago

There are no other reasons for inline's existence

Yes, there is a very important one: it tells the compiler that, when crossing module boundaries, the foreign function may be inlined. With optimizations turned on (e.g. -d:danger):

here's a minimal test case showing this:
`git clone https://github.com/timotheecour/vitanim && cd vitanim`
`nim c -r testcases/t10334.nim` # runs all tests, see https://github.com/timotheecour/vitanim/blob/master/testcases/t10334.nim

-d:case_normal_same_module: 0.618448226
-d:case_normal: 12.580990924
-d:case_inline: 0.586698406
-d:case_generic: 13.3243305
-d:case_generic_empty: 13.505951988
-d:case_template: 0.605310353

This shows that `-d:case_normal` (no `{.inline.}`) was not inlined despite the `-d:danger` flag, unlike `-d:case_inline`, resulting in a 20x slowdown.
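For readers who don't want to clone the repo, here is a stripped-down sketch of the cross-module setup (file names are hypothetical; the loop mirrors the earlier benchmark):

```nim
# --- mylib.nim (hypothetical helper module) ---
proc funPlain*(x, i: uint): uint = x + i              # plain proc: a cross-module call can only be inlined via LTO
proc funInline*(x, i: uint): uint {.inline.} = x + i  # {.inline.}: body is emitted into each calling module's C file

# --- main.nim, compiled with e.g. `nim c -d:danger -r main.nim` ---
import times
import mylib

proc main() =
  var x: uint = 1
  let n: uint = 1_000_000_000
  var i: uint = 0
  while true:
    x = funInline(x, i)  # swap in funPlain to compare timings
    if i == n: break
    i = i + 1
  doAssert x != 1234

let t = cpuTime()
main()
echo cpuTime() - t
```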

So `{.inline.}` is very useful to allow the C compiler to optimize across module boundaries, but mapping `{.inline.}` to always_inline is not a good idea, as it can have negative performance consequences depending on the context in which the inlined function is used.

Otherwise you'll end up in a situation where there are only bad choices:

Araq commented 4 years ago

Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue.

Araq commented 4 years ago

As an alternative we could have hot, cold and inline, so today's inline would become hot and inline should always inline.

timotheecour commented 4 years ago

Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue.

`--passc:-flto` (I had originally tried `--passl:-flto`, but that one did not help) improves cross-module inlining of procs not marked as `{.inline.}` a lot, but performance is still measurably worse than with `{.inline.}` procs: it's about 1.35x slower:

-d:case_normal_same_module: 0.5746762430000001
-d:case_normal: 0.7599413070000001
-d:case_inline: 0.5576130500000001
-d:case_generic: 0.7723141130000001
-d:case_generic_empty: 0.7951715500000001
-d:case_template: 0.5691230700000001

furthermore, `--passc:-flto` slows compile times drastically, by a factor of 3:

nim c -o:/tmp/D20200310T123701 --skipParentCfg --skipUserCfg --hints:off -d:danger -f compiler/nim.nim
13s
nim c -o:/tmp/D20200310T123701 --skipParentCfg --skipUserCfg --hints:off -d:danger --passc:-flto -f compiler/nim.nim
40s

conclusion:

As an alternative we could have hot, cold and inline, so today's inline would become hot and inline should always inline.

I want it the other way around: the common case is to let the C compiler do what it's good at (deciding what should be inlined, based on the whole context), and provide an option to force inlining for the very rare case where we want to override the compiler's smarts. That's pretty much exactly what my unsubmitted PR is about: see https://github.com/timotheecour/Nim/pull/57 (you can see a clean diff by selecting a range of the last few commits only)

proc fun() {.inline.} = discard # same semantics as before, common case
proc fun() {.alwaysinline.} = discard # forces inlining, mapping to compiler-specific attributes

Maybe I can rename it to `{.forceinline.}`; it's a trivial change. Note that `alwaysinline` as I introduced it does not add a new calling convention: it still uses the `ccInline` calling convention, but maps to `__attribute__((__always_inline__))` / `__forceinline`.

Araq commented 4 years ago

No, I cannot provide such an example (for now!), so the status quo seems sufficient. But I personally do not like non-inlined inline functions; I think I know what I'm doing when I use the inline keyword. To fight misapplications of inline I'd rather have better introspection into the asm code and better profiling support, instead of compilers ignoring my annotations.

furthermore, `--passc:-flto` slows compile times drastically, by a factor of 3

Actually a factor of 3 is surprisingly good IMO. And compile times are insignificant for release builds of software that is shipped to thousands of customers.

timotheecour commented 4 years ago

At least we now agree that we need (at least) two inlining annotations, whether it's my recommendation (hint = `{.inline.}`, force = `{.forceinline.}`) or yours (hint = `{.hot.}`, force = `{.inline.}`), for the specific case of allowing cross-module inlining.

I think I know what I'm doing when I use the inline keyword.

pgo for inlining decisions

Nim could learn to map profiler output to force-inline, force-noinline, or hint-inline (or other optimization strategies, e.g. branch prediction / likely / unlikely) on hotspots; that's definitely something that should be done at some point and wouldn't even be too hard to do. All you'd need is:

This can be made robust so you don't have to rerun PGO after source code changes (which lead to different mangled cgen'd names); there are lots of possibilities there.

compile times are insignificant for release builds of software that is shipped to thousands of customers

Of course, but that's very use-case dependent (e.g. during development you may want to balance compile time vs performance). And it doesn't always help despite the 3x compile-time slowdown; e.g. I just tried it, and nim built with `-d:danger --passc:-flto` has the same performance as nim built with `-d:danger`.

links

arnetheduck commented 2 years ago

Nim handles stack traces by explicitly adding redundant code that pushes and pops data, adding code size and memory traffic. In the world of stack traces this is not at all necessary: the CPU already records all the information needed to generate a stack trace as part of its "normal" control flow, and this information can be extracted using https://github.com/status-im/nim-libbacktrace

There's no way around the fact that the current implementation will always be ridiculously slow compared to simply reading the stack with libbacktrace, so the feature should really focus on debuggability rather than performance; as such, complicating things with special rules for inlining etc. seems counterproductive.
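To make the alternative concrete, here is roughly how the libbacktrace approach is wired up, going by the nim-libbacktrace README; the exact flags and the import are stated from memory and should be treated as assumptions:

```nim
# sketch: let nim-libbacktrace provide stack traces from debug info,
# instead of Nim's generated frame push/pop code
# assumed build flags (see the nim-libbacktrace README): --debugger:native -d:nimStackTraceOverride
when defined(nimStackTraceOverride):
  import libbacktrace  # registers itself via Nim's stack-trace override hook

proc boom() =
  raise newException(ValueError, "demo")  # the printed trace comes from the override, not from Nim's frame records

boom()
```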