Open timotheecour opened 4 years ago
> So we make debugging harder, introduce a bunch of new arguments/pragmas that make everything unnecessarily more complex, instead of just using the already existing `{.push stackTrace: off.}` when you really need it.
so your solution, when a program (eg, nim itself) compiled with `--stacktrace:on` is too slow, is to edit the sources and sprinkle `{.push stackTrace: off.}` / `{.push stackTrace: on.}` around suspected bottlenecks? You can't just permanently add `{.push stackTrace: off.}` in each suspected location where it could be a bottleneck; that would affect debuggability of those locations, with no easy way to get traces (unless, again, you edit the sources).
I don't see how that's practical, and that's obviously much more complex for users than what I've proposed and implemented: simply compile with `--stacktrace:on/off/inline` depending on what stacktraces you want. This should cover the vast majority of cases, because inline procs happen to be the performance-sensitive ones (eg very basic things like math operations). `{.push stackTrace: on.}` is always an option beyond that (profile-based tuning using PGO could also be an option, but that's out of scope for this discussion).
> So we make debugging harder

no, there's a mode that preserves pre-existing semantics; it's just a question of whether to default to the old mode or the new mode.
> All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline.

well, I've updated https://github.com/nim-lang/Nim/pull/13582 so that existing semantics are unchanged; instead, the new `--stacktrace:noinline` can give you large speedups while typically producing an identical stacktrace, ie a good compromise between speed and debuggability; see https://github.com/nim-lang/Nim/pull/13582#issuecomment-594617950 for details
> I'd be happier with templates instead of inline procs as inline procs are currently not inlined in debug builds (from https://github.com/nim-lang/Nim/pull/13536#issuecomment-593911621) All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline.
I did some further investigation, here are the conclusions using clang on OSX (should mostly generalize to gcc on linux):
* `{.inline.}` procs are indeed not inlined in debug builds, but the codegen doesn't care whether it's a debug build or `-d:release` or `-d:danger` as far as `{.inline.}` is concerned; it's instead related to the lack of C compiler optimization (eg via `--passC:-O` or `--opt:speed`); we may want to consider whether a standard debug build (`nim c -r main.nim`) should imply `--opt:speed` by default, but that should be discussed elsewhere
* changing `N_INLINE` from `inline` to `__attribute__((__always_inline__))` would not improve speed with a standard debug build, and in some cases could make things worse (as also noted in other articles, eg https://aras-p.info/blog/2017/10/09/Forced-Inlining-Might-Be-Slow/), eg with `inline` (status quo):

```
./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on -f compiler/nim.nim
bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim
=> 55 seconds
```

with `inline` changed to `__attribute__((__always_inline__))`:

```
./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on -f compiler/nim.nim
bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim
=> 113.9 seconds
```
* however, in combination with `--passC:-Og` (equivalent to `--passC:-O1`), `__attribute__((__always_inline__))` does do inlining whereas `inline` does not.
* passing `--passC:-O` or `--opt:speed` greatly improves speed of debug builds (builds without -d:release or -d:danger)
```
./bin/nim c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp1 --stacktrace:on --opt:speed -f compiler/nim.nim
bin/nim_temp1 c --skipParentCfg --skipUserCfg --hints:off -o:bin/nim_temp2 -f --stacktrace:off compiler/nim.nim
=> 19 seconds
```
here's a minimal but illustrative example:
```nim
when defined(case5):
  import times
  when defined(case_normal):
    proc fun[T](x, i: T): T = x + i
  when defined(case_inline):
    proc fun[T](x, i: T): T {.inline.} = x + i
  when defined(case_noinline):
    proc fun[T](x, i: T): T {.noinline.} = x + i
  when defined(case_alwaysinline):
    # modify `N_INLINE` in nimbase.h and replace `inline` by `__attribute__((__always_inline__))`
    proc fun[T](x, i: T): T {.inline.} = x + i
  when defined(case_template):
    template fun[T](x, i: T): T = x + i

  proc main() =
    var x: uint = 1
    let n: uint = 1_000_000_000
    var i = uint(0)
    while true:
      x = fun(x, i)
      if i == n: break
      i = i + 1
    doAssert x != 1234

  let t = cpuTime()
  main()
  echo cpuTime() - t

when defined(case5_runner):
  import os, strformat
  proc fun(opt: string) =
    const nim = getCurrentCompilerExe()
    let input = currentSourcePath
    let cmd = fmt"{nim} c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f {opt} {input}"
    echo cmd
    doAssert execShellCmd(cmd) == 0
  for opt2 in ["", "--passC:-Og", "--passC:-O", "-d:danger"]:
    echo "\nopt2: ", opt2
    fun fmt"-d:case_normal {opt2}"
    fun fmt"-d:case_inline {opt2}"
    fun fmt"-d:case_noinline {opt2}"
    fun fmt"-d:case_template {opt2}"
    # fun fmt"-d:case_alwaysinline {opt2}"
```
as you can see:

* with `--passC:-O`, plain procs, `{.inline.}` and `template` all have the same performance, while `{.noinline.}` is predictably much slower in this example
* with `--passC:-Og`, `template` is the only thing that actually inlines, but if we change `inline` to `__attribute__((__always_inline__))`, `{.inline.}` gets inlined too
* without optimization flags, plain procs, `{.inline.}` and `{.noinline.}` all have the same (bad) performance, while `template` has less bad performance
* so the only benefit of using `__attribute__((__always_inline__))` instead of `inline` in `N_INLINE` in `nimbase.h` would be for `{.inline.}` in combination with `--passC:-Og`, which inlines only with `__attribute__((__always_inline__))`
```
opt2:
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_normal /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
3.428681
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_inline /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
3.402358
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_noinline /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
3.359203
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_template /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
1.759311

opt2: --passC:-Og
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_normal --passC:-Og /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
0.979641
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_inline --passC:-Og /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
1.172885
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_noinline --passC:-Og /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
0.964913
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_template --passC:-Og /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
9.999999999992654e-07

opt2: --passC:-O
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_normal --passC:-O /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
9.999999999992654e-07
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_inline --passC:-O /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
1.000000000000133e-06
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_noinline --passC:-O /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
0.978355
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_template --passC:-O /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
2.000000000000265e-06

opt2: -d:danger
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_normal -d:danger /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
1.000000000000133e-06
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_inline -d:danger /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
1.000000000000133e-06
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_noinline -d:danger /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
0.940134
/Users/timothee/git_clone/nim/Nim_prs/bin/nim.devel c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f -d:case_template -d:danger /Users/timothee/git_clone/nim/timn/tests/nim/all/t10303.nim
2.000000000000265e-06
```
* none of this affects the validity of https://github.com/nim-lang/Nim/pull/13582, since that PR is about improving the speed of code with stacktraces even for optimized builds (it improves speed in all optimization modes)
## links
* https://indico.cern.ch/event/386232/sessions/159923/attachments/771039/1057534/always_inline_performance.pdf
* https://aras-p.info/blog/2017/10/09/Forced-Inlining-Might-Be-Slow/
It's not about speed really, `.inline` should inline. If it results in slower code then we misapplied the `.inline` pragma. It's about being more predictable.
> It's not about speed really, .inline should inline. If it results in slower code then we misapplied the .inline pragma. It's about being more predictable.
this would be bad for performance. It's not as simple as that: the source code at a proc declaration has very limited context to decide whether to inline, unlike the C compiler, which has much more relevant context (compile flags, params resolved to concrete types, the candidate's insertion inside an inlining context); hence the articles I linked in my previous post, which show that force_inline may be slower than `inline`. In fact, the same `inline` proc can end up being inlined or not in different translation units (perhaps even within one translation unit) depending on context, and that's a good thing for performance.
I actually started with a branch (which I decided not to submit as a PR) where I introduced another pragma `{.alwaysInline.}` with the semantics:

* `{.inline.}` maps to C `inline` (same semantics as before; it's just a hint, the compiler is free to ignore it depending on context and flags)
* `{.alwaysInline.}` maps to C `__attribute__((__always_inline__))` (forces inlining except in specific conditions, eg recursive calls)

but I realized that's really not the main thing at play here; the main thing was that passing `--opt:speed` (or `--passC:-O` or even `--passC:-Og`) made all the difference.
Speed should be the main criterion here, correctness is already guaranteed by calling convention regardless of whether it ends up being inlined or not (eg, there won't be link or multiple definition errors regardless of whether it gets inlined or not).
I have nothing to add really. inline should inline, period. It's a way to override the compiler's optimizer by saying "I do know better, trust me". There are no other reasons for inline's existence.

> There are no other reasons for inline's existence
yes there is a very important one: it tells the compiler that, when crossing module boundaries, the foreign function may be inlined. With optimizations turned on (eg -d:danger):
here's a minimal test case showing this:

```
git clone https://github.com/timotheecour/vitanim && cd vitanim
nim c -r testcases/t10334.nim # runs all tests, see https://github.com/timotheecour/vitanim/blob/master/testcases/t10334.nim
```
```
-d:case_normal_same_module: 0.618448226
-d:case_normal: 12.580990924
-d:case_inline: 0.586698406
-d:case_generic: 13.3243305
-d:case_generic_empty: 13.505951988
-d:case_template: 0.605310353
```
this shows `-d:case_normal` (no `{.inline.}`) did not inline despite the `-d:danger` flag, unlike `-d:case_inline`, resulting in a 20X slowdown.

so `{.inline.}` is very useful to allow the compiler to optimize across module boundaries, but mapping `{.inline.}` to `always_inline` is not a good idea, as it will have negative performance consequences depending on the context where the inlined function is used. Otherwise you'll end up in a situation where there are only bad choices: either force inlining everywhere (with the downsides shown in the linked articles), or lose cross-module inlining as in t10334.nim (20X slower).

Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue.
As an alternative we could have `hot`, `cold` and `inline`, so today's `inline` would become `hot` and `inline` should always inline.
> Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue.

`--passc:-flto` (I had originally tried `--passl:-flto` but that one did not help) improves cross-module inlining a lot for procs not marked as `{.inline.}`, but performance is still measurably worse than with `{.inline.}` procs: it's about 1.35X slower:
```
-d:case_normal_same_module: 0.5746762430000001
-d:case_normal: 0.7599413070000001
-d:case_inline: 0.5576130500000001
-d:case_generic: 0.7723141130000001
-d:case_generic_empty: 0.7951715500000001
-d:case_template: 0.5691230700000001
```
furthermore, `--passc:-flto` slows compile times drastically, by a factor of 3X:

```
nim c -o:/tmp/D20200310T123701 --skipParentCfg --skipUserCfg --hints:off -d:danger -f compiler/nim.nim
=> 13s
nim c -o:/tmp/D20200310T123701 --skipParentCfg --skipUserCfg --hints:off -d:danger --passc:-flto -f compiler/nim.nim
=> 40s
```
conclusion:

* `--passc:-flto` can improve runtime performance considerably (here by 15X) and should be featured prominently in the docs (maybe this could be added to a new optimization flag `--opt:aggressive` along with other relevant flags, meaning: sacrifice compile-time performance for runtime performance)
* `--passc:-flto` increases compile time by 3X
* even with `--passc:-flto`, runtime performance is still not on par with having marked procs as `{.inline.}`, even in this simple example (1.35X slower)

> As an alternative we could have hot, cold and inline, so today's inline would become hot and inline should always inline.
changing `{.inline.}`'s meaning is too disruptive and a performance breaking change; too much code out there relies on `{.inline.}`, and you'd end up with code that runs fast with nim 1.0.X and slow on nim devel, or vice versa. It would also encourage people to reach for `{.inline.}` thinking they can outsmart the C compiler optimizer, without understanding the subtleties regarding inlining. I want the other way around: the common case is to let the C compiler do what it's good at (optimizing what should be inlined, based on the whole context), while providing an option to force inlining for the very rare case where we want to override the compiler smarts. That's pretty much exactly what my unsubmitted PR is about: see https://github.com/timotheecour/Nim/pull/57 (you can see a clean diff by selecting a range of the last few commits only)

```nim
proc fun() {.inline.} = discard # same semantics as before, common case
proc fun() {.alwaysinline.} = discard # forces inlining, mapping to compiler-specific attributes
```

maybe I can rename it to `{.forceinline.}`, it's a trivial change
Note that `alwaysinline` as I introduced it does not introduce a new calling convention; it's still using the `ccInline` calling convention, but maps to `__attribute__((__always_inline__))` / `__forceinline`.
> [can you show an example where performance would improve by] changing `inline` to `__attribute__((__always_inline__))` in nimbase.h? (ie, running `nim c -d:danger main` would become faster for some simple main.nim simply by changing that line in nimbase.h)

No, I cannot provide such an example (for now!) so the status quo seems sufficient. But I personally do not like non-inlined inline functions; I think I know what I'm doing when I use the inline keyword. To fight misapplications of inline I'd rather have better introspection into the asm code and profiling support instead of compilers ignoring my annotations.
> furthermore, --passc:-flto slows compile times drastically, by a factor 3X:

Actually a factor of 3 is surprisingly good IMO. And compile-times are insignificant for release builds of software that is shipped to thousands of customers.
At least we now agree we need (at least) 2 inlining annotations, whether it's my recommendation (hint = `{.inline.}`, force = `{.forceinline.}`) or yours (hint = `{.hot.}`, force = `{.inline.}`), for the specific case of allowing cross-module inlining.
> I think I know what I'm doing when I use the inline keyword.

you may, but plenty won't, and there's no escaping the fact that depending on context, inlining may or may not be desirable for a given proc, which is why in 99% of cases leaving it to the optimizing C compiler is the right approach (while allowing it to inline across modules via `{.inline.}`). That's especially true for general-purpose procs (stdlib or thirdparty), which can be used in widely different contexts. It could even depend on the types for generic procs!

one specific case where we can outsmart the C compiler is via profiling (eg PGO) for a specific application / input distribution etc, but the corresponding inline/noinline suggestion shouldn't be hard-coded in the source code.
Nim can learn some way to map a profiler output to force-inline or force-noinline or hint-inline (or other optimization strategies, eg branch prediction / likely / unlikely) on hotspots, that's definitely something that should be done at some point and wouldn't even be too hard to do. All you'd need is:
```
nim c -r --pgo:outputfile main args...    # generate PGO / profiling output on a specified test case
nim c -r --pgoUse:inputfile main args...  # use the recorded profile
```
this can be made robust to avoid having to rerun PGO after source code changes (leading to different mangled cgen'd names); there are lots of possibilities there.
> compile-times are insignificant for release builds of software that is shipped to thousands of customers

of course, but that's very use-case dependent (eg during development you may want to balance compile time vs performance). And it doesn't always help despite the 3X compile-time slowdown, eg: I just tried, and nim built with `-d:danger --passc:-flto` has the same performance as nim built with `-d:danger`.
Nim handles stack traces by explicitly adding redundant code that pushes and pops data, adding code size and memory traffic. In the world of stack traces, this is not at all necessary: the CPU already records all the information necessary to generate a stack trace as part of its "normal" control flow implementation; this information can be extracted using https://github.com/status-im/nim-libbacktrace

There's no way around the fact that the current implementation will always be ridiculously slow compared to simply reading the stack with libbacktrace, so the feature should really focus on debuggability rather than performance, and as such, complicating things with special rules for inlining etc seems counterproductive.
## RFC proposals

some of these (especially P1) will result in a large speedup for builds with --stacktrace:on (eg debug builds):
* [ ] P1: an `{.inline.}` proc should by default NOT generate `nimfr_` + `popFrame` (regardless of `-d:release` or not), even with `--stacktrace:on`; otherwise `--stacktrace:on` can be very slow
* [ ] P2: a new option `--stacktrace:inline` should be introduced (implies `--stacktrace:on`) to generate `nimfr_` + `popFrame` even for `{.inline.}`
* [ ] P3: currently `proc fun2x(a: int): int {.inline.}` generates `static N_INLINE(NI, fun2x__rxUfcN4ZvcApCx0waSVH8At10303b)(NI a) {...}` regardless of `-d:danger` or not; according to https://stackoverflow.com/questions/25602813/force-a-function-to-be-inline-in-clang-llvm it seems `static` should maybe not be used; we should investigate to make sure it works as intended, and whether `__attribute__((always_inline))` is more appropriate (cf hints vs guarantees)
* [ ] P4: `#define N_INLINE_PTR(rettype, name) rettype (*name)` is dead code: there is no use of `N_INLINE_PTR` (and no idea what it's for)
* [ ] P5: we should have a compiler option `--ignoreInline` to skip inlining (for debug and maybe even release builds, it's an orthogonal decision); this helps with debugging, esp stacktraces => this would only control nim code that cgen's `inline`, and would NOT be conflated with C-specific flags that control inlining, so as not to interfere with other libraries; the user can always pass `--passC:XXX` in addition if needed
* [ ] P6: `--ignoreInline` shall be usable on the cmdline as well as in `{.push.}`/`{.pop.}` sections, just like `--rangeChecks` + friends
* [ ] P7: an imported `{.inline.}` proc currently gets its C source code duplicated in every module where it's used, which could potentially slow down C compilation (but no need to fix if benchmarking reveals it's not a bottleneck); it's some kind of double inlining since we already have the `N_INLINE` macro: `static N_INLINE(NI, fun2x__rxUfcN4ZvcApCx0waSVH8At10303b)(NI a) {...}`. Instead we can use what's recommended here https://gustedt.wordpress.com/2010/11/29/myth-and-reality-about-inline-in-c99/ and here https://stackoverflow.com/a/43442230/1426932 and put the inline proc definition in a header file, `inline fun2x__rxUfcN4ZvcApCx0waSVH8At10303b(NI a) { ... }`, imported by all clients of it, and then use `extern` in a single location in a source file