Open oliv3r opened 1 month ago
Generally the idea is 'do not inline, let the compiler figure it out, it knows best.
No, because "best" is not a well-defined term here. You see that -O3
and -Os
define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os
is the MCU world where you may really still be restricted to a few KB of program memory.
gravityDB_finalize_client_statements(*client)
involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.
At this point, FTL isn't meant to be compiled with -Os
and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.
TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.
*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable
Generally the idea is 'do not inline, let the compiler figure it out, it knows best.
No, because "best" is not a well-defined term here. You see that
-O3
and-Os
define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for-Os
is the MCU world where you may really still be restricted to a few KB of program memory.
I beg to differ, but your point does stand. My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is. Also, -Os
has an more important impact, which is caching. Caches is still limited in a lot of CPU's because space for cache is expensive. Also being more effective in cache usage does improve speed and performance. Benchmarks would need to prove this of course.
So there certainly is still usage for Os in certain scenario's. However as I said, your point does stand and I do agree with it for the most part. ;)
gravityDB_finalize_client_statements(*client)
involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.
I'm more then happy to accept that you know what you are doing :) no need to prove anything :p
At this point, FTL isn't meant to be compiled with
-Os
and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.
And that's perfectly fine! So the answer to my question is 'no, we know what we are doing'. And that's great :) I should have potentially given more context, and that is I was trying to use the CMake target MinSizeRel
because for the some parts, (openwrt, alpine, etc) it's the commonly picked target. This lead me down a path of compilation errors, where compilation was no longer possible.
That means, that https://github.com/pi-hole/FTL/blob/8943e26041c8730a5a9060eb64f76c6c2dd4e458/src/CMakeLists.txt#L106 needs to be either removed, as it's not even valid, or (a bit better imo) is to either disable -Winline
for that target (could we put -Wnonline
there? or increase the inline warning limit (to which I had no success with so far).
TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.
Well then you conclude say we should actually remove MinSizeRel
!
*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable
:+1:
My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is.
Ah, yes. We do not "support" any embedded devices out of the box so that's why I haven't thought about them. Sure, there are so many different devices out there that not everyone will have a USB port or whatsover with which you could extend storage for the hundreds GBs with only couple of dollars. This context indeed sheds another light on your endeavor - one which I can easily follow.
Well then you conclude say we should actually remove
MinSizeRel
!
I have no objections. A quick look on git blame
on this line reveals it was an external contribution and (assumption ahead!) probably originated from some default template just as the other two (DEBUG
and RELEASE
).
The Pi-hole team itself solely uses the RELWITHDEBINFO
target both with our internal tooling as well as with the recommended compiling steps packed together into build.sh
. I'd be ready to say that MinSizeRel
was probably tried back when this was added but never since again.
Hi, (newbiew here) Why aren't we using -Ofast instead of -O3 ?
@DL6ER so for me, I'm just building with Release
and I'm done; but I suppose you guys have to make a choice here. Or at least have test targets that build them ;)
So I'll leave this issue open and you can close it with whatever solution you feel fits best.
I'm just catching up on what is open here, it's been some busy weeks, sorry for the huge delay in replying.
Why aren't we using -Ofast instead of -O3 ?
Let me quote the gcc
documentation on optimization settings directly:
Disregard strict standards compliance.
-Ofast
enables all-O3
optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on-ffast-math
,-fallow-store-data-races
and the Fortran-specific-fstack-arrays
, unless-fmax-stack-var-size
is specified, and-fno-protect-parens
. It turns off-fsemantic-interposition
.
It means -O3
plus some heavy math optimizations that makes floating point computations faster sacrificing accuracy. While it'd probably be fine to use it in Pi-hole, I doubt it'd bring up any real speed benefit as FTL is not a scientific (= number-crunching) application. Floating point arithmetic really makes up less than 1% of the run-time and only this less than 1% would be made slightly faster (while, saying it again, sacrificing accuracy). I don't think this is justified.
So I'll leave this issue open and you can close it with whatever solution you feel fits best.
I'm open to anything. We want to keep the CI targets manageable - mostly because we run them partially on our own hardware as Github doesn't offer powerful ARM targets (the Mac targets are surprisingly slow). So I'd strip out anything besides Release
, ReleaseWithDebug
and Debug
(while we'd still be testing ReleaseWithDebug
only because that's what we want to ship).
This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.
Versions
Compiling FTL with a recent GCC and the 'MinSizeRel' CMake target (e.g. -Os) triggers a lot of
like errors.
Here, GCC is basically telling us 'look, you wanted to inline this, but it's less efficient but you are asking to do
-Os
.Should we force inline functions at all, or should gcc figure things out by itself with
-Os
and-O3
. Generally the idea is 'do not inline, let the compiler figure it out, it knows best.