Overzealous inline opitimizations?

oliv3r commented 1 month ago

Versions

FTL: 5.25.2

Compiling FTL with a recent GCC and the 'MinSizeRel' CMake target (e.g. -Os) triggers a lot of

FTL-5.25.2/src/database/gravity-db.c:944:20: error: inlining failed in call to 'gravityDB_finalize_client_statements': call is unlikely and code size would grow [-Werror=inline]
  944 | static inline void gravityDB_finalize_client_statements(clientsData *client)
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/workdir/src/FTL-5.25.2/src/database/gravity-db.c:1252:9: note: called from here
 1252 |         gravityDB_finalize_client_statements(client);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

like errors.

Here, GCC is basically telling us 'look, you wanted to inline this, but it's less efficient but you are asking to do -Os.

Should we force inline functions at all, or should gcc figure things out by itself with -Os and -O3. Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

DL6ER commented 1 month ago

Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

No, because "best" is not a well-defined term here. You see that -O3 and -Os define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os is the MCU world where you may really still be restricted to a few KB of program memory.

gravityDB_finalize_client_statements(*client) involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.

At this point, FTL isn't meant to be compiled with -Os and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.

TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.

*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable

oliv3r commented 1 month ago

Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

No, because "best" is not a well-defined term here. You see that -O3 and -Os define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os is the MCU world where you may really still be restricted to a few KB of program memory.

I beg to differ, but your point does stand. My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is. Also, -Os has an more important impact, which is caching. Caches is still limited in a lot of CPU's because space for cache is expensive. Also being more effective in cache usage does improve speed and performance. Benchmarks would need to prove this of course.

So there certainly is still usage for Os in certain scenario's. However as I said, your point does stand and I do agree with it for the most part. ;)

gravityDB_finalize_client_statements(*client) involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.

I'm more then happy to accept that you know what you are doing :) no need to prove anything :p

At this point, FTL isn't meant to be compiled with -Os and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.

And that's perfectly fine! So the answer to my question is 'no, we know what we are doing'. And that's great :) I should have potentially given more context, and that is I was trying to use the CMake target MinSizeRel because for the some parts, (openwrt, alpine, etc) it's the commonly picked target. This lead me down a path of compilation errors, where compilation was no longer possible.

That means, that https://github.com/pi-hole/FTL/blob/8943e26041c8730a5a9060eb64f76c6c2dd4e458/src/CMakeLists.txt#L106 needs to be either removed, as it's not even valid, or (a bit better imo) is to either disable -Winline for that target (could we put -Wnonline there? or increase the inline warning limit (to which I had no success with so far).

TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.

Well then you conclude say we should actually remove MinSizeRel!

*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable

:+1:

DL6ER commented 1 month ago

My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is.

Ah, yes. We do not "support" any embedded devices out of the box so that's why I haven't thought about them. Sure, there are so many different devices out there that not everyone will have a USB port or whatsover with which you could extend storage for the hundreds GBs with only couple of dollars. This context indeed sheds another light on your endeavor - one which I can easily follow.

Well then you conclude say we should actually remove MinSizeRel!

I have no objections. A quick look on git blame on this line reveals it was an external contribution and (assumption ahead!) probably originated from some default template just as the other two (DEBUG and RELEASE).

The Pi-hole team itself solely uses the RELWITHDEBINFO target both with our internal tooling as well as with the recommended compiling steps packed together into build.sh. I'd be ready to say that MinSizeRel was probably tried back when this was added but never since again.

kocburak commented 1 month ago

Hi, (newbiew here) Why aren't we using -Ofast instead of -O3 ?

oliv3r commented 1 month ago

@DL6ER so for me, I'm just building with Release and I'm done; but I suppose you guys have to make a choice here. Or at least have test targets that build them ;)

So I'll leave this issue open and you can close it with whatever solution you feel fits best.

DL6ER commented 1 month ago

I'm just catching up on what is open here, it's been some busy weeks, sorry for the huge delay in replying.

Why aren't we using -Ofast instead of -O3 ?

Let me quote the gcc documentation on optimization settings directly:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.

It means -O3 plus some heavy math optimizations that makes floating point computations faster sacrificing accuracy. While it'd probably be fine to use it in Pi-hole, I doubt it'd bring up any real speed benefit as FTL is not a scientific (= number-crunching) application. Floating point arithmetic really makes up less than 1% of the run-time and only this less than 1% would be made slightly faster (while, saying it again, sacrificing accuracy). I don't think this is justified.

So I'll leave this issue open and you can close it with whatever solution you feel fits best.

I'm open to anything. We want to keep the CI targets manageable - mostly because we run them partially on our own hardware as Github doesn't offer powerful ARM targets (the Mac targets are surprisingly slow). So I'd strip out anything besides Release, ReleaseWithDebug and Debug (while we'd still be testing ReleaseWithDebug only because that's what we want to ship).

github-actions[bot] commented 2 days ago

This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.

pi-hole / FTL

Overzealous inline opitimizations? #1952

Versions