__ir_pure seems much more expensive in large projects

p0nce commented 1 year ago

Reference: https://github.com/AuburnSounds/intel-intrinsics/issues/130

Problem statement It seems some functions that instantiate a template get a lot more expensive to generate code for, in larger projects. EDIT: actually, it seems everything that uses __ir_pure pays a (growing) cost

Example:

_mm_unpacklo_epi8 takes 1ms 45us in https://github.com/AuburnSounds/intel-intrinsics/issues/130
but in a larger project, the same kind of function, such as _mm_unpackhi_ps, takes more than 50ms.

How to reproduce You can reproduce by building this project: https://github.com/AuburnSounds/Dplug/tree/master/examples/clipit with LDC 1.32.1 and --ftime-trace (type dub --combined). All intrinsics with __ir_pure take about 50ms. intel-intrinsics then become a very significant contributor to total build times (this is not the -g regression).

_At first I thought this was all about ldc.simd.shufflevector having too many CTFE, when when precomputing the LLVM IR and using __ir_pure instead the performance of build is the same, or even reduced._

p0nce commented 1 year ago

And that can be avoided by not using __ir_pure

kinke commented 8 months ago

I've just stumbled upon: https://github.com/ldc-developers/ldc/blob/3eb31901b3d253b3f23b8599111d930994696ef7/gen/toir.cpp#L718

This means that the IR fragment is generated (in its own IR module) and then 'linked' for every call site, instead of once per template instantiation of __ir[Ex][_pure]. I'm pretty sure this makes it extremely slow.

kinke commented 8 months ago

Confirmed:

import ldc.llvmasm;

version (all)
{
    // fast variant - IR-inlining once
    pragma(inline, true)
    double muladdFast(double a, double b, double c)
    {
        return __ir!(`%p = fmul fast double %0, %1
                             %r = fadd fast double %p, %2
                             ret double %r`,
                             double, double, double, double)(a, b, c);
    }
}
else
{
    // slow variant - IR-inlining at every call site
    alias muladdFast = __ir!(`%p = fmul fast double %0, %1
                             %r = fadd fast double %p, %2
                             ret double %r`,
                             double, double, double, double);
}

static foreach (int i; 0 .. 5_000)
{
    mixin("double foo" ~ i.stringof ~ `(double a, double b, double c) {
        return muladdFast(a, b, c);
    }`);
}

On my box, the fast variant compiles in one second, while the slow one takes 7 seconds. So aliasing an __ir* template instantiation should be avoided, wrapping it instead to IR-inline once.

kinke commented 8 months ago

dub --combined

AFAIK, this compiles everything (incl. all dub deps, direct and indirect) to a single object file (= IR module). Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files. So I'd try without --combined (brrr, single-threaded build, and potentially huge mem requirements for the single compiler invocation), and letting LTO or pragma(inline, true) take care of cross-module optimizations.

p0nce commented 8 months ago

My fear with pragma(inline, true) is that without --combined the body of ALL intrinsics will get built by every dependees, making the task of compiling intel-intrinsics each time quite expensive. For now, I just avoid __ir in debug builds which works ok. Say I have 17 packages, that 17 builds of every function body, and there are over a thousands intrinsics, not all of them even need to be inlined. Until now I had trouble everytime getting them to inline when needed without --combined

Should I pragma(inline, true) only those intrinsics that have __ir ? That's possible of course.

The big problem is that there are two meanings to pragma(inline, true) => always inline, and always export the body (header generation) I need one without the other. I never want to force the compiler to inline. And similarly, in pragma(inline, cond) I don't always want a false to mean it shouldn't ever inline.

kinke commented 8 months ago

Well with --combined, that's definitely what dependees get, everything needs to be built at once, into a single huge object file (similar to full LTO, at compile-time). Without --combined, they'll just link a built-once static library of intel-intrinsics, with one object file per D module.

If a pragma(inline, true) function is used by the dependee, it is semantically analyzed once when compiling the dependee, and codegen'd into every referencing IR module ('object file') of the dependee (so that the optimizer can inline it later). So only used pragma(inline, true) functions are codegen'd when compiling the dependee.

p0nce commented 8 months ago

I need to try and see if there is any performance loss. I trust the huge single object file and have not had the same experience with multiple translation units in the past. Also it tends to be quicker to full build.

Also do you agree that without pragma(inline, xxx) the compiler will be able to inline or not in codegen? It has real benefits for performance.

kinke commented 8 months ago

The big problem is that there are two meanings to pragma(inline, true) => always inline, and always export the body (header generation)

Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.

Also it tends to be quicker to full build.

Oh well, dub... I recommend reggae for builds taking more than a few seconds. That enables parallel and proper incremental builds. And easily allows to add D flags for the whole build.

Also do you agree that without pragma(inline, xxx) the compiler will be able to inline or not in codegen? It has real benefits for performance.

Not sure what you mean. In your case, if a dub package wraps intrinsics, I'd expect ~every 'intrinsic' to be marked with pragma(inline, true), so that it's a proper inlined intrinsic, regardless of how the dependee's build looks like.

PS: It's hard to keep up with your many editings of your posts. ;)

p0nce commented 8 months ago

Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.

But you can want to have the body in a .di header while still not inlining the body all the time. In C++ there is inline and force_inline, so they don't have the issue? IIRC we have a single thing in dlang to mean those two things.

I'd expect ~every 'intrinsic' to be marked with pragma(inline, true), so that it's a proper inlined intrinsic

What if it's not faster? intel-intrinsics emulates what is missing in some arch. Some intrinsics like _mm_cmpestrs have a pretty complicated alternative path if SSE4.2 is absent in target. Some intrinsics may be useful not to inline (though I don't have a ready exemple, true).

Well this is frustrating, I point to 50ms slowdown for a single __ir_pure and while this is obviously a LLVM and LDC performance bottleneck I'm told that I should stop using a --combined flag that everyone uses, and use reggae (I don't want to!)

Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files.

Is there really no other way to do it?

kinke commented 8 months ago

Well this is frustrating, I point to 50ms slowdown for a single __ir_pure and while this is obviously a LLVM and LDC performance bottleneck I'm told that I should stop using a --combined flag that everyone uses, and use reggae (I don't want to!)

Hold on, I didn't tell you what to do, I'm just offering explanations and avenues to tackle the problem now, by changing the build. I'd never use --combined myself if I can help it [I'm actually using it for building the bundled reggae :D, but that's just laziness and overkill...].

p0nce commented 8 months ago

Yes, I'm sorry. I could probably try to pragma(inline, true) only the needed intrinsics (10% of them) and see what happens. This will also help with redub build actually. For now it's not a problem since release builds can pay the price, and debug builds just use the alt paths instead.

kinke commented 8 months ago

Another route for intrinsics are function literals:

alias muladdFast = (double a, double b, double c)
{
    return __ir!(`%p = fmul fast double %0, %1
                         %r = fadd fast double %p, %2
                         ret double %r`,
                         double, double, double, double)(a, b, c);
};

They have the nice property that they are only codegen'd into each referencing object file. So if intel-intrinsics only had these, it could be a header-only library, or otherwise still be very quick to compile (no lambda to compile). And dependees only codegen what they use (into every referencing object file => inlineable everywhere). They don't need a pragma(inline, true); if wanting to let the optimizer decide what to do.

Edit: This should be very close to the C++ inline semantics. In D, the function literals are codegen'd lazily though, not sure if that's the case with C++ too (i.e., whether the compiler skips codegen of an inline function if it isn't referenced anywhere in the preprocessed .cpp).

kinke commented 7 months ago

I've tried building that Dplug clipit example; some build timings and resulting libclipit.so sizes on my Ubuntu 22 box (24 physical CPU cores) using LDC v1.37.0 (and env var VST2_SDK="" to overcome build errors), best of 3:

dub build --config=VST3 -b release-nobounds --force: 17 secs, 4.4 MB
dub build --config=VST3 -b release-nobounds --force --combined: 15.5 secs, 4.3 MB
reggae --dub-config=VST3 --dub-build-type=release-nobounds && ninja: 3.6 secs, 4.4 MB
reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=full -linker=gold" && ninja: 3.9 secs, 4.7 MB
reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=thin -linker=gold" && ninja: 2.1 secs, 4.6 MB

[Note that a rm -rf .reggae .ninja_log is required before each reggae invocation to enforce a fresh build from scratch, to get comparable timings.]

Interestingly, thin LTO seems to speed-up the overall build significantly over non-LTO.

This is mainly to show that reggae is easy to use and way faster; and you might be able to check the performance of these libraries. I'd hope that the LTO reggae builds are on-par with the dub --combined one.

Edit: And with the debug dub build type: 7.8 secs with dub, 4.4 with --combined, and 1.3 with reggae (0.2 for reggae, and 1.1 for ninja).

p0nce commented 7 months ago

Well that's very interesting, possibly full LTO would perhaps yield superior performance (increase in code size might indicate higher inlining amount). And the gains are not too shabby, a bit like redub I think (which I don't use). I will schedule a test for this indeed.

p0nce commented 7 months ago

reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=thin -linker=gold" && ninja

I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage?

kinke commented 7 months ago

I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage?

Not sure what you mean - with the mentioned rm -rf .reggae .ninja_log, you get a full build from scratch every time, just for benchmarking or comparing with dub --force. Normal incremental builds are simply done via ninja, only re-compiling what has changed and then linking. reggae is similar to CMake.

p0nce commented 7 months ago

Well that's damn fast! Didn't expected that.

ldc-developers / ldc

__ir_pure seems much more expensive in large projects #4388