Open p0nce opened 1 year ago
And that can be avoided by not using __ir_pure
I've just stumbled upon: https://github.com/ldc-developers/ldc/blob/3eb31901b3d253b3f23b8599111d930994696ef7/gen/toir.cpp#L718
This means that the IR fragment is generated (in its own IR module) and then 'linked' for every call site, instead of once per template instantiation of __ir[Ex][_pure]. I'm pretty sure this makes it extremely slow.
Confirmed:
import ldc.llvmasm;

version (all)
{
    // fast variant - IR-inlining once
    pragma(inline, true)
    double muladdFast(double a, double b, double c)
    {
        return __ir!(`%p = fmul fast double %0, %1
                      %r = fadd fast double %p, %2
                      ret double %r`,
                     double, double, double, double)(a, b, c);
    }
}
else
{
    // slow variant - IR-inlining at every call site
    alias muladdFast = __ir!(`%p = fmul fast double %0, %1
                              %r = fadd fast double %p, %2
                              ret double %r`,
                             double, double, double, double);
}

static foreach (int i; 0 .. 5_000)
{
    mixin("double foo" ~ i.stringof ~ `(double a, double b, double c) {
        return muladdFast(a, b, c);
    }`);
}
On my box, the fast variant compiles in one second, while the slow one takes 7 seconds. So aliasing an __ir* template instantiation should be avoided; wrap it in a function instead, so that the IR is inlined only once.
dub --combined
AFAIK, this compiles everything (incl. all dub deps, direct and indirect) to a single object file (= IR module). Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files. So I'd try without --combined (brrr, single-threaded build, and potentially huge mem requirements for the single compiler invocation), and let LTO or pragma(inline, true) take care of cross-module optimizations.
My fear with pragma(inline, true) is that without --combined the body of ALL intrinsics will get built by every dependee, making the task of compiling intel-intrinsics each time quite expensive. For now, I just avoid __ir in debug builds, which works ok. Say I have 17 packages: that's 17 builds of every function body, and there are over a thousand intrinsics, not all of which even need to be inlined. Until now I had trouble every time getting them to inline when needed without --combined.
Should I pragma(inline, true) only those intrinsics that have __ir ? That's possible of course.
The big problem is that there are two meanings to pragma(inline, true): always inline, and always export the body (header generation). I need one without the other. I never want to force the compiler to inline. And similarly, in pragma(inline, cond) I don't always want a false to mean it shouldn't ever inline.
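To make the three cases under discussion concrete, here is a minimal sketch in plain D. The function names are hypothetical and chosen only for illustration; the comments on what LDC emits follow the explanations given in this thread, not a separate specification.

```d
// Hypothetical helper names, for illustration only.

// Forced inlining. Per this thread, LDC also codegens the body into
// every referencing object file so the optimizer can inline it there.
pragma(inline, true)
int addOneForced(int x) { return x + 1; }

// Inlining forbidden at every call site.
pragma(inline, false)
int addOneNever(int x) { return x + 1; }

// No pragma: the optimizer decides, but it can only inline where the
// body is visible (same object file, or later through LTO).
int addOneAuto(int x) { return x + 1; }

void main()
{
    // All three variants compute the same value; only codegen differs.
    assert(addOneForced(1) == 2);
    assert(addOneNever(1) == 2);
    assert(addOneAuto(1) == 2);
}
```

The complaint above is that there is no fourth option: a way to make the body available everywhere while still leaving the inlining decision to the optimizer.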
Well with --combined, that's definitely what dependees get: everything needs to be built at once, into a single huge object file (similar to full LTO, at compile-time). Without --combined, they'll just link a built-once static library of intel-intrinsics, with one object file per D module.
If a pragma(inline, true) function is used by the dependee, it is semantically analyzed once when compiling the dependee, and codegen'd into every referencing IR module ('object file') of the dependee (so that the optimizer can inline it later). So only used pragma(inline, true) functions are codegen'd when compiling the dependee.
I need to try and see if there is any performance loss. I trust the huge single object file and have not had the same experience with multiple translation units in the past. Also it tends to be quicker for a full build.
Also do you agree that without pragma(inline, xxx) the compiler will be able to inline or not in codegen? It has real benefits for performance.
> The big problem is that there are two meanings to pragma(inline, true) => always inline, and always export the body (header generation)
Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.
> Also it tends to be quicker to full build.
Oh well, dub... I recommend reggae for builds taking more than a few seconds. That enables parallel and proper incremental builds, and makes it easy to add D flags for the whole build.
> Also do you agree that without pragma(inline, xxx) the compiler will be able to inline or not in codegen? It has real benefits for performance.
Not sure what you mean. In your case, if a dub package wraps intrinsics, I'd expect ~every 'intrinsic' to be marked with pragma(inline, true), so that it's a proper inlined intrinsic, regardless of what the dependee's build looks like.
PS: It's hard to keep up with the many edits to your posts. ;)
> Well, if a function is supposed to be inlined at every call site, the .di header needs to contain the body.
But you can want to have the body in a .di header while still not inlining the body all the time.
In C++ there is inline and force_inline, so they don't have the issue?
IIRC we have a single thing in dlang to mean those two things.
> I'd expect ~every 'intrinsic' to be marked with pragma(inline, true), so that it's a proper inlined intrinsic
What if it's not faster? intel-intrinsics emulates what is missing in some arch.
Some intrinsics like _mm_cmpestrs have a pretty complicated alternative path if SSE4.2 is absent in the target.
Some intrinsics may be better left not inlined (though I don't have a ready example, true).
Well this is frustrating, I point to a 50ms slowdown for a single __ir_pure and while this is obviously an LLVM and LDC performance bottleneck I'm told that I should stop using a --combined flag that everyone uses, and use reggae (I don't want to!)
> Each inline IR fragment is generated in its own temporary IR module, which is then 'linked' into the referencing IR module. Maybe this linking step scales very badly with huge object files.
Is there really no other way to do it?
> Well this is frustrating, I point to 50ms slowdown for a single __ir_pure and while this is obviously a LLVM and LDC performance bottleneck I'm told that I should stop using a --combined flag that everyone uses, and use reggae (I don't want to!)
Hold on, I didn't tell you what to do, I'm just offering explanations and avenues to tackle the problem now, by changing the build. I'd never use --combined myself if I can help it [I'm actually using it for building the bundled reggae :D, but that's just laziness and overkill...].
Yes, I'm sorry. I could probably try to pragma(inline, true) only the needed intrinsics (10% of them) and see what happens. This will also help with redub build actually. For now it's not a problem since release builds can pay the price, and debug builds just use the alt paths instead.
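The "debug builds use the alt paths" arrangement mentioned above could look roughly like the sketch below. This is an assumption about the general shape, not the actual intel-intrinsics code: the wrapper name muladd and the plain-D fallback are hypothetical, and the inline IR is the fragment already shown earlier in this thread.

```d
// Hypothetical wrapper, for illustration only.
double muladd(double a, double b, double c)
{
    debug
    {
        // Debug builds: plain D, so no inline IR fragment has to be
        // generated and linked at all.
        return a * b + c;
    }
    else version (LDC)
    {
        // Non-debug LDC builds: wrapped __ir call, as recommended above.
        import ldc.llvmasm;
        return __ir!(`%p = fmul fast double %0, %1
                      %r = fadd fast double %p, %2
                      ret double %r`,
                     double, double, double, double)(a, b, c);
    }
    else
    {
        // Other compilers: same plain-D fallback.
        return a * b + c;
    }
}

void main()
{
    assert(muladd(2.0, 3.0, 4.0) == 10.0);
}
```

With this layout the cost of the inline IR is only paid in release builds, which matches the trade-off described in the comment above.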
Another route for intrinsics are function literals:
alias muladdFast = (double a, double b, double c)
{
    return __ir!(`%p = fmul fast double %0, %1
                  %r = fadd fast double %p, %2
                  ret double %r`,
                 double, double, double, double)(a, b, c);
};
They have the nice property that they are only codegen'd into each referencing object file. So if intel-intrinsics only had these, it could be a header-only library, or otherwise still be very quick to compile (no lambda to compile). And dependees only codegen what they use (into every referencing object file => inlineable everywhere). They don't need a pragma(inline, true); if wanting to let the optimizer decide what to do.
Edit: This should be very close to the C++ inline semantics. In D, the function literals are codegen'd lazily though, not sure if that's the case with C++ too (i.e., whether the compiler skips codegen of an inline function if it isn't referenced anywhere in the preprocessed .cpp).
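As a usage illustration of the function-literal route, here is a minimal sketch that builds with any D compiler. The body is plain D instead of inline IR, and the name muladdLit is hypothetical; only the alias-to-literal pattern is taken from the comment above.

```d
// Function literal bound to an alias, following the pattern above.
// No pragma(inline, ...) is attached, so the optimizer is free to decide.
alias muladdLit = (double a, double b, double c)
{
    return a * b + c;
};

void main()
{
    // Call sites look exactly like calls to a normal function.
    assert(muladdLit(2.0, 3.0, 4.0) == 10.0);
}
```

Since the call syntax is unchanged, switching an intrinsic between a regular function and a function literal should not require touching its callers.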
I've tried building that Dplug clipit example; some build timings and resulting libclipit.so sizes on my Ubuntu 22 box (24 physical CPU cores) using LDC v1.37.0 (and env var VST2_SDK="" to overcome build errors), best of 3:
- dub build --config=VST3 -b release-nobounds --force: 17 secs, 4.4 MB
- dub build --config=VST3 -b release-nobounds --force --combined: 15.5 secs, 4.3 MB
- reggae --dub-config=VST3 --dub-build-type=release-nobounds && ninja: 3.6 secs, 4.4 MB
- reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=full -linker=gold" && ninja: 3.9 secs, 4.7 MB
- reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=thin -linker=gold" && ninja: 2.1 secs, 4.6 MB
[Note that a rm -rf .reggae .ninja_log is required before each reggae invocation to enforce a fresh build from scratch, to get comparable timings.]
Interestingly, thin LTO seems to speed up the overall build significantly over non-LTO.
This is mainly to show that reggae is easy to use and way faster; and you might be able to check the performance of these libraries. I'd hope that the LTO reggae builds are on par with the dub --combined one.
Edit: And with the debug dub build type: 7.8 secs with dub, 4.4 with --combined, and 1.3 with reggae (0.2 for reggae, and 1.1 for ninja).
Well that's very interesting; full LTO would perhaps yield superior performance (the increase in code size might indicate more inlining). And the gains are not too shabby, a bit like redub I think (which I don't use). I will schedule a test for this indeed.
> reggae --dub-config=VST3 --dub-build-type=release-nobounds --dflags="--flto=thin -linker=gold" && ninja
I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage?
> I mean that this is not a "full" rebuild? But a full rebuild is also unneeded since inlining happens at link stage?
Not sure what you mean - with the mentioned rm -rf .reggae .ninja_log, you get a full build from scratch every time, just for benchmarking or comparing with dub --force. Normal incremental builds are simply done via ninja, only re-compiling what has changed and then linking. reggae is similar to CMake.
Well that's damn fast! Didn't expect that.
Reference: https://github.com/AuburnSounds/intel-intrinsics/issues/130
Problem statement
It seems some functions that instantiate a template get a lot more expensive to generate code for, in larger projects. EDIT: actually, it seems everything that uses __ir_pure pays a (growing) cost.
Example:
- _mm_unpacklo_epi8 takes 1ms 45us in https://github.com/AuburnSounds/intel-intrinsics/issues/130
- _mm_unpackhi_ps takes more than 50ms.
How to reproduce
You can reproduce by building this project: https://github.com/AuburnSounds/Dplug/tree/master/examples/clipit with LDC 1.32.1 and --ftime-trace (type dub --combined). All intrinsics with __ir_pure take about 50ms. intel-intrinsics then becomes a very significant contributor to total build times (this is not the -g regression).
At first I thought this was all about ldc.simd.shufflevector having too many CTFE, but when precomputing the LLVM IR and using __ir_pure instead, the build performance is the same, or even reduced.