p4lang / p4c

P4_16 reference compiler
https://p4.org/
Apache License 2.0
682 stars 446 forks source link

Inline some IR methods and constructors #5030

Closed asl closed 5 days ago

asl commented 1 week ago

It does not really make sense to emit them in ir-generated.cpp as they are mostly one-liners and could be further simplified.

asl commented 1 week ago

@fruffy Looks like mac runner image is again broken. But why do we need pkg-config there?

asl commented 1 week ago

What is the benefit of this change? For me it makes sense to have all the implementations in one place. Otherwise you end up jumping between files when reading up on a class.

The main purpose of this is to ensure that simple methods are properly inlined. E.g. does not make any sense to outline one-liners like

    inline bool isBuiltin() const { return name == ParserState::accept || name == ParserState::reject; }

We just pay extra function call for no reason. Defining things inline ensures compiler is able to optimize them further on.

fruffy commented 1 week ago

Is there a measurable effect on performance? We should be mindful not to trade of too much usability/readability and binary size for this.

vlstill commented 1 week ago

In case the compile-time impact is high, how does this change compare to enabling LTO in the build? What I remember from Tofino days, we had rather significant speedup by enabling LTO, presumably exactly because it allows inlining more functions. The advantage is that LTO can be enabled only for release builds so normal development build speeds are not affected.

On a slight tangent, there was an idea of speeding up compilation by using PCH (precompiled header in GCC/clang) for ir-generated.h. As far as I know, it was never tried and it was not completely clear if it was doable, but maybe it is worth investigating for the compilation speed (I don't think C++ modules will "save" us in in any reasonably close future giving their current state and more importantly the tool support state and requirements for P4C build).

asl commented 1 week ago

In case the compile-time impact is high, how does this change compare to enabling LTO in the build? What I remember from Tofino days, we had rather significant speedup by enabling LTO, presumably exactly because it allows inlining more functions. The advantage is that LTO can be enabled only for release builds so normal development build speeds are not affected.

I believe LTO contributes a lot to devirtualization and further optimization opportunities uncovered due to this.

On a slight tangent, there was an idea of speeding up compilation by using PCH (precompiled header in GCC/clang) for ir-generated.h. As far as I know, it was never tried and it was not completely clear if it was doable, but maybe it is worth investigating for the compilation speed

This is good idea, yes, that is worth trying. Modules are likely unfeasible due to circular dependencies here and there. But PCH / PTH might give a decent speedup.

asl commented 1 week ago

So, the performance results are interesting.

Here are the build time for ir-generated.cpp:

main:

Executed in   37.11 secs    fish           external
   usr time   36.05 secs    0.14 millis   36.05 secs
   sys time    0.87 secs    3.76 millis    0.87 secs

ir-inline:

Executed in   31.97 secs    fish           external
   usr time   30.93 secs    0.09 millis   30.93 secs
   sys time    0.76 secs    4.82 millis    0.76 secs

These numbers are pretty stable, so for me it's ~38 second on average on main and ~32 seconds on ir-inline.

The compile time for "ordinary" C++ files looks similar to me, I'm showing individual files here, but averages from multiple runs are pretty much the same:

ir/base.cpp: main:

Executed in    3.01 secs    fish           external
   usr time    3.63 secs    0.12 millis    3.63 secs
   sys time    0.86 secs    2.33 millis    0.86 secs

ir-inline:

Executed in    3.09 secs    fish           external
   usr time    3.69 secs    0.12 millis    3.69 secs
   sys time    0.84 secs    2.25 millis    0.84 secs

And here is frontend/def_use.cpp: main:

Executed in    5.13 secs    fish           external
   usr time    5.65 secs    0.25 millis    5.65 secs
   sys time    0.91 secs    1.63 millis    0.91 secs

ir-inline:

Executed in    5.19 secs    fish           external
   usr time    5.76 secs    0.14 millis    5.76 secs
   sys time    0.93 secs    2.52 millis    0.92 secs

The overall compile time for the whole p4c is pretty noisy. I tried to replicate it several times, but overall the differences are below the noise level.

The performance of the p4c itself improved slightly. Which is even surprising for pretty flat profile dominated by memory allocation, copies and constant clones :) Sadly, I cannot run larger apps with GC off, but GC itself introduced lots of variation to catch ~1-2% difference.

Command Mean [s] Min [s] Max [s] Relative
test/gtestp4c-main --gtest_filter=P4CParserUnroll.switch_20160512 2.945 ± 0.032 2.889 3.211 1.01 ± 0.01
test/gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512 2.909 ± 0.027 2.857 2.977 1.00

@fruffy Please let me know if there are any other checks that you'd want me to run