p4lang / p4c

P4_16 reference compiler
https://p4.org/
Apache License 2.0
648 stars 433 forks source link

Reduce p4c compile time #4674

Open asl opened 2 months ago

asl commented 2 months ago

Currently p4c compile time are quite large compared to other compilers / project given the codebase size. Likely this question / issue was already raised before (otherwise, why there are unity builds here), but I was unable to find the corresponding issue.

I tried to check, if there are any low-hanging fruits here. Looks like not, still few cases are interested. Attaches is the time race report as generated by clang. It could be visualized via Chrome tracing backend, or better, via Speedoscope or Perfetto. The file in question is helpers.cpp from gtest testsuite, though similar patterns are everywhere. The file in question has the longest frontend parse time across the codebase (13 seconds for me).

Interesting observations:

There are some expensive template instantiations here

Template sets that took longest to instantiate:

  1277 ms: IR::Vector<$>::visit_children (64 times, avg 19 ms)
   866 ms: IR::Vector<$>::insert<$> (32 times, avg 27 ms)
   848 ms: std::vector<$>::insert<$> (33 times, avg 25 ms)
   749 ms: std::__dispatch_copy_or_move<$> (154 times, avg 4 ms)
   662 ms: std::__unwrap_and_dispatch<$> (215 times, avg 3 ms)
   589 ms: std::map<$> (70 times, avg 8 ms)
   549 ms: P4V1::ProgramStructure::NamedObjectInfo<$>::emplace (18 times, avg 30 ms)
   536 ms: std::map<$>::emplace<$> (54 times, avg 9 ms)
   430 ms: std::copy<$> (48 times, avg 8 ms)
   425 ms: std::__copy<$> (48 times, avg 8 ms)
   408 ms: std::__tree<$>::__emplace_unique<$> (56 times, avg 7 ms)
   400 ms: std::move<$> (72 times, avg 5 ms)
   386 ms: std::__move<$> (72 times, avg 5 ms)
   377 ms: std::pair<$> (150 times, avg 2 ms)
   371 ms: std::__tree<$> (75 times, avg 4 ms)
   294 ms: P4V1::ProgramStructure::NamedObjectInfo<$> (18 times, avg 16 ms)
   286 ms: std::vector<$>::__swap_out_circular_buffer (42 times, avg 6 ms)
   283 ms: RTTI::TypeInfo<$>::dyn_cast<$> (189 times, avg 1 ms)
   272 ms: std::allocator_traits<$> (123 times, avg 2 ms)
   251 ms: std::__tree<$>::__emplace_unique_key_args<$> (35 times, avg 7 ms)
   250 ms: std::__uninitialized_allocator_move_if_noexcept<$> (42 times, avg 5 ms)
   237 ms: std::vector<$>::vector (58 times, avg 4 ms)
   231 ms: IR::Vector<$>::erase (32 times, avg 7 ms)

Here are the stats across the whole codebase:

Templates that took longest to instantiate:

 16023 ms: boost::basic_format<char>::basic_format (455 times, avg 35 ms)
 15929 ms: Util::CompilerBug::CompilerBug<> (440 times, avg 36 ms)
 15903 ms: IR::Vector<IR::Node>::visit_children (866 times, avg 18 ms)
 15844 ms: Util::P4CExceptionBase::P4CExceptionBase<> (440 times, avg 36 ms)
 14351 ms: boost::basic_format<char>::parse (455 times, avg 31 ms)
 13624 ms: IR::Vector<IR::DpdkSelector>::visit_children (866 times, avg 15 ms)
 13551 ms: IR::Vector<IR::DpdkAction>::visit_children (866 times, avg 15 ms)
 13485 ms: IR::Vector<IR::DpdkHeaderType>::visit_children (866 times, avg 15 ms)
 13393 ms: IR::Vector<IR::Declaration_ID>::visit_children (866 times, avg 15 ms)
 13386 ms: IR::Vector<IR::DpdkExternDeclaration>::visit_children (866 times, avg 15 ms)
 13360 ms: IR::Vector<IR::Type_Var>::visit_children (866 times, avg 15 ms)
 13329 ms: IR::Vector<IR::DpdkAsmStatement>::visit_children (866 times, avg 15 ms)
 13308 ms: IR::Vector<IR::Property>::visit_children (866 times, avg 15 ms)
 13214 ms: IR::Vector<IR::DpdkTable>::visit_children (866 times, avg 15 ms)
 13187 ms: IR::Vector<IR::DpdkStructType>::visit_children (866 times, avg 15 ms)
 13063 ms: IR::Vector<IR::Entry>::visit_children (866 times, avg 15 ms)
 13061 ms: IR::Vector<IR::StatOrDecl>::visit_children (866 times, avg 15 ms)
 13013 ms: IR::Vector<IR::DpdkLearner>::visit_children (866 times, avg 15 ms)
 12911 ms: IR::Vector<IR::ParserState>::visit_children (866 times, avg 14 ms)
 12878 ms: IR::Vector<IR::Declaration>::visit_children (866 times, avg 14 ms)
 12838 ms: IR::Vector<IR::Parameter>::visit_children (866 times, avg 14 ms)
 12771 ms: IR::Vector<IR::KeyElement>::visit_children (866 times, avg 14 ms)
 12757 ms: IR::Vector<IR::ActionListElement>::visit_children (866 times, avg 14 ms)
 12731 ms: IR::Vector<IR::NamedExpression>::visit_children (866 times, avg 14 ms)
 12719 ms: IR::Vector<IR::DpdkDeclaration>::visit_children (866 times, avg 14 ms)
 12569 ms: IR::Vector<IR::Method>::visit_children (866 times, avg 14 ms)
 12548 ms: IR::Vector<IR::StructField>::visit_children (866 times, avg 14 ms)
 12354 ms: IR::Vector<IR::SerEnumMember>::visit_children (866 times, avg 14 ms)
 12329 ms: IR::Vector<IR::SelectCase>::visit_children (866 times, avg 14 ms)
 12231 ms: IR::Vector<IR::AnnotationToken>::visit_children (866 times, avg 14 ms)

Template sets that took longest to instantiate:

411170 ms: IR::Vector<$>::visit_children (27718 times, avg 14 ms)
287032 ms: IR::Vector<$>::insert<$> (13866 times, avg 20 ms)
279278 ms: std::vector<$>::insert<$> (14158 times, avg 19 ms)
219502 ms: std::__dispatch_copy_or_move<$> (64652 times, avg 3 ms)
184992 ms: std::__unwrap_and_dispatch<$> (83282 times, avg 2 ms)
127186 ms: std::move<$> (30874 times, avg 4 ms)
121353 ms: std::__move<$> (30850 times, avg 3 ms)
119899 ms: std::copy<$> (19052 times, avg 6 ms)
119576 ms: RTTI::TypeInfo<$>::dyn_cast<$> (84718 times, avg 1 ms)
117637 ms: std::__copy<$> (19047 times, avg 6 ms)
109498 ms: std::vector<$>::__swap_out_circular_buffer (17921 times, avg 6 ms)
 92251 ms: std::__uninitialized_allocator_move_if_noexcept<$> (18086 times, avg 5 ms)
 80784 ms: std::vector<$>::vector (22750 times, avg 3 ms)
 65467 ms: IR::Vector<$> (14722 times, avg 4 ms)
 64696 ms: std::vector<$>::__construct_at_end<$> (24919 times, avg 2 ms)
 64212 ms: ordered_map<$>::ordered_map (4627 times, avg 13 ms)
 62657 ms: IR::Vector<$>::erase (13853 times, avg 4 ms)
 62412 ms: std::pair<$> (31028 times, avg 2 ms)
 62006 ms: std::map<$> (11351 times, avg 5 ms)
 61258 ms: std::vector<$>::erase (13902 times, avg 4 ms)
 59134 ms: std::vector<$> (28239 times, avg 2 ms)
 53943 ms: std::__uninitialized_allocator_copy<$> (23446 times, avg 2 ms)
 52458 ms: Util::CompilerBug::CompilerBug<$> (12253 times, avg 4 ms)
 51391 ms: Util::P4CExceptionBase::P4CExceptionBase<$> (12397 times, avg 4 ms)
 49646 ms: IR::Vector<$>::parallel_visit_children (14010 times, avg 3 ms)
 48877 ms: std::__unwrap_range<$> (37549 times, avg 1 ms)
 48375 ms: std::map<$>::emplace<$> (6672 times, avg 7 ms)
 47952 ms: std::map<$>::map (8962 times, avg 5 ms)
 47945 ms: IR::IndexedVector<$>::visit_children (18185 times, avg 2 ms)
 44757 ms: safe_vector<$> (20185 times, avg 2 ms)

The file with time profile: helpers.cpp.json.gz

vlstill commented 2 months ago

Thanks for this! I haven't looked into this for some time, but from my time at Tofino I remember some investigations (the deeper once mostly lead nowhere because what happened to Tofino) and some conclusions.

30% is ir-generated.h (actually more due to later template instantiations)

Yes, that matches our observations. We were discussing if precompiled headers would be something that could help with this, but I don't think we got to a serious try. Might be worth investigating.

On top of that, I was concerned with build process and linking as that was low-hanging fruit for us, this will likely work for others (but a lot probably use it already):

6% is frontends/common/parseInput.h with the majority of time spent into v1.0 converters. This likely could be refactored

Maybe we should isolate the P4-14 input support and make it optional. Not all backends need it.

Templates that took longest to instantiate:

16023 ms: boost::basic_format::basic_format (455 times, avg 35 ms) 15929 ms: Util::CompilerBug::CompilerBug<> (440 times, avg 36 ms) 15903 ms: IR::Vector::visit_children (866 times, avg 18 ms) [...]

Template sets that took longest to instantiate:

411170 ms: IR::Vector<$>::visit_children (27718 times, avg 14 ms) 287032 ms: IR::Vector<$>::insert<$> (13866 times, avg 20 ms) [...]

Now we have actually a lot of control over many of these. We can add explicit instantiation declarations for selected instances (e.g. Vector and IndexedVector combos used in IR) and instantiate those explicitly in a .cpp file. Similarly for CompilerBug<>. Given the numbers, this could speed up compilation a lot. A lot of the rest are standard library things, so there I don't expect we could do much better (until we can make used of modules in far future).

asl commented 2 months ago

Yes, that matches our observations. We were discussing if precompiled headers would be something that could help with this, but I don't think we got to a serious try. Might be worth investigating.

I think it's mostly because we're having 200+ IR::Node descendants with lots of virtual methods and wide inheritance. Lots of small things that add together...

On top of that, I was concerned with build process and linking as that was low-hanging fruit for us, this will likely work for others (but a lot probably use it already):

Yeah. Though this was ninja build. And I'm using Apple'd ld64 which is reasonable fast :)

A lot of the rest are standard library things, so there I don't expect we could do much better (until we can make used of modules in far future).

Actually, they could be as both Vector and IndexedVector use lots of STL stuff under the hood. But worth investigating more, yes. The whole compiler is just a pile of visit calls :)

asl commented 2 months ago

And this one is interesting:

 15929 ms: Util::CompilerBug::CompilerBug<> (440 times, avg 36 ms)

Actually, every BUG_CHECK leads to such instantiation, depending on the types of arguments. Maybe there should be some other solution here.

vlstill commented 2 months ago

A lot of the error handling code is from C++11 times, we now have more ways of handling variadics, it might be worth refactoring it. I can even see a way when we just take the argument pack and project it either directly to string, or to a type-erased object (more likely, as we still need to handle source info) and then instead of doing the recursion (if we can't get rid of it) based on all the type combinations, we just instantiate it based on all the lengths of error parametrization. Then we could again apply explicit instatiations if needed.

fruffy commented 2 months ago

Thanks for the analysis. Yes, compile time has been a constant pain point. The Tofino compiler is much worse because it adds a lot more IR nodes which blows up compile time and linking even further.

The only other open-source issue I remember is https://github.com/p4lang/p4c/issues/3980, which is concerned with splitting ir-generated.h. We only need few IR nodes most of the time but have to pull this header in every time. The problem is that the visitor infrastructure coupled with the ir-inline.h and ir-tree-macros.h macros seems to depend on everything being in one place. It is really difficult to untangle or change this.

Other than that I have periodically been running IWYU to get some better sense what is being included/used in some classes: https://github.com/p4lang/p4c/pull/3767

5% of compile time is big_int.h that is includes via big_int_utils.h that is included via stringify.h almost everywhere (as it is subsequently included via error_handler.h used by exceptions.h, the latter provides BUG_CHECK, etc). Maybe this could be refactored a bit better. Though, big ints are everywhere, so it will still be here.

We could move the toString to big_int_utils here? big_int is primarily (only?) used for constants. Maybe we can do something there...

A lot of the error handling code is from C++11 times, we now have more ways of handling variadics, it might be worth refactoring it. I can even see a way when we just take the argument pack and project it either directly to string, or to a type-erased object (more likely, as we still need to handle source info) and then instead of doing the recursion (if we can't get rid of it) based on all the type combinations, we just instantiate it based on all the lengths of error parametrization. Then we could again apply explicit instatiations if needed.

I tried some refactoring in https://github.com/p4lang/p4c/pull/3774 once. The challenge is to preserve the source location accurately, which may require some C++20 features. If you wanted to get rid of the macro setup..

asl commented 2 months ago

We could move the toString to big_int_utils here? big_int is primarily (only?) used for constants. Maybe we can do something there...

The problem is header boost dependency. So, moving out will not help a lot :)

vlstill commented 2 months ago

I tried some refactoring in #3774 once. The challenge is to preserve the source location accurately, which may require some C++20 features. If you wanted to get rid of the macro setup..

That would require not only C++20, but more importantly GCC 11... Maybe once Ubuntu 20 goes out of support we can consider updating? Other old popular systems (RHEL 8) already have to depend on non-default GCC now.

EDIT: to get source_location that is... There could be probably some workaround with keeping macros but making the compilation faster.

fruffy commented 2 months ago

The problem is header boost dependency. So, moving out will not help a lot :)

Not sure I follow. If we move out toString for big_int we can at least get rid of the big_int include there. We could also try to forward declare big_int in ir.h. But that may break things.

asl commented 2 months ago

Not sure I follow. If we move out toString for big_int we can at least get rid of the big_int include there. We could also try to forward declare big_int in ir.h. But that may break things.

Right. But big_int is everywhere in the compiler. So it will simply be included in other place – we'd need to parse it in any case somewhere.

fruffy commented 2 months ago

Right. But big_int is everywhere in the compiler. So it will simply be included in other place – we'd need to parse it in any case somewhere.

I am thinking it is not that widely used. Usually only in places where we work with constants. It might be possible to factor it out.

Second, we used to have GMP support actually: https://github.com/p4lang/p4c/pull/3485 But it was removed because of licensing concerns/CI waste. It could be brought back, if there is really a need.

asl commented 2 months ago

I am thinking it is not that widely used. Usually only in places where we work with constants. It might be possible to factor it out.

Ok, maybe good to go then, yes.

vlstill commented 2 months ago

Building a debug version of a downstream compiler with hyperfine -p 'make clean' -m 5 -- make. Using GCC 11 & LLD linker, ccache disabled, on a dedicated machine with enough CPUs and RAM and storage on local SSD.

original:

Benchmark 1: make
  Time (mean ± σ):     414.572 s ±  5.070 s    [User: 9543.740 s, System: 825.331 s]
  Range (min … max):   411.245 s … 423.311 s    5 runs

With explicit intantiations of Vector and IndexedVector for types used with them in ir-generated.h:

Benchmark 1: make
  Time (mean ± σ):     386.579 s ±  5.057 s    [User: 9083.205 s, System: 815.379 s]
  Range (min … max):   382.127 s … 393.961 s    5 runs

... so something like almost 7 % in this case. Not a huge difference, but noticeable for such an easy change I'd say. I'll try a few more template-instantiation tricks and then open a PR with this.

asl commented 2 months ago

@vlstill One step at a time. There is no single place that contributes to the compile time, but 7% improvement is already a lot.

vlstill commented 2 months ago

For release the story is similar. Also I forgot to mention this is a non-unity build. A release build this time, otherwise the same comparison:

# original
  Time (mean ± σ):     588.054 s ±  5.932 s    [User: 12029.151 s, System: 615.580 s]
  Range (min … max):   580.889 s … 593.137 s    5 runs

# Vector + IndexedVector:
  Time (mean ± σ):     533.384 s ±  8.694 s    [User: 11267.462 s, System: 597.603 s]
  Range (min … max):   523.313 s … 544.670 s    5 runs

So little over 9 % for release.

asl commented 2 months ago

I am thinking it is not that widely used. Usually only in places where we work with constants. It might be possible to factor it out.

Well, the thing it: it is a part of IR::Constant. So, essentially it is everywhere we're include IR headers. We may factor it out using pimpl or similar approach though.