gradle / gradle-native

The home of Gradle's support for natively compiled languages
https://blog.gradle.org/introducing-the-new-cpp-plugins
Apache License 2.0

Make C++ incremental compilation faster #302

Open adammurdoch opened 6 years ago

adammurdoch commented 6 years ago

Split out of #279 as an additional piece of work. The performance of the C/C++ dependency analysis is crucial to a good developer experience. It is invoked whenever some source or header file changes and is used not just to determine what to recompile but also which build cache entries to use.

There is plenty of low hanging fruit in this area and we should invest in making this faster.

adammurdoch commented 6 years ago

There are several issues with the current implementation of compilation for C/C++:

  1. All dependency analysis, bookkeeping and up-to-date checks are done twice, once in the depend task and again in the compile task. The same analysis happens in both tasks. The only real difference between them is that the compile task invokes the compiler after analysis and the depend task writes out the dependencies file. This is all wasted work.
  2. There are 2 tasks for each binary, and creating, configuring, scheduling and running 2 tasks is more expensive than doing this for 1 task. Keeping the configuration of the two tasks in sync is also an issue (not addressed yet), as is the complexity of making sure both tasks are defined and having the compile task deal with the possibility that the depend task is not present.
  3. The depend task introduces a concurrency regression, as it does not run concurrently with other tasks in the same project, including the analysis part of the compile task.

I plan to spike a solution where the two tasks are merged back together. When out-of-date, the task would perform the dependency analysis to collect the header dependencies, then delegate to the cache services to look for a matching cache entry, or run the task action if no match is present.
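
A rough sketch of that merged flow, assuming hypothetical analyzer, cache and compiler types (the real types are Gradle internals, not shown here):

import java.io.File;
import java.util.Set;

// Merged task action: one dependency analysis pass feeds both the build
// cache lookup and, if needed, the compiler invocation.
class MergedCompileAction {
    private final HeaderAnalyzer analyzer;   // hypothetical
    private final OutputCache outputCache;   // hypothetical
    private final Compiler compiler;         // hypothetical

    MergedCompileAction(HeaderAnalyzer analyzer, OutputCache outputCache, Compiler compiler) {
        this.analyzer = analyzer;
        this.outputCache = outputCache;
        this.compiler = compiler;
    }

    void execute(Set<File> sourceFiles) {
        // 1. Collect the header dependencies once.
        Set<File> headers = analyzer.resolveTransitiveHeaders(sourceFiles);
        // 2. Look for a matching cache entry based on sources and headers.
        String key = outputCache.computeKey(sourceFiles, headers);
        if (!outputCache.maybeLoad(key)) {
            // 3. No match: run the task action and store the outputs.
            compiler.compile(sourceFiles);
            outputCache.store(key);
        }
    }

    interface HeaderAnalyzer { Set<File> resolveTransitiveHeaders(Set<File> sources); }
    interface OutputCache {
        String computeKey(Set<File> sources, Set<File> headers);
        boolean maybeLoad(String key);
        void store(String key);
    }
    interface Compiler { void compile(Set<File> sources); }
}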

adammurdoch commented 6 years ago

I did up a quick spike which basically just moves the body of the Depend task to an @InputFiles property getter on the compile task, so that the dependency analysis is done while preparing the inputs of the compile task. The spike also removes the depend task from the task graph.
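
As a minimal sketch, the spike is shaped roughly like this, assuming a hypothetical HeaderAnalyzer standing in for the body of the old Depend task (the actual code lives in Gradle's platform-native internals):

import java.io.File;
import java.util.Set;
import java.util.concurrent.Callable;

import org.gradle.api.DefaultTask;
import org.gradle.api.file.FileCollection;
import org.gradle.api.tasks.InputFiles;

// The header analysis now runs while Gradle snapshots the task inputs,
// so a separate depend task is no longer needed in the task graph.
public class SpikedCppCompile extends DefaultTask {
    private final HeaderAnalyzer analyzer = new HeaderAnalyzer(); // hypothetical
    private FileCollection source;

    public void setSource(FileCollection source) {
        this.source = source;
    }

    @InputFiles
    public FileCollection getHeaderDependencies() {
        // Evaluated lazily during up-to-date checking, before any task action.
        return getProject().files((Callable<Set<File>>) () ->
                analyzer.resolveTransitiveHeaders(source.getFiles()));
    }

    // Stand-in for the dependency analysis of the old Depend task.
    static class HeaderAnalyzer {
        Set<File> resolveTransitiveHeaders(Set<File> sources) {
            throw new UnsupportedOperationException("sketch only");
        }
    }
}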

On a test build with 20 projects, 20 source files and 100 header files per project (so a large number of headers given the number of source files), this gives a modest improvement:

Scenario | spike | master | 4.3 | 4.2
--- | --- | --- | --- | ---
up-to-date :linkDebug :linkRelease parallel | 2.27s | 1.33s | 1.07s | 0.876s
clean :linkDebug :linkRelease parallel | 57.1s | 59.3s | 50.7s | 46.8s

The up-to-date case is slower, as the dependency analysis is run to calculate the header input files. This regression is somewhat intentional for this spike; the goal was to make the clean build faster. I plan to take a look at some quick spikes to make the up-to-date checks faster.

adammurdoch commented 6 years ago

Hacking both 4.2 and my spike so that neither traverses any #include directives gives these results on the "many headers" test build above:

Scenario | spike with hack | 4.2 with hack
--- | --- | ---
up-to-date :linkDebug :linkRelease parallel | 0.97s | 0.92s
clean :linkDebug :linkRelease parallel | 50.3s | 50.2s

Basically the same values. While this is not anything we want to do, it does confirm that the performance issues are somewhere in dealing with the header dependency graph. Could be in resolving the includes, finding header files in the longer include path, or just dealing with many more edges and header files in the dependency graphs. Or all of the above. The profiler seems to be saying it's due to finding the files (plus more files to find) but I want to confirm this.

adammurdoch commented 6 years ago

On Windows, we shouldn't actually have a longer include path since 4.2, as we have always discovered the system header directories. This says that either the regression is fixed already (it was only due to multiple tasks doing the same work), or there's a regression which isn't due to a longer include path. I guess I'm going to have to run these builds on Windows...

big-guy commented 6 years ago

> On Windows, we shouldn't actually have a longer include path since 4.2, as we have always discovered the system header directories.

I think that may not be true. Isn't this where we added the system includes? https://github.com/gradle/gradle/blob/REL_4.2.1/subprojects/platform-native/src/main/java/org/gradle/nativeplatform/toolchain/internal/msvcpp/VisualCppPlatformToolProvider.java#L143-L145

I think this isn't seen by the compile task/incremental build because we only add these includes once we get into NativeCompiler and friends.

adammurdoch commented 6 years ago

You're right, the header discovery in 4.2 never saw those system include directories.

adammurdoch commented 6 years ago

I've merged some improvements to master.

The changes are here: https://github.com/gradle/gradle/compare/42958134b46b814ac54a740b784dcd67afae282c...6ae577fa149b030904c8a978c632bb1c100bb0ff

Some results from a build with 20 projects, 100 source files and 50 header files per project on a Linux machine:

Scenario | master | 4.4 | 4.3 | 4.2
--- | --- | --- | --- | ---
ABI change :assemble | 3.40s | 3.55s | 3.54s | 3.24s
no change :assemble | 0.42s | 0.36s | 0.37s | 0.33s
clean :assemble | 55.97s | 58.91s | 58.77s | 53.74s
non-ABI change :assemble | 1.97s | 2.11s | 2.10s | 1.70s

This test build does not use any macro includes to separate that issue out.

The changes bring us closer to 4.2 performance, and there are further improvements that can be made: avoiding probing for header files many times per build, and dealing more efficiently with large (or many) include file dependency graphs.
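
One such improvement could be memoizing the include path search, so that each header name is probed on disk at most once per build; a sketch under that assumption (the names are hypothetical, not Gradle's actual resolver):

import java.io.File;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Caches the result of searching the include path for a given name, so
// repeated #include directives for the same name never hit the file
// system twice within a build.
class MemoizingIncludeResolver {
    private final List<File> includePath;
    private final Map<String, Optional<File>> resolved = new ConcurrentHashMap<>();

    MemoizingIncludeResolver(List<File> includePath) {
        this.includePath = includePath;
    }

    Optional<File> resolve(String includeName) {
        return resolved.computeIfAbsent(includeName, name -> {
            for (File dir : includePath) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return Optional.of(candidate);
                }
            }
            // Negative results are cached too; a missing header is probed once.
            return Optional.empty();
        });
    }
}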

big-guy commented 6 years ago

When this is improved, ping https://github.com/gradle/gradle-native/issues/502

lacasseio commented 6 years ago

I started experimenting with changes. My assessment so far shows that the include analysis seems to be a big part of the bottleneck. Each time we encounter a macro include or an unresolved macro include, we don't cache the result. Since the project contains a very similar include graph multiple times, we reanalyze files only to get mostly the same result. I think we could still cache those results, given that we can base the cache on the visible macros. Basically, we would be saying: given a set of macros X, analyzing file Y should always produce result Z. For files without macro includes, we can base the caching on the file alone: given file Y, we should always get result Z.

The visited file details are also not shared between tasks and projects. We could look at sharing those around.
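
A minimal sketch of the caching scheme described above, with hypothetical Macro, AnalysisResult and Analyzer types standing in for Gradle's internals:

import java.io.File;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Headers without macro includes are keyed by file alone (file Y -> result Z);
// headers with macro includes are keyed by file plus the visible macros
// (macros X + file Y -> result Z), since only those macros can change the result.
class IncludeAnalysisCache {
    private final Map<Key, AnalysisResult> cache = new ConcurrentHashMap<>();

    AnalysisResult analyze(File header, Set<Macro> visibleMacros, Analyzer analyzer) {
        boolean dependsOnMacros = analyzer.hasMacroIncludes(header);
        Key key = new Key(header, dependsOnMacros ? visibleMacros : null);
        return cache.computeIfAbsent(key, k -> analyzer.doAnalyze(header, visibleMacros));
    }

    private static final class Key {
        final File file;
        final Set<Macro> visibleMacros; // null when the result is macro-independent

        Key(File file, Set<Macro> visibleMacros) {
            this.file = file;
            this.visibleMacros = visibleMacros;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return file.equals(other.file) && Objects.equals(visibleMacros, other.visibleMacros);
        }

        @Override
        public int hashCode() {
            return Objects.hash(file, visibleMacros);
        }
    }

    interface Macro {}
    interface AnalysisResult {}
    interface Analyzer {
        boolean hasMacroIncludes(File header);
        AnalysisResult doAnalyze(File header, Set<Macro> visibleMacros);
    }
}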

A quick POC of the previous idea ~~showed an 88% decrease in visitFile samples according to JFR~~ (the number of iterations didn't match between the two runs, so the sampling is useless) and shows a 69% decrease in wall clock compilation time (4.2.1: 512 ms vs. nightly: 14561 ms vs. poc: 4503 ms).

The next couple of updates will try to decompose those changes into more measurable pieces.

lacasseio commented 6 years ago

The POC for sharing the include analysis cache brings the wall clock build time to 13852 ms, vs. 14816 ms with the nightly (here is the flame graph):

[flame graph screenshot]

lacasseio commented 6 years ago

The POC for caching the include analysis details according to the visible macros brings the wall clock build time to 3538 ms, vs. 13865 ms with the nightly (here is the new flame graph):

[flame graph screenshot]

lacasseio commented 6 years ago

I forgot to force macro collection before caching the result, so it's not as impressive: the wall clock time of the previous POC is actually 5085 ms. I'm sure we can improve the implementation, as no effort was made to make it efficient.

[flame graph screenshot]

lacasseio commented 6 years ago

The next bottleneck seems to be during the resolution of macros, specifically around preprocessor libraries such as Boost.Preprocessor. I naively tried arranging the macro lookup as a hash table; however, this slowed down the build by a factor of about 10.

I will see if we can create the macro hash table during parsing instead of during the analysis.
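
The idea, roughly, assuming a hypothetical MacroDefinition type: build the name-to-definitions table once per file at parse time, rather than scanning a flat list of macros on every lookup during analysis:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A name can map to several definitions because a macro may be redefined
// or defined differently under different conditions.
class ParsedFileMacros {
    private final Map<String, List<MacroDefinition>> byName = new HashMap<>();

    // Called by the parser for each #define it encounters.
    void add(MacroDefinition macro) {
        byName.computeIfAbsent(macro.name(), n -> new ArrayList<>()).add(macro);
    }

    // Constant-time lookup during analysis, instead of a linear scan.
    List<MacroDefinition> lookup(String name) {
        return byName.getOrDefault(name, Collections.emptyList());
    }

    interface MacroDefinition {
        String name();
    }
}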

big-guy commented 6 years ago

@lacasseio Is it worth reviewing what you already have so we can get that into master ASAP?

lacasseio commented 6 years ago

I'm getting the numbers for what I just commented on. Spoiler alert: it's another win! I will polish that and get everything ready for review.

lacasseio commented 6 years ago

By creating a hash table during parsing of the macros and macro functions, we get the following numbers. We are now faster than 4.2.1 on a clean assemble, but still a bit slower than 4.2.1 for the assemble scenario (poc: 1617 ms vs. 4.2.1: 633 ms).

[performance screenshot]

adammurdoch commented 6 years ago

I think it would be good to stop further experiments and get whatever improvements we've already discovered into master, rather than piling experiments on top of each other, to lock in some gains. We can then keep repeating the process (measure, spike one change, measure, implement, merge, goto 1) until we're happy with performance.

lacasseio commented 6 years ago

Thanks @adammurdoch for the words of wisdom about splitting the change into measurable units. I did just that and created PR https://github.com/gradle/gradle/pull/4616 containing the most impactful change.

lacasseio commented 6 years ago

PR https://github.com/gradle/gradle/pull/4617 is based on the previous PR and shows the additional changes that allow Gradle to come within ~500ms of the Gradle 4.2.1 performance.

adammurdoch commented 6 years ago

There's a bunch of new test coverage for C++ incremental compile here: https://github.com/gradle/gradle/pull/4652

These can be used to make sure any caching we do isn't making assumptions it shouldn't.

lacasseio commented 6 years ago

I closed the unmerged PR, as my experiments so far have only produced incorrect results or a regression.

lacasseio commented 6 years ago

Instead of looking at caching more results, I decided to look at the root cause of why some headers are flagged with HasMacroInclude (see the sketch after this list for what counts as a macro include). Disclaimer: improving the following scenarios may have no effect on performance, as it only takes one bad header to invalidate an entire chain of includes, which prevents them from being cached. Here are my findings, sorted from most common to least common:

Collateral

These are headers that are flagged as having macro includes only because they happen to include a problematic header further down the chain. For example, a header that merely includes one of the Boost headers shown below is itself flagged.

API Prefix Through Includes

These are headers that contain macro includes like this:

#ifdef BOOST_HAS_ABI_HEADERS
#  include BOOST_ABI_PREFIX
#endif
//...
#ifdef BOOST_HAS_ABI_HEADERS
#  include BOOST_ABI_SUFFIX
#endif

This pattern is usually used when developers want to dynamically configure the namespace of the code in question. In this case, the headers contain templates and are most likely used for some kind of metaprogramming.

Using Preprocessor Library

These are quite tricky, as they rely heavily on the preprocessor behavior in effect during compilation:

#  define BOOST_PP_ITERATION_PARAMS_1 (3,(0,BOOST_FUNCTION_MAX_ARGS,<boost/function/detail/function_iterate.hpp>))
#  include BOOST_PP_ITERATE()
#  undef BOOST_PP_ITERATION_PARAMS_1

Header Hiding

These are headers that try to be smarter than the build tool by creating a level of indirection to fool the analyzer:

// Since this preprocessor path is almost never taken, we hide these header
// dependencies so that build tools don't find them.
//
#define BOOST_REGEX_H1 <boost/thread/once.hpp>
#define BOOST_REGEX_H2 <boost/thread/recursive_mutex.hpp>
#define BOOST_REGEX_H3 <boost/thread/lock_types.hpp>
#include BOOST_REGEX_H1
#include BOOST_REGEX_H2
#include BOOST_REGEX_H3
#undef BOOST_REGEX_H1
#undef BOOST_REGEX_H2
#undef BOOST_REGEX_H3

Optional Headers

These are headers that include a dynamically named header only if the corresponding macro is defined:

#if defined(BOOST_ASIO_CUSTOM_HANDLER_TRACKING)
# include BOOST_ASIO_CUSTOM_HANDLER_TRACKING
#elif defined(BOOST_ASIO_ENABLE_HANDLER_TRACKING)
# include <boost/system/error_code.hpp>
# include <boost/asio/detail/cstdint.hpp>
# include <boost/asio/detail/static_mutex.hpp>
# include <boost/asio/detail/tss_ptr.hpp>
#endif // defined(BOOST_ASIO_ENABLE_HANDLER_TRACKING)

Preprocessor Header Library

These are headers that are part of a preprocessor header library. See Boost.Preprocessor.

Overridable Headers

These are headers that can optionally override the included header via a macro. If the override macro isn't defined, then a locally declared default include is used:

#  ifndef BOOST_REGEX_USER_CONFIG
#     define BOOST_REGEX_USER_CONFIG <boost/regex/user.hpp>
#  endif

#  include BOOST_REGEX_USER_CONFIG

??? (I don't know how to define those)

This doesn't seem to be too common. There was only a single unique hit in our sample project.

#if BOOST_WORKAROUND(__IBMCPP__, BOOST_TESTED_AT(700))
#   define AUX778076_INCLUDE_STRING BOOST_PP_STRINGIZE(boost/mpl/aux_/preprocessed/AUX778076_PREPROCESSED_HEADER)
#   include AUX778076_INCLUDE_STRING
#   undef AUX778076_INCLUDE_STRING
#else
#   include BOOST_PP_STRINGIZE(boost/mpl/aux_/preprocessed/AUX778076_PREPROCESSED_HEADER)
#endif

Unavailable Platform Headers

I'm unsure how these are handled by our resolver code. Some configuration headers use conditionals to include platform headers, and some of those headers may be absent on certain platforms:

// Clang / libc++ detection.
#if defined(__clang__)
# if (__cplusplus >= 201103)
#  if __has_include(<__config>)
#   include <__config>
#   if defined(_LIBCPP_VERSION)
#    define BOOST_ASIO_HAS_CLANG_LIBCXX 1
#   endif // defined(_LIBCPP_VERSION)
#  endif // __has_include(<__config>)
# endif // (__cplusplus >= 201103)
#endif // defined(__clang__)

// Android platform detection.
#if defined(__ANDROID__)
# include <android/api-level.h>
#endif // defined(__ANDROID__)
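
As referenced above, here is a rough sketch of what flags a directive as a macro include in the first place: an argument that is neither a quoted path nor an angle-bracketed path must be a macro (or a macro function call) and therefore cannot be resolved statically. The helper below is hypothetical, not Gradle's actual parser:

// Classifies the argument of an #include directive.
final class IncludeDirectives {
    private IncludeDirectives() {}

    static boolean isMacroInclude(String includeArgument) {
        String arg = includeArgument.trim();
        // #include "foo.h" and #include <foo.h> are statically resolvable;
        // anything else, e.g. #include BOOST_ABI_PREFIX, depends on macros.
        return !(arg.startsWith("\"") || arg.startsWith("<"));
    }
}
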
lacasseio commented 6 years ago

The performance for 4.2.1 with the updated code on https://github.com/lacasseio/boost-asio is:

[performance screenshot]

ThadHouse commented 5 years ago

I think __has_include needs to be included in this work. Incremental compilation doesn't seem to support it either, and since it's standardized, it's extremely awkward not to support it.

lacasseio commented 5 years ago

@ThadHouse Is what you are seeing that Gradle snapshots all the header files, with an info log saying it's reverting to taking all the headers into account?