rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

Compile times are growing significantly #581

Open · jrhemstad opened this issue 6 years ago

jrhemstad commented 6 years ago

Feature request

As anyone who has built libgdf recently has surely noticed, the time to compile the library from scratch has grown significantly in the last few months. For example, compiling on all 10 cores of my i9-7900X @ 3.30GHz takes 11 minutes as reported by time make -j.

Compiling with -ftime-report may be a good place to start to see where all the compilation time is being spent.

This is likely due to the large amount of template instantiation required to instantiate functions for all possible types. We should make sure that template-instantiation best practices are being followed, so that a template for a given type only has to be instantiated once, via explicit instantiation.
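For illustration, here is a minimal sketch of that pattern (the `scale_column` function and file names are hypothetical, not libcudf code): the template is declared in a header, explicit instantiation declarations (`extern template`) tell includers not to instantiate it themselves, and a single translation unit provides the explicit instantiation definitions.

```cpp
// column_op.hpp -- hypothetical header, not actual libcudf code
#pragma once
#include <vector>

template <typename T>
std::vector<T> scale_column(std::vector<T> const& col, T factor);

// Explicit instantiation declarations: includers must not instantiate these
// themselves; the definitions live in exactly one translation unit.
extern template std::vector<int> scale_column<int>(std::vector<int> const&, int);
extern template std::vector<double> scale_column<double>(std::vector<double> const&, double);

// column_op.cpp -- the single TU that pays the instantiation cost
#include "column_op.hpp"

template <typename T>
std::vector<T> scale_column(std::vector<T> const& col, T factor)
{
  std::vector<T> out;
  out.reserve(col.size());
  for (auto const& v : col) { out.push_back(v * factor); }
  return out;
}

// Explicit instantiation definitions: each supported type is compiled once.
template std::vector<int> scale_column<int>(std::vector<int> const&, int);
template std::vector<double> scale_column<double>(std::vector<double> const&, double);
```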

Much of our code is implemented in headers, which causes it to be recompiled in every translation unit that includes those headers. Using pre-compiled headers may help: https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html http://itscompiling.eu/2017/01/12/precompiled-headers-cpp-compilation/
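As a rough sketch of the idea (hypothetical file, not an actual build setup): collect heavy, rarely-changing headers into one aggregate header that the build precompiles once (for example via GCC's PCH mechanism linked above), so each translation unit starts from the precompiled state instead of reparsing them.

```cpp
// pch.hpp -- hypothetical aggregate header intended for precompilation.
// Only headers that are expensive to parse and change rarely belong here;
// frequently changing project headers would defeat the purpose, since any
// change to a precompiled header invalidates the PCH for every TU using it.
#include <algorithm>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Heavy third-party headers (e.g. Thrust) would also be candidates,
// subject to the compiler's PCH restrictions.
```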

Furthermore, code should be scrubbed of excessive and unnecessary #includes. Compiling with -MD will show which files are being included.

Here's a Clang tool that checks that you only include the headers you actually use: https://github.com/include-what-you-use/include-what-you-use

Here's a Clang tool to profile time spent in template instantiation: https://github.com/mikael-s-persson/templight

harrism commented 6 years ago

Do the clang tools work with nvcc? Much of our template time is in nvcc. You can use nvcc --time foo.csv to dump timing for different nvcc phases.

jrhemstad commented 6 years ago

I'm not sure. You can technically get Clang to compile device code, so that may be a path worth exploring using Clang + these tools.

hcho3 commented 4 years ago

Any updates on this? I'd love to use precompiled headers with CUDA projects.

harrism commented 4 years ago

Compile time continues to grow, but that is largely because our supported feature set and supported types continue to grow. In 0.15 we are aiming to add at least 10 new types (4 unsigned integer types, 4 timestamp types, a list column type, and a decimal fixed-point type). Naturally this will increase compile time and binary size.

Meanwhile, in 0.14 we dropped all of the legacy APIs that were previously deprecated, which reduced compile time a bit and significantly reduced binary size. There have been, and will continue to be, various efforts to reduce the compile time of certain components. We are also investigating the possibility of splitting libcudf into multiple libraries.

We have not discussed precompiled headers.

beckernick commented 3 years ago

@jrhemstad @harrism , is this still a relevant issue?

jrhemstad commented 3 years ago

Our compile time is worse than ever, so I guess it's still relevant. We could benefit from someone putting in a concerted effort to eliminate unnecessary includes across the library.

vyasr commented 1 year ago

I'm not sure. You can technically get Clang to compile device code, so that may be a path worth exploring using Clang + these tools.

Out of curiosity I gave this a quick shot. (Unsurprisingly) clang does not currently support the experimental CUDA features that we have enabled (--expt-extended-lambda --expt-relaxed-constexpr) so the compilation terminates pretty quickly. Not sure if there are other downstream issues that we would face if we attempted this after stripping those out (not suggesting that we should, although #7795 remains open so maybe it is worth pursuing).

jrhemstad commented 1 year ago

clang does not currently support the experimental CUDA features that we have enabled (--expt-extended-lambda --expt-relaxed-constexpr)

Pretty sure clang supports those features natively without the need for any extra compile flags. I'm guessing the error was caused by clang not recognizing those flags.

vyasr commented 1 year ago

You're right, it does. I removed those and made some progress, but not nearly enough for a working build with clang yet. Here's a list of necessary changes so far:

At this point I start seeing failures like this:

```
...type_traits:2777:5: error: no type named 'type' in 'std::invoke_result<cudf::detail::indexalator_factory::nullable_index_accessor, int>'
```

and

```
error: type 'thrust::transform_iterator<(lambda at ...gather.cu:49:26), cudf::detail::input_indexalator>::super_t' (aka 'iterator_adaptor<transform_iterator<(lambda at ...gather.cu:49:26), cudf::detail::input_indexalator, thrust::use_default, thrust::use_default>, cudf::detail::input_indexalator, int, thrust::use_default, std::random_access_iterator_tag, int>') is not a direct or virtual base of 'thrust::transform_iterator<(lambda at ...gather.cu:49:26), cudf::detail::input_indexalator>'
```

I need to track this down a bit further, but it looks like some aspect of how thrust SFINAEs different code paths isn't supported in clang device code yet either.

PointKernel commented 1 year ago

I tried to build cuco with clang about a year ago and was blocked by its dependencies, such as thrust and libcudacxx, which could not be built with clang. To gauge how much effort is required to build device code with clang, I would suggest starting with a smaller library like cuco and seeing how it goes from there.

Related issues:

vyasr commented 1 year ago

Well then... looks like we've got to work our way all the way up the stack for this. For the purpose of something like clang-tidy we might be able to get some partial results based on the discussion in https://github.com/rapidsai/raft/pull/424, but that's only partial support at best, and I don't know whether that will work with the other tools of interest like IWYU.

vyasr commented 2 weeks ago

Compile times are an ever-present problem for us. This issue as currently framed isn't clearly actionable, so let's lay out some concrete points.

We should make sure that best practices are being followed in template instantiation

#379 implemented the type_dispatcher, which now controls all our template instantiations and ensures that we have a minimal set.
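For context, here is a simplified sketch of the general pattern (hypothetical code, not the actual type_dispatcher implementation, which covers the full set of cudf types): a single switch maps a runtime type id to a static type and invokes one templated functor, so the complete set of instantiations lives in one place and is easy to audit.

```cpp
// Simplified, hypothetical sketch of runtime-to-static type dispatch;
// not cudf's actual type_dispatcher implementation.
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>

enum class type_id { INT32, FLOAT64 /* ... */ };

template <typename Functor, typename... Args>
decltype(auto) dispatch(type_id id, Functor f, Args&&... args)
{
  switch (id) {
    case type_id::INT32:
      return f.template operator()<std::int32_t>(std::forward<Args>(args)...);
    case type_id::FLOAT64:
      return f.template operator()<double>(std::forward<Args>(args)...);
    default: throw std::logic_error("unsupported type");
  }
}

// Operations are written once as a templated functor; the dispatcher is the
// only code that enumerates types, keeping instantiations minimal.
struct size_of_functor {
  template <typename T>
  std::size_t operator()() const { return sizeof(T); }
};

// Example: std::size_t bytes = dispatch(type_id::FLOAT64, size_of_functor{});
```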

Compiling with -ftime-report may be a good place to start to see where all the compilation time is being spent.

Since #9631 we have been tracking build times in CI. We monitor this and keep an eye on translation units that are particularly slow to compile. Where necessary, we have responded to slow compilation by reorganizing the code and explicitly instantiating templates.

Furthermore, code should be scrubbed of excessive and unnecessary #includes.

This seems like the main remaining action item. As discussed above, include-what-you-use is a good tool for this, so evaluating it and then either adopting or rejecting it is the only thing left to do here. Since getting our code to compile with clang is the bottleneck, I propose that we get our C/C++ source files working first. Once we have a systematic approach in place for those and are regularly checking the results, we can incrementally work up to the CUDA source files, which are a much heavier lift (and at that point we can also split between host and device code in .cu files). This is similar to the approach we are taking in #16958 for clang-tidy.
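To illustrate the kind of change this would drive (a hypothetical header; count_nulls and the file layout are made up for illustration): each file includes exactly the headers whose names it uses and forward-declares types it only refers to, instead of relying on transitive includes or pulling in heavy definition headers.

```cpp
// count_nulls.hpp -- hypothetical header showing an include-clean state.
#pragma once

#include <cstddef>  // std::size_t is named here, so it is included directly
                    // rather than arriving transitively through another header.

// The parameter is only taken by reference and never used by value in this
// header, so a forward declaration replaces including the full definition.
namespace cudf {
class column_view;
}

std::size_t count_nulls(cudf::column_view const& view);
```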