rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.45k stars 908 forks source link

[FEA] Move has_nulls functor template argument to member variable in row comparators and row hashers #6952

Closed davidwendt closed 2 years ago

davidwendt commented 3 years ago

Currently, many libcudf column APIs are compiled into two paths -- one handles nulls and the other expects no nulls. The no-nulls path is generally faster because it removes the cost of accessing the validity bitmask. To create these two paths, we generally use a template argument on the functor or function mostly consistently named has_nulls which is usually determined early in the API code logic.

Using a template argument means we can create one source code that includes handling the null logic surrounded by an if(has_nulls) statement. The compiler will then create the two paths for us from the single source. We only need to invoke them individually as appropriate. What's more, the compiler optimizer will actually remove the entire if(has_nulls) statement in the has_nulls==false compiled kernel path.

Example of `has_nulls` templated functor invocation Here is an example of invoking a `has_nulls` templated functor in `cpp/src/stream_compaction/drop_duplicates.cu`: https://github.com/rapidsai/cudf/blob/b45fd4dec4776408dde5ad442bde756c6689a166/cpp/src/stream_compaction/drop_duplicates.cu#L145-L173 Note the only difference between the `if` and `else` clauses is the instantiation of the `row_equality_comparator` with the template parameter. The `row_equality_comparator` is defined in the `cpp/include/cudf/table/row_operators.cuh`: https://github.com/rapidsai/cudf/blob/b45fd4dec4776408dde5ad442bde756c6689a166/cpp/include/cudf/table/row_operators.cuh#L204-L206 The `has_nulls` template argument is propagated the `element_equality_comparator` which has the [`if(has_nulls)`](https://github.com/rapidsai/cudf/blob/b45fd4dec4776408dde5ad442bde756c6689a166/cpp/include/cudf/table/row_operators.cuh#L176-L184) statement. Some of these invocations can be large and somewhat error prone. https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/sort/sort_impl.cuh#L68-L97 https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/hash/hashing.cu#L138-L174 https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/search/search.cu#L110-L142 https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/groupby/sort/group_nunique.cu#L49-L90 Granted, some of these are only a few lines of code and clang formatting perhaps makes them look larger.

All of this means we can create a fast-path for non-null column cases but at the cost of compile time and size since much of the code is duplicated for us by the compiler. In some places, the null handling is a minimal set of instructions compared to the overall kernel code size. Also, in APIs that operate on tables (e.g. sort), a single null in any column in the table will cause the API to execute through the null-handling path for all tables/columns.

The dual compile path effects some of our biggest compile time offenders.

Here are the current top 30 source files ordered by compile time ``` 1265968 CMakeFiles/cudf_base.dir/src/sort/sort.cu.o 1259518 CMakeFiles/cudf_base.dir/src/sort/stable_sort.cu.o 991023 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o 983376 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o 728204 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o 727802 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o 718036 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o 516079 CMakeFiles/cudf_rolling.dir/src/rolling/rolling.cu.o 486964 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o 444853 CMakeFiles/cudf_reductions.dir/src/reductions/scan.cu.o 442416 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o 414210 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o 396110 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o 383931 CMakeFiles/cudf_base.dir/src/strings/sorting/sorting.cu.o 304577 CMakeFiles/cudf_reductions.dir/src/reductions/std.cu.o 300738 CMakeFiles/cudf_base.dir/src/stream_compaction/distinct_count.cu.o 298303 CMakeFiles/cudf_reductions.dir/src/reductions/var.cu.o 268265 CMakeFiles/cudf_base.dir/src/search/search.cu.o 256580 CMakeFiles/cudf_base.dir/src/unary/math_ops.cu.o 247038 CMakeFiles/cudf_base.dir/src/groupby/sort/group_nunique.cu.o 222540 CMakeFiles/cudf_base.dir/src/groupby/sort/sort_helper.cu.o 219502 CMakeFiles/cudf_base.dir/src/quantiles/quantile.cu.o 204364 CMakeFiles/cudf_base.dir/src/sort/rank.cu.o 203926 CMakeFiles/cudf_base.dir/src/stream_compaction/drop_duplicates.cu.o 188213 CMakeFiles/cudf_base.dir/src/sort/is_sorted.cu.o 185831 CMakeFiles/cudf_ast.dir/src/ast/transform.cu.o 182763 CMakeFiles/cudf_base.dir/src/copying/copy.cu.o 166152 CMakeFiles/cudf_base.dir/src/copying/gather.cu.o 157837 CMakeFiles/cudf_partitioning.dir/src/partitioning/partitioning.cu.o 131314 tests/CMakeFiles/cudftestutil.dir/utilities/column_utilities.cu.o ... ``` The first column is time in milliseconds from a ninja trace on my desktop using g++7 and CUDA 11.0. ![has_nulls_baseline](https://user-images.githubusercontent.com/45795991/101641379-ad304e80-39ff-11eb-8a75-8706804f9d01.png)

Ignoring the current re-ballooning of the reduction APIs, the sort and hashing source files are still very slow. These use the templated comparators and hashers defined in the cpp/include/cudf/table/row_comparators.cuh

I propose to move some of the instances of the has_nulls template parameter to a functor member variable or function parameter as appropriate. Overall this means that has_nulls would be checked at runtime instead of compile time but the runtime if-statement only introduces a extra kernel instruction in the has_nulls path. Since the has_nulls value is set early in the logic before the kernel is launched, the kernel should therefore incur no divergence with other threads.

I prototyped moving has_nulls in the row comparators and row hashers on my local machine which required updating about 20 source files. I compared the outputs for gbenchmarks tests for sort, hash, merge, search, and join and found no signficant change in performance. Most of the benchmarks do not include nulls so they were a good measure of the impact of the extra instruction. (I can attach the benchmark results if necessary).

Results Top 30 compile times with `has_nulls` converted to a runtime parameter. ``` 994075 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o 970520 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o 949222 CMakeFiles/cudf_base.dir/src/sort/sort.cu.o 944946 CMakeFiles/cudf_base.dir/src/sort/stable_sort.cu.o 728829 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o 728487 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o 727735 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o 500829 CMakeFiles/cudf_rolling.dir/src/rolling/rolling.cu.o 487375 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o 473684 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o 452647 CMakeFiles/cudf_reductions.dir/src/reductions/scan.cu.o 391825 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o 391768 CMakeFiles/cudf_base.dir/src/strings/sorting/sorting.cu.o 311602 CMakeFiles/cudf_reductions.dir/src/reductions/var.cu.o 290547 CMakeFiles/cudf_reductions.dir/src/reductions/std.cu.o 259011 CMakeFiles/cudf_base.dir/src/unary/math_ops.cu.o 257266 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o 249444 CMakeFiles/cudf_base.dir/src/search/search.cu.o 219493 CMakeFiles/cudf_base.dir/src/stream_compaction/distinct_count.cu.o 214205 CMakeFiles/cudf_base.dir/src/quantiles/quantile.cu.o 185999 CMakeFiles/cudf_ast.dir/src/ast/transform.cu.o 180346 CMakeFiles/cudf_base.dir/src/stream_compaction/drop_duplicates.cu.o 179347 CMakeFiles/cudf_base.dir/src/copying/copy.cu.o 161075 CMakeFiles/cudf_base.dir/src/copying/gather.cu.o 152886 CMakeFiles/cudf_base.dir/src/groupby/sort/group_nunique.cu.o 149715 CMakeFiles/cudf_base.dir/src/sort/rank.cu.o 144347 CMakeFiles/cudf_base.dir/src/groupby/sort/sort_helper.cu.o 143281 CMakeFiles/cudf_partitioning.dir/src/partitioning/partitioning.cu.o 133652 CMakeFiles/cudf_base.dir/src/sort/is_sorted.cu.o 130779 tests/CMakeFiles/cudftestutil.dir/utilities/column_utilities.cu.o ``` ![has_nulls_removed](https://user-images.githubusercontent.com/45795991/101642164-a950fc00-3a00-11eb-9fe6-821dc9b1e0b4.png) The overall improvement in compile time is about 7.5 minutes (30 min -> 22.5 min). The compile size of `libcudf_base.so` is reduced by about 33MB (287MB -> 254MB).

Only cudf::merge showed any significant performance drop but since merge implements its own comparator functors, re-instating it's has_nulls template implementation was not difficult. And merge.cu is not a significant compile time concern (ranked at 46). This means, globally removing this kind of parameter may not be necessary. My proposal is to remove them only from the row comparators and row hashers as appropriate for now.

This is a non-breaking change since it only effects internal functors and kernels functions.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.