rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Performance regression in cuDF after #14679 #14886

Closed (abellina closed this issue 9 months ago)

abellina commented 9 months ago

We are seeing some really big regressions in our NDS benchmarks after this PR went in: https://github.com/rapidsai/cudf/pull/14679. We verified that reverting this commit fixes the regression.

Platform: 'Spark 2a', Spark version: '3.3.1'
--------------------------------------------------------------------
Name = q14a
Means = 9224.6, 10406.8
Time diff = -1182.199999999999
Speedup = 0.8864011992158974
T-Test (test statistic, p value, df) = -4.761890107264471, 0.0014233443564802676, 8.0
T-Test Confidence Interval = -1754.6949604102729, -609.7050395897251
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = q23a
Means = 17216.8, 18474.4
Time diff = -1257.6000000000022
Speedup = 0.9319274238946866
T-Test (test statistic, p value, df) = -5.767490738726307, 0.00042048826337266765, 8.0
T-Test Confidence Interval = -1760.4236596455162, -754.7763403544881
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = q67
Means = 26479.4, 3621773.8
Time diff = -3595294.4
Speedup = 0.0073111689084503296
T-Test (test statistic, p value, df) = -200.13173437293509, 4.348886771193257e-16, 8.0
T-Test Confidence Interval = -3636720.9322740515, -3553867.8677259483
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = q81
Means = 3188.2, 4205.8
Time diff = -1017.6000000000004
Speedup = 0.7580484093394835
T-Test (test statistic, p value, df) = -6.738351796028844, 0.0001467977875603567, 8.0
T-Test Confidence Interval = -1365.843884979839, -669.3561150201617
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------

When running an nsys trace and looking at the timeline, essentially all of the time is in the groupby, specifically in this for_each call (a simplified sketch of the pattern behind this kernel follows the trace):

void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::for_each_f<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<cudf::groupby::detail::hash::compute_single_pass_aggs_fn<concurrent_unordered_map<int, int, cudf::experimental::row::hash::device_row_hasher<cudf::hashing::detail::default_hash, cudf::nullate::DYNAMIC>, const cudf::experimental::row::equality::device_row_comparator<(bool)0, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, default_allocator<thrust::pair<int, int>>>>, void>>, int>, thrust::cuda_cub::for_each_f<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<cudf::groupby::detail::hash::compute_single_pass_aggs_fn<concurrent_unordered_map<int, int, cudf::experimental::row::hash::device_row_hasher<cudf::hashing::detail::default_hash, cudf::nullate::DYNAMIC>, const cudf::experimental::row::equality::device_row_comparator<(bool)0, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, default_allocator<thrust::pair<int, int>>>>, void>>, int>(T2, T3)
Begins: 82.6093s
Ends: 93.4495s (+10.840 s)
grid:  <<<17966, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 68
Local Memory Per Thread: 0 bytes
Local Memory Total: 155,713,536 bytes
Shared Memory executed: 32,768 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 37.5 %
Launched from thread: 3473426
Latency: ←7.040 μs
Correlation ID: 760769
Stream: Stream 52
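
For context on what this kernel is doing: `compute_single_pass_aggs_fn` is, roughly speaking, cuDF's single-pass hash-aggregation functor, applied once per input row over a `concurrent_unordered_map`; each thread probes the hash table with its row's group key and atomically folds the row's value into the matching slot. The block below is a heavily simplified, hypothetical sketch of that pattern (toy hash, `int` keys, sum only), not cuDF's actual implementation:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

#include <cstdint>

// One slot of a toy open-addressing hash table; key == -1 marks an empty slot.
// The table must be pre-initialized to {-1, 0} before the kernel runs.
struct slot {
  int key;
  unsigned long long sum;
};

// Simplified stand-in for the single-pass aggregation functor seen in the trace:
// one invocation per input row, linear probing, atomic claim of the key slot and
// atomic accumulation of the value.
struct single_pass_sum_fn {
  slot* table;             // device-resident hash table
  int capacity;            // number of slots in the table
  int const* keys;         // group key for each row
  uint32_t const* values;  // value to be summed for each row

  __device__ void operator()(int row) const
  {
    int const k = keys[row];
    int idx     = (k % capacity + capacity) % capacity;  // toy hash: key modulo capacity
    while (true) {
      int const prev = atomicCAS(&table[idx].key, -1, k);
      if (prev == -1 || prev == k) {  // claimed an empty slot, or found this row's group
        atomicAdd(&table[idx].sum, static_cast<unsigned long long>(values[row]));
        return;
      }
      idx = (idx + 1) % capacity;  // collision: move to the next slot
    }
  }
};

// Launch shape mirrors the trace: a thrust::for_each over a counting iterator,
// i.e. one functor invocation per input row.
void run_single_pass_sum(
  slot* table, int capacity, int const* keys, uint32_t const* values, int num_rows)
{
  thrust::for_each(thrust::device,
                   thrust::counting_iterator<int>(0),
                   thrust::counting_iterator<int>(num_rows),
                   single_pass_sum_fn{table, capacity, keys, values});
}
```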
GregoryKimball commented 9 months ago

@SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?

abellina commented 9 months ago

These look to be simple sum aggregations on either uint64_t or int64_t columns. The actual Spark type is a decimal, which we chunk into 4 components to perform overflow checking, but cuDF is operating only on uint64_t/int64_t here.
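
To make the shape of that work concrete, here is a rough, hypothetical sketch of how several such chunk columns could be summed in a single libcudf groupby call; the column layout (keys in column 0, four uint64_t chunks in columns 1..4) is made up for illustration and this is not the spark-rapids plugin's actual code:

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>

#include <utility>
#include <vector>

// Build one sum aggregation request per "chunk" column of the (hypothetical) table `t`,
// assuming column 0 holds the group keys and columns 1..4 hold the uint64_t chunks.
std::vector<cudf::groupby::aggregation_request> make_chunk_sum_requests(cudf::table_view const& t)
{
  std::vector<cudf::groupby::aggregation_request> requests;
  for (int col = 1; col <= 4; ++col) {
    cudf::groupby::aggregation_request req;
    req.values = t.column(col);
    req.aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());
    requests.push_back(std::move(req));
  }
  return requests;
}

// The requests are then handed to a hash groupby keyed on the first column, e.g.:
//   cudf::groupby::groupby grouper(cudf::table_view({t.column(0)}));
//   auto requests = make_chunk_sum_requests(t);
//   auto [group_keys, results] = grouper.aggregate(requests);
```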

SurajAralihalli commented 9 months ago

> @SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?

I don't observe a significant variance in the groupby nvbenchmarks. I'll investigate further.

# groupby_nunique

## [0] NVIDIA RTX A5000

|  T  |  num_rows  |  null_probability  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-----|------------|--------------------|------------|-------------|------------|-------------|-------------|---------|----------|
| I32 |    2^12    |         0          | 324.202 us |       1.27% | 323.106 us |       1.28% |   -1.097 us |  -0.34% |   PASS   |
| I32 |    2^16    |         0          | 493.699 us |       0.89% | 500.726 us |       0.98% |    7.027 us |   1.42% |   FAIL   |
| I32 |    2^20    |         0          |   2.859 ms |       0.77% |   2.861 ms |       0.80% |    2.629 us |   0.09% |   PASS   |
| I32 |    2^24    |         0          |  54.140 ms |       0.41% |  54.183 ms |       0.46% |   43.101 us |   0.08% |   PASS   |
| I32 |    2^12    |        0.5         | 401.505 us |       1.82% | 402.353 us |       2.01% |    0.848 us |   0.21% |   PASS   |
| I32 |    2^16    |        0.5         | 634.529 us |       0.70% | 642.765 us |       0.79% |    8.236 us |   1.30% |   FAIL   |
| I32 |    2^20    |        0.5         |   2.960 ms |       1.20% |   2.953 ms |       1.39% |   -7.593 us |  -0.26% |   PASS   |
| I32 |    2^24    |        0.5         |  47.267 ms |       0.39% |  47.113 ms |       0.22% | -153.787 us |  -0.33% |   FAIL   |
| I64 |    2^12    |         0          | 396.544 us |       2.11% | 393.066 us |       2.35% |   -3.478 us |  -0.88% |   PASS   |
| I64 |    2^16    |         0          | 574.812 us |       1.03% | 570.253 us |       0.59% |   -4.559 us |  -0.79% |   FAIL   |
| I64 |    2^20    |         0          |   3.691 ms |       0.88% |   3.691 ms |       0.80% |   -0.588 us |  -0.02% |   PASS   |
| I64 |    2^24    |         0          |  69.679 ms |       0.14% |  69.604 ms |       0.16% |  -74.566 us |  -0.11% |   PASS   |
| I64 |    2^12    |        0.5         | 453.885 us |       1.54% | 453.695 us |       1.51% |   -0.190 us |  -0.04% |   PASS   |
| I64 |    2^16    |        0.5         | 695.165 us |       0.67% | 693.459 us |       0.67% |   -1.706 us |  -0.25% |   PASS   |
| I64 |    2^20    |        0.5         |   3.253 ms |       1.06% |   3.259 ms |       1.19% |    6.656 us |   0.20% |   PASS   |
| I64 |    2^24    |        0.5         |  52.952 ms |       0.27% |  52.870 ms |       0.25% |  -82.386 us |  -0.16% |   PASS   |

# groupby_rank

## [0] NVIDIA RTX A5000

|  rank_method  |  data_size  |  is_sorted  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------------|-------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
|    AVERAGE    |   1000000   |      0      |   3.407 ms |       0.77% |   3.413 ms |       1.22% |    6.699 us |   0.20% |   PASS   |
|    AVERAGE    |  10000000   |      0      |  30.715 ms |       0.23% |  30.674 ms |       0.33% |  -41.020 us |  -0.13% |   PASS   |
|    AVERAGE    |  100000000  |      0      | 349.385 ms |       0.06% | 348.759 ms |       0.10% | -626.409 us |  -0.18% |   FAIL   |
|    AVERAGE    |   1000000   |      1      | 312.883 us |       1.22% | 311.793 us |       1.30% |   -1.089 us |  -0.35% |   PASS   |
|    AVERAGE    |  10000000   |      1      |   2.151 ms |       0.21% |   2.150 ms |       0.24% |   -0.593 us |  -0.03% |   PASS   |
|    AVERAGE    |  100000000  |      1      |  20.555 ms |       0.05% |  20.559 ms |       0.04% |    3.891 us |   0.02% |   PASS   |
|     DENSE     |   1000000   |      0      |   3.275 ms |       0.77% |   3.289 ms |       1.21% |   13.529 us |   0.41% |   PASS   |
|     DENSE     |  10000000   |      0      |  29.757 ms |       0.28% |  29.817 ms |       0.19% |   60.778 us |   0.20% |   FAIL   |
|     DENSE     |  100000000  |      0      | 338.734 ms |       0.09% | 338.521 ms |       0.10% | -212.527 us |  -0.06% |   PASS   |
|     DENSE     |   1000000   |      1      | 166.441 us |       1.81% | 166.363 us |       1.80% |   -0.078 us |  -0.05% |   PASS   |
|     DENSE     |  10000000   |      1      |   1.166 ms |       0.32% |   1.165 ms |       0.25% |   -1.340 us |  -0.11% |   PASS   |
|     DENSE     |  100000000  |      1      |  11.080 ms |       0.05% |  11.080 ms |       0.06% |   -0.089 us |  -0.00% |   PASS   |
|     FIRST     |   1000000   |      0      |   3.223 ms |       0.73% |   3.229 ms |       1.13% |    6.633 us |   0.21% |   PASS   |
|     FIRST     |  10000000   |      0      |  29.518 ms |       0.34% |  29.534 ms |       0.32% |   16.806 us |   0.06% |   PASS   |
|     FIRST     |  100000000  |      0      | 335.483 ms |       0.12% | 334.901 ms |       0.15% | -581.726 us |  -0.17% |   FAIL   |
|     FIRST     |   1000000   |      1      | 106.754 us |       1.95% | 106.871 us |       2.49% |    0.117 us |   0.11% |   PASS   |
|     FIRST     |  10000000   |      1      | 791.476 us |       0.29% | 791.192 us |       0.31% |   -0.284 us |  -0.04% |   PASS   |
|     FIRST     |  100000000  |      1      |   7.571 ms |       0.07% |   7.571 ms |       0.09% |   -0.015 us |  -0.00% |   PASS   |
|      MAX      |   1000000   |      0      |   3.291 ms |       0.85% |   3.302 ms |       1.27% |   10.985 us |   0.33% |   PASS   |
|      MAX      |  10000000   |      0      |  29.963 ms |       0.26% |  30.076 ms |       0.27% |  112.820 us |   0.38% |   FAIL   |
|      MAX      |  100000000  |      0      | 340.774 ms |       0.18% | 340.540 ms |       0.17% | -233.937 us |  -0.07% |   PASS   |
|      MAX      |   1000000   |      1      | 161.971 us |       2.36% | 161.766 us |       2.29% |   -0.204 us |  -0.13% |   PASS   |
|      MAX      |  10000000   |      1      |   1.168 ms |       0.31% |   1.167 ms |       0.50% |   -0.949 us |  -0.08% |   PASS   |
|      MAX      |  100000000  |      1      |  11.191 ms |       0.07% |  11.192 ms |       0.09% |    0.956 us |   0.01% |   PASS   |
|      MIN      |   1000000   |      0      |   3.300 ms |       0.80% |   3.306 ms |       1.21% |    5.953 us |   0.18% |   PASS   |
|      MIN      |  10000000   |      0      |  30.032 ms |       0.30% |  30.058 ms |       0.22% |   26.444 us |   0.09% |   PASS   |
|      MIN      |  100000000  |      0      | 341.492 ms |       0.15% | 340.956 ms |       0.21% | -535.645 us |  -0.16% |   FAIL   |
|      MIN      |   1000000   |      1      | 167.073 us |       1.89% | 165.808 us |       1.89% |   -1.265 us |  -0.76% |   PASS   |
|      MIN      |  10000000   |      1      |   1.167 ms |       0.25% |   1.167 ms |       0.39% |    0.468 us |   0.04% |   PASS   |
|      MIN      |  100000000  |      1      |  11.087 ms |       0.07% |  11.085 ms |       0.07% |   -2.026 us |  -0.02% |   PASS   |

# Summary

- Total Matches: 100
  - Pass    (diff <= min_noise): 91
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  9
jlowe commented 9 months ago

I have a repro case outside of Spark just using the cudf Java APIs where an aggregation call is taking over 15 seconds. I'll see if I can get this boiled down to a pure C++ repro.

jlowe commented 9 months ago

For reference, here are the repro steps I have using just the cudf Java APIs (still working on the C++-only repro). Attached is a Parquet file with the data used in this test, which consists of two columns: a string column and a UINT32 column.

```java
import ai.rapids.cudf.*;

Table t = Table.readParquet(new java.io.File("testdata.parquet"));
Table.GroupByOperation g = t.groupBy(0);
Table result = g.aggregate(GroupByAggregation.sum().onColumn(1));
```

testdata.parquet.gz

The test simply groups by the string column and does a sum aggregation on the UINT32 column. It takes a few milliseconds to run this before #14679 and over 15 seconds afterwards.

jlowe commented 9 months ago

Finally got the C++ repro for this. This program runs sub-second before #14679 and takes tens of seconds afterwards:

```cpp
#include <cudf/groupby.hpp>
#include <cudf/io/parquet.hpp>

int main(int argc, char** argv) {
  auto read_opts = cudf::io::parquet_reader_options_builder(cudf::io::source_info{"./testdata.parquet"}).build();
  auto read_result = cudf::io::read_parquet(read_opts);
  auto t = read_result.tbl->view();
  cudf::groupby::groupby grouper(cudf::table_view({t.column(0)}), cudf::null_policy::INCLUDE, cudf::sorted::NO);
  std::vector<cudf::groupby::aggregation_request> requests;
  requests.emplace_back(cudf::groupby::aggregation_request());
  requests[0].values = t.column(1);
  requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto result = grouper.aggregate(requests);
  return 0;
}
```
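
When timing this repro, one way to separate the aggregate call from the Parquet read is to bracket it with a device synchronization and a host-side timer. A minimal sketch, assuming the same `grouper` and `requests` as in the program above (the helper name is made up; `cudaDeviceSynchronize` and `std::chrono` are standard):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

#include <cuda_runtime.h>
#include <cudf/groupby.hpp>

// Hypothetical helper: time a single groupby::aggregate call, synchronizing before and
// after so the measurement covers only the aggregation work.
void time_aggregate(cudf::groupby::groupby& grouper,
                    std::vector<cudf::groupby::aggregation_request> const& requests)
{
  cudaDeviceSynchronize();  // make sure prior work (e.g. the Parquet read) has finished
  auto const start  = std::chrono::steady_clock::now();
  auto const result = grouper.aggregate(requests);
  cudaDeviceSynchronize();  // wait for the aggregation kernels to complete
  auto const stop   = std::chrono::steady_clock::now();
  std::printf("%d groups, aggregate took %lld ms\n",
              result.first->num_rows(),
              static_cast<long long>(
                std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()));
}
```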
GregoryKimball commented 9 months ago

Thank you @jlowe for sharing this clear repro. Would you please share a bit about the contents of the parquet input file?

I pulled together a snapshot of our recent microbenchmarks and they do not show signs of a large regression. There must be a case we are missing.

[image: snapshot of recent groupby microbenchmark results]

jlowe commented 9 months ago

I didn't closely examine the contents of the parquet file. This is just data captured in the middle of query execution from an NDS run at scale factor 100g, just before a long aggregation call. The original data had 9 grouping columns and 5 aggregation columns, but I was still able to reproduce the issue with just one grouping column and one aggregation column. I did not try reducing the number of rows. It may not be sensitive to the data at all, but I wanted to post what I had so far.

jlowe commented 9 months ago

Also note that the file is attached at this comment: https://github.com/rapidsai/cudf/issues/14886#issuecomment-1911057728 so anyone can download the data, compile the test program, and reproduce the results.

SurajAralihalli commented 9 months ago

Thanks Jason, the performance regression can be seen in the bench_groupby_nvsum2 benchmark for uint32 and uint64 types, without the Parquet read. Once the issue is resolved, we can incorporate this benchmark into the cuDF benchmarks. (A rough sketch of what such a benchmark could look like follows the table below.)

|  T  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |       %Diff |  Status  |
|-----|------------|-------------|------------|-------------|------------|-------------|----------|
| I32 |   1.640 ms |       2.89% |   1.632 ms |       6.12% |  -7.969 us |      -0.49% |   PASS   |
| U32 |   1.630 ms |       1.82% |   17.867 s |        inf% |   17.866 s | 1096222.72% |   FAIL   |
| I64 |   1.763 ms |      24.22% |   1.678 ms |      14.77% | -84.945 us |      -4.82% |   PASS   |
| U64 |   1.715 ms |      20.62% |   17.942 s |        inf% |   17.941 s | 1046062.96% |   FAIL   |
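
Since `bench_groupby_nvsum2` is not in the cuDF tree yet, here is a rough, hypothetical sketch of what a standalone nvbench benchmark along these lines could look like. The structure mimics cuDF's nvbench-based benchmarks, but the benchmark name, row count, key cardinality, and the use of the test column wrapper for input generation are all placeholder assumptions; this is not the code that produced the numbers above:

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <nvbench/nvbench.cuh>

#include <cstdint>
#include <vector>

// Hypothetical groupby-sum benchmark over fixed-width types, with no Parquet read involved.
template <typename T>
static void bench_groupby_sum(nvbench::state& state, nvbench::type_list<T>)
{
  constexpr int num_rows   = 10'000'000;  // placeholder input size
  constexpr int num_groups = 100;         // placeholder key cardinality

  // Host-side input: keys cycle through num_groups, values are a simple ramp.
  std::vector<int32_t> h_keys(num_rows);
  std::vector<T> h_vals(num_rows);
  for (int i = 0; i < num_rows; ++i) {
    h_keys[i] = i % num_groups;
    h_vals[i] = static_cast<T>(i % 1000);
  }
  cudf::test::fixed_width_column_wrapper<int32_t> keys(h_keys.begin(), h_keys.end());
  cudf::test::fixed_width_column_wrapper<T> vals(h_vals.begin(), h_vals.end());

  cudf::groupby::groupby grouper(cudf::table_view({keys}));

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch&) {
    std::vector<cudf::groupby::aggregation_request> requests;
    requests.emplace_back();
    requests[0].values = vals;
    requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());
    auto result = grouper.aggregate(requests);
  });
}

NVBENCH_BENCH_TYPES(bench_groupby_sum,
                    NVBENCH_TYPE_AXES(nvbench::type_list<int32_t, uint32_t, int64_t, uint64_t>))
  .set_name("groupby_sum_no_io")
  .set_type_axes_names({"T"});
```

Built against libcudf, nvbench, and cuDF's test utilities (for the column wrappers), each entry of the type axis would get its own run, which is roughly the shape of the comparison table above.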
GregoryKimball commented 9 months ago

I believe this is safe to close