@SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?
These look to be simple sum aggregations on either uint64_t or int64_t columns. The actual Spark type is a decimal, which we chunk into 4 components to perform overflow checking, but cuDF is operating on uint64_t/int64_t only here.
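For anyone skimming this later, here is a rough illustration of that chunking scheme. This is only a sketch of the idea described above, not the plugin's actual code, and the struct/function names are made up: the 128-bit unscaled decimal value is split into four 32-bit chunks, the low three zero-extended to uint64_t and the top chunk sign-extended to int64_t, so each chunk column can be summed by cuDF as an ordinary 64-bit sum; the per-chunk sums are recombined afterwards (the overflow check itself is omitted here).

```cpp
// Illustrative sketch only, not the spark-rapids plugin's actual code.
#include <cassert>
#include <cstdint>

struct chunks128 {
  uint64_t c0, c1, c2;  // bits [0,32), [32,64), [64,96), zero-extended
  int64_t c3;           // bits [96,128), sign-extended
};

chunks128 split(__int128 v)
{
  return {static_cast<uint32_t>(v),
          static_cast<uint32_t>(v >> 32),
          static_cast<uint32_t>(v >> 64),
          static_cast<int32_t>(v >> 96)};
}

__int128 reassemble(chunks128 const& s)
{
  // Recombine per-chunk (partial) sums; the arithmetic is done unsigned so the
  // carries are well-defined, then reinterpreted as a signed 128-bit value.
  auto total = static_cast<unsigned __int128>(static_cast<__int128>(s.c3)) << 96;
  total += static_cast<unsigned __int128>(s.c2) << 64;
  total += static_cast<unsigned __int128>(s.c1) << 32;
  total += s.c0;
  return static_cast<__int128>(total);
}

int main()
{
  __int128 v = -(__int128{123456789} << 70);  // some decimal's 128-bit unscaled value
  assert(reassemble(split(v)) == v);          // split/recombine round-trips
  return 0;
}
```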
> @SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?
I don't observe a significant variance in the groupby nvbenchmarks. I'll investigate further.
# groupby_nunique
## [0] NVIDIA RTX A5000
| T | num_rows | null_probability | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----|------------|--------------------|------------|-------------|------------|-------------|-------------|---------|----------|
| I32 | 2^12 | 0 | 324.202 us | 1.27% | 323.106 us | 1.28% | -1.097 us | -0.34% | PASS |
| I32 | 2^16 | 0 | 493.699 us | 0.89% | 500.726 us | 0.98% | 7.027 us | 1.42% | FAIL |
| I32 | 2^20 | 0 | 2.859 ms | 0.77% | 2.861 ms | 0.80% | 2.629 us | 0.09% | PASS |
| I32 | 2^24 | 0 | 54.140 ms | 0.41% | 54.183 ms | 0.46% | 43.101 us | 0.08% | PASS |
| I32 | 2^12 | 0.5 | 401.505 us | 1.82% | 402.353 us | 2.01% | 0.848 us | 0.21% | PASS |
| I32 | 2^16 | 0.5 | 634.529 us | 0.70% | 642.765 us | 0.79% | 8.236 us | 1.30% | FAIL |
| I32 | 2^20 | 0.5 | 2.960 ms | 1.20% | 2.953 ms | 1.39% | -7.593 us | -0.26% | PASS |
| I32 | 2^24 | 0.5 | 47.267 ms | 0.39% | 47.113 ms | 0.22% | -153.787 us | -0.33% | FAIL |
| I64 | 2^12 | 0 | 396.544 us | 2.11% | 393.066 us | 2.35% | -3.478 us | -0.88% | PASS |
| I64 | 2^16 | 0 | 574.812 us | 1.03% | 570.253 us | 0.59% | -4.559 us | -0.79% | FAIL |
| I64 | 2^20 | 0 | 3.691 ms | 0.88% | 3.691 ms | 0.80% | -0.588 us | -0.02% | PASS |
| I64 | 2^24 | 0 | 69.679 ms | 0.14% | 69.604 ms | 0.16% | -74.566 us | -0.11% | PASS |
| I64 | 2^12 | 0.5 | 453.885 us | 1.54% | 453.695 us | 1.51% | -0.190 us | -0.04% | PASS |
| I64 | 2^16 | 0.5 | 695.165 us | 0.67% | 693.459 us | 0.67% | -1.706 us | -0.25% | PASS |
| I64 | 2^20 | 0.5 | 3.253 ms | 1.06% | 3.259 ms | 1.19% | 6.656 us | 0.20% | PASS |
| I64 | 2^24 | 0.5 | 52.952 ms | 0.27% | 52.870 ms | 0.25% | -82.386 us | -0.16% | PASS |
# groupby_rank
## [0] NVIDIA RTX A5000
| rank_method | data_size | is_sorted | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------------|-------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
| AVERAGE | 1000000 | 0 | 3.407 ms | 0.77% | 3.413 ms | 1.22% | 6.699 us | 0.20% | PASS |
| AVERAGE | 10000000 | 0 | 30.715 ms | 0.23% | 30.674 ms | 0.33% | -41.020 us | -0.13% | PASS |
| AVERAGE | 100000000 | 0 | 349.385 ms | 0.06% | 348.759 ms | 0.10% | -626.409 us | -0.18% | FAIL |
| AVERAGE | 1000000 | 1 | 312.883 us | 1.22% | 311.793 us | 1.30% | -1.089 us | -0.35% | PASS |
| AVERAGE | 10000000 | 1 | 2.151 ms | 0.21% | 2.150 ms | 0.24% | -0.593 us | -0.03% | PASS |
| AVERAGE | 100000000 | 1 | 20.555 ms | 0.05% | 20.559 ms | 0.04% | 3.891 us | 0.02% | PASS |
| DENSE | 1000000 | 0 | 3.275 ms | 0.77% | 3.289 ms | 1.21% | 13.529 us | 0.41% | PASS |
| DENSE | 10000000 | 0 | 29.757 ms | 0.28% | 29.817 ms | 0.19% | 60.778 us | 0.20% | FAIL |
| DENSE | 100000000 | 0 | 338.734 ms | 0.09% | 338.521 ms | 0.10% | -212.527 us | -0.06% | PASS |
| DENSE | 1000000 | 1 | 166.441 us | 1.81% | 166.363 us | 1.80% | -0.078 us | -0.05% | PASS |
| DENSE | 10000000 | 1 | 1.166 ms | 0.32% | 1.165 ms | 0.25% | -1.340 us | -0.11% | PASS |
| DENSE | 100000000 | 1 | 11.080 ms | 0.05% | 11.080 ms | 0.06% | -0.089 us | -0.00% | PASS |
| FIRST | 1000000 | 0 | 3.223 ms | 0.73% | 3.229 ms | 1.13% | 6.633 us | 0.21% | PASS |
| FIRST | 10000000 | 0 | 29.518 ms | 0.34% | 29.534 ms | 0.32% | 16.806 us | 0.06% | PASS |
| FIRST | 100000000 | 0 | 335.483 ms | 0.12% | 334.901 ms | 0.15% | -581.726 us | -0.17% | FAIL |
| FIRST | 1000000 | 1 | 106.754 us | 1.95% | 106.871 us | 2.49% | 0.117 us | 0.11% | PASS |
| FIRST | 10000000 | 1 | 791.476 us | 0.29% | 791.192 us | 0.31% | -0.284 us | -0.04% | PASS |
| FIRST | 100000000 | 1 | 7.571 ms | 0.07% | 7.571 ms | 0.09% | -0.015 us | -0.00% | PASS |
| MAX | 1000000 | 0 | 3.291 ms | 0.85% | 3.302 ms | 1.27% | 10.985 us | 0.33% | PASS |
| MAX | 10000000 | 0 | 29.963 ms | 0.26% | 30.076 ms | 0.27% | 112.820 us | 0.38% | FAIL |
| MAX | 100000000 | 0 | 340.774 ms | 0.18% | 340.540 ms | 0.17% | -233.937 us | -0.07% | PASS |
| MAX | 1000000 | 1 | 161.971 us | 2.36% | 161.766 us | 2.29% | -0.204 us | -0.13% | PASS |
| MAX | 10000000 | 1 | 1.168 ms | 0.31% | 1.167 ms | 0.50% | -0.949 us | -0.08% | PASS |
| MAX | 100000000 | 1 | 11.191 ms | 0.07% | 11.192 ms | 0.09% | 0.956 us | 0.01% | PASS |
| MIN | 1000000 | 0 | 3.300 ms | 0.80% | 3.306 ms | 1.21% | 5.953 us | 0.18% | PASS |
| MIN | 10000000 | 0 | 30.032 ms | 0.30% | 30.058 ms | 0.22% | 26.444 us | 0.09% | PASS |
| MIN | 100000000 | 0 | 341.492 ms | 0.15% | 340.956 ms | 0.21% | -535.645 us | -0.16% | FAIL |
| MIN | 1000000 | 1 | 167.073 us | 1.89% | 165.808 us | 1.89% | -1.265 us | -0.76% | PASS |
| MIN | 10000000 | 1 | 1.167 ms | 0.25% | 1.167 ms | 0.39% | 0.468 us | 0.04% | PASS |
| MIN | 100000000 | 1 | 11.087 ms | 0.07% | 11.085 ms | 0.07% | -2.026 us | -0.02% | PASS |
# Summary
- Total Matches: 100
- Pass (diff <= min_noise): 91
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 9
I have a repro case outside of Spark just using the cudf Java APIs where an aggregation call is taking over 15 seconds. I'll see if I can get this boiled down to a pure C++ repro.
For reference, here are the repro steps that I have using just the cudf Java APIs (still working on the C++-only repro). Attached is a Parquet file with the data used in this test, which consists of two columns: a string column and a UINT32 column.
```java
import ai.rapids.cudf.*;

Table t = Table.readParquet(new java.io.File("testdata.parquet"));
Table.GroupByOperation g = t.groupBy(0);
Table result = g.aggregate(GroupByAggregation.sum().onColumn(1));
```
The test simply groups by the string column and does a sum aggregation on the UINT32 column. It takes a few milliseconds to run this before #14679 and over 15 seconds afterwards.
Finally got the C++ repro for this. This program runs sub-second before #14679 and takes tens of seconds afterwards:
```cpp
#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <vector>

int main(int argc, char** argv)
{
  // Read the attached testdata.parquet (string key column, UINT32 value column).
  auto read_opts =
    cudf::io::parquet_reader_options_builder(cudf::io::source_info{"./testdata.parquet"}).build();
  auto read_result = cudf::io::read_parquet(read_opts);
  auto t           = read_result.tbl->view();

  // Group by the string column and request a SUM of the UINT32 column.
  cudf::groupby::groupby grouper(
    cudf::table_view({t.column(0)}), cudf::null_policy::INCLUDE, cudf::sorted::NO);
  std::vector<cudf::groupby::aggregation_request> requests;
  requests.emplace_back(cudf::groupby::aggregation_request());
  requests[0].values = t.column(1);
  requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto result = grouper.aggregate(requests);
  return 0;
}
```
Thank you @jlowe for sharing this clear repro. Would you please share a bit about the contents of the parquet input file?
I pulled together a snapshot of our recent microbenchmarks and they do not show any sign of a large regression. There must be a case we are missing.
I didn't closely examine the contents of the Parquet file. This is just data captured in the middle of query execution from an NDS run at scale factor 100g, just before a long aggregation call. The original data had 9 grouping columns and 5 aggregation columns, but I was still able to reproduce the slowdown with just one grouping column and one aggregation column. I did not try reducing the number of rows. It may not be sensitive to the data at all, but I wanted to post what I had so far.
Also note that the file is attached at this comment: https://github.com/rapidsai/cudf/issues/14886#issuecomment-1911057728 so anyone can download the data, compile the test program, and reproduce the results.
Thanks Jason, the performance regression can be seen in the `bench_groupby_nvsum2` benchmark for `uint32` and `uint64` types, without any Parquet read involved. Once the regression is resolved, we can incorporate this benchmark into the cudf benchmarks.
| T | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----|------------|-------------|------------|-------------|------------|-------------|----------|
| I32 | 1.640 ms | 2.89% | 1.632 ms | 6.12% | -7.969 us | -0.49% | PASS |
| U32 | 1.630 ms | 1.82% | 17.867 s | inf% | 17.866 s | 1096222.72% | FAIL |
| I64 | 1.763 ms | 24.22% | 1.678 ms | 14.77% | -84.945 us | -4.82% | PASS |
| U64 | 1.715 ms | 20.62% | 17.942 s | inf% | 17.941 s | 1046062.96% | FAIL |
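Since `bench_groupby_nvsum2` is not part of the public cudf benchmarks yet, here is a minimal standalone sketch that approximates the same case under stated assumptions: a hash groupby SUM over a synthetic UINT32 value column with no Parquet read. The row count, group count, and chrono-based timing are illustrative choices, not the actual benchmark code.

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/binaryop.hpp>
#include <cudf/filling.hpp>
#include <cudf/groupby.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/table/table_view.hpp>

#include <cuda_runtime.h>

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
  constexpr cudf::size_type num_rows = 1 << 24;

  // Values 0..num_rows-1 as UINT32, the dtype that shows the regression.
  cudf::numeric_scalar<uint32_t> init{0}, step{1};
  auto values = cudf::sequence(num_rows, init, step);

  // Keys = values % 1000, giving roughly 1000 groups.
  cudf::numeric_scalar<uint32_t> groups{1000};
  auto keys = cudf::binary_operation(values->view(), groups, cudf::binary_operator::MOD,
                                     cudf::data_type{cudf::type_id::UINT32});

  // Hash groupby SUM, same call pattern as the repro above.
  cudf::groupby::groupby grouper(cudf::table_view({keys->view()}));
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = values->view();
  requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());

  auto start  = std::chrono::steady_clock::now();
  auto result = grouper.aggregate(requests);
  cudaDeviceSynchronize();  // make sure the aggregation kernels have finished
  auto stop = std::chrono::steady_clock::now();

  std::cout << "groupby SUM over UINT32 took "
            << std::chrono::duration<double, std::milli>(stop - start).count() << " ms\n";
  return 0;
}
```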
I believe this is safe to close
We are seeing some really big regressions in our NDS benchmarks after this PR went in: https://github.com/rapidsai/cudf/pull/14679. We verified that reverting this commit fixes the regression.
Looking at an nsys trace, the time is all spent in the groupby, specifically in this for_each call: