Related issue https://github.com/NVIDIA/spark-rapids/issues/7529
Is the low throughput specific to low-cardinality data, or does it also affect data with skewed cardinality? That is, suppose I have many distinct group keys, but most of the weight of the data falls on only a few of them. In that scenario, as you describe the implementation, I think we will still run slowly. However, dispatching based on absolute cardinality will not be that useful, since it is not a true representation of the data distribution.
What is the throughput difference between hyperloglog and count distinct? We expect the memory footprint of hyperloglog to be much lower, but I don't believe the throughput has been measured in a controlled way.
Here are the latest benchmark numbers for HLL: https://github.com/NVIDIA/cuCollections/pull/429#issuecomment-1999730719 tl;dr: Depending on the parameters, we achieve between 72% and 89% of the SOL memory bandwidth of an H100.
The memory footprint is only a few KB, or 1 MB at most, and depends on the precision value. More precisely, it is 4 * 2^precision bytes, where the precision is typically in the range [10, 18]. So yes, it should be much smaller than a fully fledged exact distinct-count algorithm.
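As a quick sanity check on that formula (a minimal sketch; the helper name is invented for this example), the sketch size across the typical precision range works out to a few KB up to 1 MB:

```cpp
#include <cstddef>
#include <cstdio>

// HLL sketch size in bytes, per the comment above: 4-byte registers, 2^precision of them.
constexpr std::size_t hll_sketch_bytes(int precision) { return std::size_t{4} << precision; }

int main()
{
  for (int p = 10; p <= 18; p += 4) {
    std::printf("precision %2d -> %8zu bytes\n", p, hll_sketch_bytes(p));
  }
  // precision 10 ->     4096 bytes (~4 KB)
  // precision 14 ->    65536 bytes (64 KB)
  // precision 18 ->  1048576 bytes (1 MB)
  return 0;
}
```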
Using this issue to continue the discussions in https://github.com/NVIDIA/spark-rapids/issues/7529.
Low-cardinality join and low-cardinality groupby share some commonalities, but the performance bottlenecks are not the same:
With that, to address this issue:
For groupby:
For high-multiplicity hash join:
Hello, let's please continue the discussion of high-multiplicity joins in https://github.com/rapidsai/cudf/issues/16025, and continue the discussion of low-cardinality groupby here in this issue.
Shared memory hash maps are in progress in cuco here: https://github.com/NVIDIA/cuCollections/pull/551, but this work only impacts the insert_or_apply API in cuco.
We will need separate changes in https://github.com/rapidsai/cudf/blob/branch-24.10/cpp/src/groupby/hash/groupby.cu to add shared memory optimizations.
Is your feature request related to a problem? Please describe.
libcudf aggregations show lower throughput when the data cardinality is less than ~1000 distinct values. This is due to atomic operations serializing over a small range of global memory. We received some projections for an approach that uses hash maps which begin in shared memory and then spill to global memory if they exceed a certain size. The projections indicate a 2-10x speedup for cardinalities below 100.
(Aggregation throughput data was collected for a groupby max over 20M rows of int64 keys and int64 payloads, based on the benchmarks introduced in https://github.com/rapidsai/cudf/pull/15134 and run on A100 hardware. The projections were provided as speedup versus cardinality and were applied to the measured A100 throughput to yield the projected throughput.)
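To make the idea concrete, below is a minimal conceptual sketch (not cudf's or cuco's implementation) of a per-block shared-memory hash map for a groupby max that spills to global-memory atomics when the block-local map is full. To keep it short, it assumes int32 keys that are already dense in [0, num_groups), so the global fallback is a plain atomicMax into a dense output array rather than a global hash table; the capacity and hash function are illustrative only.

```cpp
#include <climits>
#include <cuda_runtime.h>

constexpr int SHARED_CAPACITY = 1024;  // slots per block, power of two (illustrative)
constexpr int EMPTY_KEY       = -1;    // sentinel; assumes keys are non-negative

// global_max must be sized to num_groups and pre-initialized to INT_MIN by the caller.
__global__ void groupby_max_shared(int const* keys, int const* values, int num_rows, int* global_max)
{
  __shared__ int s_keys[SHARED_CAPACITY];
  __shared__ int s_vals[SHARED_CAPACITY];

  // Initialize the block-local map.
  for (int i = threadIdx.x; i < SHARED_CAPACITY; i += blockDim.x) {
    s_keys[i] = EMPTY_KEY;
    s_vals[i] = INT_MIN;
  }
  __syncthreads();

  // Phase 1: aggregate into shared memory; spill to global atomics on overflow.
  for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < num_rows;
       row += gridDim.x * blockDim.x) {
    int const key = keys[row];
    int const val = values[row];
    int slot      = key & (SHARED_CAPACITY - 1);  // trivial hash for the sketch
    bool found    = false;
    for (int probe = 0; probe < SHARED_CAPACITY; ++probe) {
      int const existing = atomicCAS(&s_keys[slot], EMPTY_KEY, key);
      if (existing == EMPTY_KEY || existing == key) {  // slot claimed, or already holds our key
        atomicMax(&s_vals[slot], val);
        found = true;
        break;
      }
      slot = (slot + 1) & (SHARED_CAPACITY - 1);  // linear probing
    }
    if (!found) { atomicMax(&global_max[key], val); }  // shared map full: spill to global
  }
  __syncthreads();

  // Phase 2: merge the block-local partial maxima into the global result.
  for (int i = threadIdx.x; i < SHARED_CAPACITY; i += blockDim.x) {
    if (s_keys[i] != EMPTY_KEY) { atomicMax(&global_max[s_keys[i]], s_vals[i]); }
  }
}
```

When most rows hit the shared-memory fast path, the contended global atomics that serialize the low-cardinality case are replaced by one flush per group per block, which is where the projected speedup comes from.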
Describe the solution you'd like
We could provide an implementation that uses shared memory hash maps when cardinality is low. Shared memory as storage is supported in cuCollections, so we could leverage this option to offer a higher-throughput code path when cardinality is low.
As far as the API design, we could add an optional cardinality parameter to the aggregate API. When hyperloglog cardinality estimates are available in cuCollections, we may want to support cardinality estimates as well. Some open questions include:

Describe alternatives you've considered
We aren't sure how common low-cardinality aggregation keys are in customer workloads. Are there cases where the cardinality will be known ahead of time, or will it always need to be computed or estimated before triggering the aggregation? Could we instrument NDS to log the cardinality and row count before each aggregation node?
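To make the API question concrete, a hypothetical host-side dispatch might look like the sketch below; the names choose_groupby_path, cardinality_hint, and shared_map_capacity are invented for this example and are not a proposed cudf signature.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical dispatch: stay on the existing global-memory hash aggregation
// unless the caller supplies a cardinality hint (or, later, a hyperloglog
// estimate) small enough to fit the per-block shared-memory map.
enum class groupby_path { shared_memory, global_memory };

groupby_path choose_groupby_path(std::optional<std::size_t> cardinality_hint,
                                 std::size_t shared_map_capacity = 1024)
{
  if (cardinality_hint && *cardinality_hint <= shared_map_capacity) {
    return groupby_path::shared_memory;
  }
  return groupby_path::global_memory;  // today's default code path
}
```

Making the hint optional keeps the current global-memory path as the default when nothing is known about the keys, and a hyperloglog estimate could feed the same parameter later.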
Additional context
We could also consider using shared memory hash maps for low-cardinality distinct-key joins. This optimization is mentioned in https://github.com/rapidsai/cudf/issues/14948.