GregoryKimball opened this issue 3 months ago
Related issue https://github.com/NVIDIA/spark-rapids/issues/7529
Is the low throughput specific to low-cardinality data, or does it also occur for skewed data? That is, suppose I have many distinct group keys, but most of the weight of the data falls on only a few of them. In that scenario, as you describe the implementation, I think we would still run slowly. However, dispatching on absolute cardinality would not be very useful, since it is not a true representation of the data distribution.
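To make the skew scenario concrete, here is a small self-contained sketch (the key counts are made up purely for illustration) of a dataset whose distinct-key count is large while most rows land on a handful of hot keys:

```cpp
#include <cstdint>
#include <vector>

// Build a skewed key column: 10 "hot" keys with 90,000 rows each, plus
// 100,000 distinct "cold" keys with one row apiece. The absolute
// cardinality is high (100,010 distinct keys), yet 90% of the rows
// land on the 10 hot keys.
std::vector<std::uint64_t> make_skewed_keys() {
  std::vector<std::uint64_t> keys;
  for (std::uint64_t k = 0; k < 10; ++k)
    for (int r = 0; r < 90'000; ++r) keys.push_back(k);
  for (std::uint64_t k = 10; k < 100'010; ++k) keys.push_back(k);
  return keys;
}
```

Dispatching on the distinct count alone (100,010 here) would select the general code path, even though atomic contention on the 10 hot keys would dominate the runtime.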
What is the throughput difference between hyperloglog and count distinct? We expect the memory footprint of hyperloglog to be much lower, but I don't believe the throughput has been measured in a controlled way.
Here are the newest benchmark numbers for HLL: https://github.com/NVIDIA/cuCollections/pull/429#issuecomment-1999730719 tl;dr: depending on the parameters, we achieve between 72% and 89% of the SOL memory bandwidth of an H100.
The memory footprint is only a few KB, or 1 MB at most, and depends on the precision value. More precisely, it is 4 * 2^precision bytes, where the precision is typically in the range [10, 18]. So yes, it should be much smaller compared to a fully fledged exact distinct-count algorithm.
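The 4 * 2^precision formula quoted above can be evaluated directly (this assumes 4-byte registers, as stated):

```cpp
#include <cstdint>

// Sketch size in bytes for an HLL sketch with 4-byte registers,
// following the 4 * 2^precision formula above.
std::uint64_t hll_sketch_bytes(int precision) {
  return std::uint64_t{4} << precision;  // 4 * 2^precision
}
```

For the typical precision range this gives 4 KB at precision 10 up to 1 MB at precision 18, matching the "few KB or 1 MB max" figure.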
Using this issue to continue the discussions in https://github.com/NVIDIA/spark-rapids/issues/7529.
Low-cardinality join and low-cardinality groupby share some commonalities, but the performance bottlenecks are not the same:
With that, to address this issue:
For groupby:
For high-multiplicity hash join:
Hello, let's please continue the discussion of high-multiplicity joins in https://github.com/rapidsai/cudf/issues/16025, and continue the discussion of low-cardinality groupby here in this issue.
Is your feature request related to a problem? Please describe. libcudf aggregations show lower throughput when the data cardinality is below ~1000 distinct values. This is due to serialized atomic operations over a small range of global memory. We have received projections for an implementation whose hash maps begin in shared memory and then spill to global memory if they exceed a certain size. The projections indicate a 2-10x speedup for cardinalities below 100.
Describe the solution you'd like We could provide an implementation that uses shared memory hash maps when cardinality is low. Shared memory as storage is supported in cuCollections, so we could leverage this option to offer a higher-throughput code path when cardinality is low.
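A host-side sketch of the two-level idea, for illustration only (it is not libcudf or cuCollections code; the names and the 50% load-factor spill threshold are assumptions): aggregate into a small fixed-capacity table that stands in for a per-block shared-memory hash map, and spill to a large "global" map once the small table fills up.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Count occurrences per key using a small linear-probed table first
// ("shared memory" stage), spilling to a general hash map ("global
// memory" stage) when the small table exceeds a fill threshold.
std::unordered_map<std::uint64_t, std::uint64_t>
count_by_key(std::vector<std::uint64_t> const& keys,
             std::size_t small_capacity = 128) {
  std::vector<std::pair<std::uint64_t, std::uint64_t>> small;
  small.assign(small_capacity, {UINT64_MAX, 0});  // UINT64_MAX marks empty
  std::unordered_map<std::uint64_t, std::uint64_t> global;  // spill target
  std::size_t filled = 0;

  for (auto key : keys) {
    bool placed = false;
    for (std::size_t probe = 0; probe < small_capacity; ++probe) {
      auto& slot = small[(key + probe) % small_capacity];
      if (slot.first == key) { ++slot.second; placed = true; break; }
      if (slot.first == UINT64_MAX) {
        if (filled + 1 > small_capacity / 2) break;  // keep load factor low
        slot = {key, 1}; ++filled; placed = true; break;
      }
    }
    if (!placed) ++global[key];  // spill: too many distinct keys
  }
  // Merge the small table into the result.
  for (auto const& [k, c] : small)
    if (k != UINT64_MAX) global[k] += c;
  return global;
}
```

On the GPU the small table would live in shared memory with one table per thread block and atomic updates per slot; the point of the sketch is just the control flow: fast path while cardinality stays small, graceful fallback otherwise.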
As far as the API design, we could add an optional `cardinality` parameter to the `aggregate` API. When hyperloglog cardinality estimates are available in cuCollections, we may want to support cardinality estimates as well. Some open questions remain.

Describe alternatives you've considered We aren't sure how common low-cardinality aggregation keys are in customer workloads. Are there cases where the cardinality will be known ahead of time, or will it always need to be computed or estimated before triggering the aggregation? Could we instrument NDS to log cardinality and row count before each aggregation node?
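The optional-cardinality-parameter idea could look roughly like the following; this is a hypothetical shape, not the real libcudf signature, and the 1000-key threshold is only the figure quoted earlier in this issue:

```cpp
#include <cstddef>
#include <optional>

// Hypothetical dispatch: the caller may pass a cardinality hint, and the
// library picks the shared-memory path only when the hint is low enough.
enum class groupby_path { shared_memory, global_memory };

groupby_path choose_groupby_path(std::optional<std::size_t> cardinality_hint,
                                 std::size_t shared_mem_threshold = 1000) {
  // Without a hint we must be conservative and take the general path;
  // alternatively, an HLL estimate could be computed first.
  if (!cardinality_hint) return groupby_path::global_memory;
  return *cardinality_hint <= shared_mem_threshold
             ? groupby_path::shared_memory
             : groupby_path::global_memory;
}
```

An `std::nullopt` default would keep the current behavior for existing callers, which is one way to answer the "will cardinality always need to be estimated first" question: the hint stays strictly optional.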
Additional context We could also consider using shared memory hash maps for low-cardinality distinct-key joins. This optimization is mentioned in https://github.com/rapidsai/cudf/issues/14948.