rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add shared memory hash map for low-cardinality aggregations #15262

Open GregoryKimball opened 3 months ago

GregoryKimball commented 3 months ago

**Is your feature request related to a problem? Please describe.**
libcudf aggregations show lower throughput when the data cardinality is below roughly 1000 distinct values, because atomic operations serialize over a small range of global memory. We received projections for an approach that uses hash maps that start in shared memory and spill to global memory if they exceed a certain size. The projections indicate a 2-10x speedup for cardinalities below 100.

*(Figure: measured and projected aggregation throughput versus cardinality.)* Aggregation throughput data was collected for a groupby max over 20M rows of int64 keys and int64 payloads, based on benchmarks introduced in https://github.com/rapidsai/cudf/pull/15134 and run on A100 hardware. The projections were provided as speedup-versus-cardinality data and were applied to the measured A100 throughput to yield projected throughput.
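To make the bottleneck concrete, here is a minimal CUDA sketch of the kind of update pattern that hits it (illustrative only, not libcudf code; the dense `group_idx` input and the direct-indexed output array are simplifying assumptions): with cardinality C, every row's atomic update lands on one of only C global-memory slots, so conflicting updates serialize.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Naive groupby-max: one global atomic per input row.
__global__ void groupby_max_global_atomics(const int* group_idx,   // dense key index per row, in [0, C)
                                           const long long* vals,
                                           std::size_t n,
                                           long long* out_max)     // C slots, pre-initialized to LLONG_MIN
{
  for (std::size_t i = std::size_t(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
       i += std::size_t(gridDim.x) * blockDim.x) {
    // With C ~ 100, tens of thousands of concurrent threads contend on ~100
    // addresses, and the hardware serializes the conflicting atomic updates.
    atomicMax(&out_max[group_idx[i]], vals[i]);
  }
}
```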

**Describe the solution you'd like**
We could provide an implementation that uses shared-memory hash maps when cardinality is low. cuCollections already supports shared memory as hash-map storage, so we could leverage that option to offer a higher-throughput code path for low-cardinality inputs.
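As a rough sketch of the idea (under stated assumptions; this is not the cuCollections API and not libcudf internals): each block aggregates into a small open-addressing map in shared memory, then merges its per-block results into a global table, so each block issues at most one global update per distinct key it saw rather than one per input row. The names (`insert_or_max`, `SHARED_CAPACITY`, `EMPTY_KEY`), the trivial modulo hash, and the fixed capacity are invented for illustration; a real implementation would detect shared-table overflow and spill to, or fall back on, the global-memory path.

```cuda
#include <climits>
#include <cstddef>
#include <cuda_runtime.h>

constexpr int                SHARED_CAPACITY = 1024;                   // slots per block
constexpr unsigned long long EMPTY_KEY       = 0xFFFFFFFFFFFFFFFFull;  // sentinel; assumed not a valid key

// Insert-or-update (max) into an open-addressing table with linear probing.
// Used for both the shared-memory table and the global result table.
__device__ void insert_or_max(unsigned long long* keys, long long* vals,
                              int capacity, unsigned long long key, long long val)
{
  int slot = static_cast<int>(key % capacity);  // trivial hash (modulo)
  while (true) {
    unsigned long long prev = atomicCAS(&keys[slot], EMPTY_KEY, key);
    if (prev == EMPTY_KEY || prev == key) {     // claimed an empty slot or found the key
      atomicMax(&vals[slot], val);
      return;
    }
    slot = (slot + 1) % capacity;               // collision: probe the next slot
  }
}

__global__ void groupby_max_shared_map(const unsigned long long* in_keys,  // int64 keys bit-cast to u64
                                       const long long* in_vals, std::size_t n,
                                       unsigned long long* g_keys,  // pre-initialized to EMPTY_KEY
                                       long long* g_vals,           // pre-initialized to LLONG_MIN
                                       int g_capacity)
{
  __shared__ unsigned long long s_keys[SHARED_CAPACITY];
  __shared__ long long          s_vals[SHARED_CAPACITY];

  for (int i = threadIdx.x; i < SHARED_CAPACITY; i += blockDim.x) {
    s_keys[i] = EMPTY_KEY;
    s_vals[i] = LLONG_MIN;
  }
  __syncthreads();

  // Phase 1: aggregate into the block-local table. Atomics stay on-chip and
  // contention is per-block rather than across the whole grid.
  for (std::size_t i = std::size_t(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
       i += std::size_t(gridDim.x) * blockDim.x) {
    insert_or_max(s_keys, s_vals, SHARED_CAPACITY, in_keys[i], in_vals[i]);
  }
  __syncthreads();

  // Phase 2: merge the per-block results; at most one global update per
  // distinct key per block, instead of one per input row.
  for (int i = threadIdx.x; i < SHARED_CAPACITY; i += blockDim.x) {
    if (s_keys[i] != EMPTY_KEY) {
      insert_or_max(g_keys, g_vals, g_capacity, s_keys[i], s_vals[i]);
    }
  }
}
```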

As for the API design, we could add an optional cardinality parameter to the aggregate API. Once hyperloglog cardinality estimates are available in cuCollections, we may want to accept cardinality estimates as well. A few open questions are listed under the alternatives below.
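A hypothetical sketch of how such a hint could steer dispatch (the names `groupby_strategy`, `choose_strategy`, and `cardinality_hint` are invented here and are not a proposed libcudf signature):

```cpp
#include <cstddef>
#include <optional>

enum class groupby_strategy { shared_memory_map, global_memory_map };

// Pick a code path from a caller-supplied (or estimated) distinct-key count.
groupby_strategy choose_strategy(std::optional<std::size_t> cardinality_hint,
                                 std::size_t shared_map_capacity = 1024)
{
  if (cardinality_hint && *cardinality_hint <= shared_map_capacity) {
    return groupby_strategy::shared_memory_map;  // low cardinality: keep the map on-chip
  }
  return groupby_strategy::global_memory_map;    // unknown or high cardinality
}
```

Keeping the hint optional would mean callers that don't know the cardinality simply get today's global-memory path.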

**Describe alternatives you've considered**
We aren't sure how common low-cardinality aggregation keys are in customer workloads. Are there cases where the cardinality will be known ahead of time, or will it always need to be computed or estimated before triggering the aggregation? Could we instrument NDS to log the cardinality and row count before each aggregation node?

**Additional context**
We could also consider using shared-memory hash maps for low-cardinality distinct-key joins. This optimization is mentioned in https://github.com/rapidsai/cudf/issues/14948.

sameerz commented 3 months ago

Related issue https://github.com/NVIDIA/spark-rapids/issues/7529

wence- commented 3 months ago

Is the low throughput specific to low-cardinality data, or does it also affect data with a skewed key distribution? That is, suppose I have many distinct group keys, but most of the weight of the data falls on only a few of them. In that scenario, as you describe the implementation, I think we will still run slowly. However, dispatching based on absolute cardinality will not be that useful, since it is not a true representation of the data distribution.

sleeepyjack commented 3 months ago

> What is the throughput difference between hyperloglog and count distinct? We expect the memory footprint of hyperloglog to be much lower, but I don't believe throughput has been measured in a controlled way.

Here are the newest benchmark numbers for HLL: https://github.com/NVIDIA/cuCollections/pull/429#issuecomment-1999730719. tl;dr: depending on the parameters, we achieve 72-89% of the SOL (speed-of-light) memory bandwidth of an H100.

The memory footprint is only a few KB, or 1 MB at most, depending on the precision value. More precisely, it is 4 * 2^precision bytes, where precision is typically in the range [10, 18]. So yes, it should be much smaller than a fully fledged exact distinct-count algorithm.
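As a quick check on that formula: precision 10 gives 4 * 2^10 = 4 KiB, and precision 18 gives 4 * 2^18 = 1 MiB, which matches the few-KB-to-1-MB range above.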

PointKernel commented 1 month ago

Using this issue to continue the discussions in https://github.com/NVIDIA/spark-rapids/issues/7529.

Low-cardinality join and low-cardinality groupby share some commonalities, but the performance bottlenecks are not the same:

With that in mind, to address this issue, for groupby:

For high-multiplicity hash join:

GregoryKimball commented 2 weeks ago

Hello, let's please continue the discussion of high-multiplicity joins in https://github.com/rapidsai/cudf/issues/16025, and continue the discussion of low-cardinality groupby here in this issue.