rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Series.value_counts hangs with over 1B rows of input #16526

Closed: bdice closed this issue 1 week ago

bdice commented 1 month ago

Describe the bug

I observe a hang when running Series.value_counts with over 1B rows of input. It succeeds at 1 billion rows but hangs at 1.1 billion rows. I suspect this is caused by some intermediate calculation exceeding the size_type limit.

Steps/Code to reproduce bug

import cudf
import cupy

# 1.1 billion rows: value_counts succeeds at 1 billion rows but hangs here.
nrows = 1_100_000_000
series = cudf.Series(cupy.random.randint(0, nrows, nrows, dtype='int64'))

series.value_counts()

Expected behavior

Results are returned in 1-2 seconds for 1.1 billion rows, just like for 1 billion rows.

bdice commented 1 month ago

Here's a better reproducer. It works on systems with V100 GPUs or other GPUs with less memory than an A100:

import rmm
import cudf
import cupy

# Managed (unified) memory lets the 1.1B-row workload spill past device memory,
# so the repro also runs on GPUs with less memory than an A100.
rmm.reinitialize(managed_memory=True)

print("Creating series...")
nrows = 1_100_000_000
series = cudf.Series(cupy.random.randint(0, nrows, nrows, dtype='int64'))

print("Calling value_counts (hangs here for large input)...")
series.value_counts()

bdice commented 1 month ago

This calls through a groupby-count.
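
For reference, a rough Python-level equivalent of that path (my own illustration, not cuDF's exact implementation), using the series from the reproducer above:

counts = series.to_frame(name="value").groupby("value").size()

So the hang surfaces in the groupby/aggregation machinery rather than in value_counts itself.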

bdice commented 1 month ago

The root cause is tracked in https://github.com/NVIDIA/cccl/issues/1422.

A more immediate fix may be possible in cuCollections: https://github.com/NVIDIA/cuCollections/issues/576

bdice commented 1 month ago

On the cuDF side, we could avoid this problem by setting a maximum cuCo set/map size of INT_MAX. Since we already have a size_type limit on the number of rows in cuDF, it's not possible for us to have a cardinality greater than 2B. We would just increase the effective occupancy as we go from INT_MAX/2 to INT_MAX inputs, which could have performance impact. @PointKernel Do you foresee any issues with that approach?
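
To make the trade-off concrete, here's a back-of-the-envelope sketch (not libcudf code; the 50% target load factor is only an assumed default for illustration) of how the effective occupancy climbs once the requested capacity is clamped at INT_MAX:

INT_MAX = 2**31 - 1
target_load_factor = 0.5  # assumed desired occupancy, for illustration only

def clamped_capacity(num_rows):
    # Request the usual capacity, but never exceed INT_MAX.
    return min(int(num_rows / target_load_factor), INT_MAX)

for num_rows in (INT_MAX // 2, int(0.75 * INT_MAX), INT_MAX):
    capacity = clamped_capacity(num_rows)
    print(f"rows={num_rows:>13,} capacity={capacity:>13,} occupancy={num_rows / capacity:.2f}")

Below INT_MAX/2 rows nothing changes; above that, the table runs at progressively higher occupancy, which is where the collision cost comes from.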

PointKernel commented 1 month ago

> We would just increase the effective occupancy as we go from INT_MAX/2 to INT_MAX inputs, which could have performance impact.

The performance impact could be severe due to the high collision rate at high occupancy.

> @PointKernel Do you foresee any issues with that approach?

I'm working on a PR to fix the large-capacity issues from cuco's end. Most cuco operations are implemented via custom kernels, so a large map/set won't be a problem. The issue surfaced here arises because the underlying implementation uses CUB algorithms whose input size is hardcoded as int. Luckily, only a few places in cuco use CUB algorithms, so the fix should be fast. I'm experimenting with several options, such as custom kernels, pipelining CUB invocations, and an intermediate solution that mixes the two.
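
As a conceptual illustration of the "pipelining CUB invocations" option (a Python sketch of the general chunking technique, not the actual cuco/CUB fix): when a primitive only accepts a 32-bit size, the input can be processed in chunks of at most INT_MAX elements and the partial results combined.

import cupy

INT_MAX = 2**31 - 1

def chunked_sum(arr):
    # Each slice handed to the underlying reduction holds at most INT_MAX elements,
    # so its length always fits in a 32-bit int.
    total = 0
    for start in range(0, arr.size, INT_MAX):
        total += int(arr[start:start + INT_MAX].sum())
    return total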

Once it's done, no effort is needed in libcudf.

PointKernel commented 1 month ago

Update: I tested https://github.com/rapidsai/cudf/pull/16572 locally and it seems to fix the issue.

rlratzel commented 1 week ago

I ran into what I believe is the same issue while debugging a cugraph hang reported by a user. The input size was similar, but the cudf call that hung was groupby(). After noticing that the rapids-cmake/cuco PR above had been merged, I updated to the latest cudf and the problem was resolved. @bdice, should this issue be closed?

bdice commented 1 week ago

I believe this is resolved now. Closing.