Closed bdice closed 1 week ago
Here's a better reproducer. This works on systems with V100 GPUs or others with less memory than an A100:
```python
import rmm
import cudf
import cupy

rmm.reinitialize(managed_memory=True)

print("Creating series...")
nrows = 1_100_000_000
series = cudf.Series(cupy.random.randint(0, nrows, nrows, dtype="int64"))

print("Calling value_counts (hangs here for large input)...")
series.value_counts()
```
This calls through a groupby-count.
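As a CPU-side illustration of that equivalence (using pandas here rather than cudf, purely as a sketch), `value_counts` can be expressed as a group-by of the series on its own values followed by a count:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

# value_counts is essentially a groupby-count over the series' own values
vc = s.value_counts(sort=False).sort_index()
gb = s.groupby(s).count().sort_index()

assert (vc.values == gb.values).all()
```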
The root cause is tracked in https://github.com/NVIDIA/cccl/issues/1422.
A more immediate fix may be possible in cuCollections: https://github.com/NVIDIA/cuCollections/issues/576
On the cuDF side, we could avoid this problem by setting a maximum cuco set/map size of `INT_MAX`. Since we already have a `size_type` limit on the number of rows in cuDF, it's not possible for us to have a cardinality greater than ~2B. We would just increase the effective occupancy as we go from `INT_MAX/2` to `INT_MAX` inputs, which could have a performance impact. @PointKernel Do you foresee any issues with that approach?
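To make the occupancy point concrete, here is a back-of-the-envelope sketch (the 50% target occupancy is an assumption for illustration, not cuco's actual default): with capacity clamped at `INT_MAX`, the effective load factor stays at the target below `INT_MAX/2` inputs, then climbs toward 1.0 as cardinality approaches `INT_MAX`, which is where collision rates get bad.

```python
INT_MAX = 2**31 - 1

def effective_load_factor(n, target_occupancy=0.5, max_capacity=INT_MAX):
    # Desired capacity for n distinct keys at the target occupancy,
    # clamped to the hypothetical INT_MAX cap discussed above.
    capacity = min(int(n / target_occupancy), max_capacity)
    return n / capacity

print(round(effective_load_factor(INT_MAX // 2), 2))  # stays near the 0.5 target
print(round(effective_load_factor(INT_MAX), 2))       # cap forces occupancy to 1.0
```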
> We would just increase the effective occupancy as we go from `INT_MAX/2` to `INT_MAX` inputs, which could have a performance impact.

The performance impact can be super scary due to high collisions.

> @PointKernel Do you foresee any issues with that approach?
I'm working on a PR to fix the large-capacity issues from cuco's end. Most cuco operations are implemented via custom kernels, so a large map/set won't be a problem. The issue unveiled here is that the underlying implementation uses CUB algorithms, where the input size is hardcoded as `int`. Luckily, there are only a few places in cuco using CUB algorithms, so the fix should be fast. I'm experimenting with several options: custom kernels, pipelining CUB invocations, and an intermediate solution mixing the two.

Once it's done, no effort will be needed in libcudf.
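One way to sidestep a 32-bit `int` size parameter is to pipeline invocations over sub-`INT_MAX` chunks. A hypothetical sketch of the chunking arithmetic (plain Python for illustration, not cuco's actual implementation):

```python
INT_MAX = 2**31 - 1

def chunk_ranges(total_size, max_chunk=INT_MAX):
    # Split a 64-bit-sized input into [start, end) ranges that each fit
    # in a 32-bit int, so each piece can be handed separately to a
    # 32-bit-indexed algorithm. Hypothetical illustration only.
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + max_chunk, total_size)
        ranges.append((start, end))
        start = end
    return ranges

print(chunk_ranges(5, max_chunk=2))  # [(0, 2), (2, 4), (4, 5)]
```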
Update: I tested https://github.com/rapidsai/cudf/pull/16572 locally and it seems to fix the issue.
I ran into what I believe is the same issue while debugging a cugraph hang from a user. The input size was similar, but the cudf call that resulted in the hang was `groupby()`. Having noticed that the rapids-cmake/cuco PR above had merged, I updated to the latest cudf and the problem was resolved. @bdice, should this issue be closed?
I believe this is resolved now. Closing.
**Describe the bug**
I observe a hang when running `Series.value_counts` with over 1B rows of input. It succeeds at 1 billion rows but fails at 1.1 billion rows. I suspect this is due to some kind of calculation that exceeds `size_type`.

**Steps/Code to reproduce bug**

**Expected behavior**
Results are returned in 1-2 seconds for 1.1 billion rows, just like for 1 billion rows.