rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.38k stars 894 forks source link

[BUG] Random sampling with cupy does not support these inputs #12344

Open mlahir1 opened 1 year ago

mlahir1 commented 1 year ago

Weighted sampling with cudf series is documented as supported but throws the following error

Repro:

a = cudf.Series(range(100))
a.sample(n=10, weights=a.values_host, random_state=0)

or

a = cudf.Series(range(100))
a.sample(n=10, weights=a.values_host, random_state=0)

Exception:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
File /conda/envs/rapids-22.12/lib/python3.9/site-packages/cudf/core/indexed_frame.py:2952, in IndexedFrame._sample_axis_0(self, n, weights, replace, random_state, ignore_index)
   2951 try:
-> 2952     gather_map_array = random_state.choice(
   2953         len(self), size=n, replace=replace, p=weights
   2954     )
   2955 except NotImplementedError as e:

File /conda/envs/rapids-22.12/lib/python3.9/site-packages/cupy/random/_generator.py:1066, in RandomState.choice(self, a, size, replace, p)
   1065 if not replace:
-> 1066     raise NotImplementedError
   1068 if p is not None:

NotImplementedError: 

The above exception was the direct cause of the following exception:

NotImplementedError                       Traceback (most recent call last)
Cell In [25], line 2
      1 a = cudf.Series(range(100))
----> 2 a.sample(n=10, weights=a, random_state=0)

File /conda/envs/rapids-22.12/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /conda/envs/rapids-22.12/lib/python3.9/site-packages/cudf/core/indexed_frame.py:2930, in IndexedFrame.sample(self, n, frac, replace, weights, random_state, axis, ignore_index)
   2927     weights = weights / weights.sum()
   2929 if axis == 0:
-> 2930     return self._sample_axis_0(
   2931         n, weights, replace, random_state, ignore_index
   2932     )
   2933 else:
   2934     if isinstance(random_state, cp.random.RandomState):

File /conda/envs/rapids-22.12/lib/python3.9/site-packages/cudf/core/indexed_frame.py:2956, in IndexedFrame._sample_axis_0(self, n, weights, replace, random_state, ignore_index)
   2952     gather_map_array = random_state.choice(
   2953         len(self), size=n, replace=replace, p=weights
   2954     )
   2955 except NotImplementedError as e:
-> 2956     raise NotImplementedError(
   2957         "Random sampling with cupy does not support these inputs."
   2958     ) from e
   2960 return self._gather(
   2961     cudf.core.column.as_column(gather_map_array),
   2962     keep_index=not ignore_index,
   2963     check_bounds=False,
   2964 )

NotImplementedError: Random sampling with cupy does not support these inputs.
wence- commented 1 year ago

This is a current restriction of cupy, which we use to generate random samples: it is not possible to sample without replacement when using non-uniform weighting probabilities. I suspect therefore that the best way to get this if you need it is to ask there. It is possible to work-around this restriction by creating a supplementary array that gives each entry in your series the correct multiplicity and then sampling uniformly without replacement, but that might be a bit fiddly depending on use case.

GregoryKimball commented 1 year ago

@wence- Are there any cases where the weights argument is implemented? Should we update the docs?

wence- commented 1 year ago

Are there any cases where the weights argument is implemented?

Sampling with replacement is fine.

vyasr commented 1 year ago

When @isVoid wrote this, he and I didn't think too hard about documenting the exact supported cases since there are multiple different possibilities (CPU vs GPU sampling depending on whether we're operating row-wise or column-wise, weights, etc) and in most cases I think we just transparently support whatever the underlying libraries (numpy or cupy) support. In my weakly held opinion it would probably be easier to focus on making these error messages more instructive (e.g. "Random sampling with cupy does not support the input [insert input here]. Either [switch to using numpy for sampling] or [try this other workaround]") rather than trying to keep documentation up to date for each case.