rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

Support callables in DataFrame.assign #2591

Open jangorecki opened 5 years ago

jangorecki commented 5 years ago

The value I am trying to compute is the range between two measure variables, v1 and v2, within groups defined by the id2 and id4 categories. The following pandas/dask syntax would be expected to work, but in cudf it raises the error shown below:

ans = x.groupby(['id2','id4']).agg({'v1': 'max', 'v2': 'min'}).assign(range_v1_v2=lambda x: x['v1'] - x['v2'])[['range_v1_v2']]
#  File "pyarrow/array.pxi", line 536, in pyarrow.lib.Array.from_pandas
#  File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
#  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
#  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
#pyarrow.lib.ArrowInvalid: Only 1D arrays accepted

Reproducible example:

import cudf as cu
ver = cu.__version__
print(ver)
#0.8.0+0.g8fa7bd3.dirty
src_grp = "G1_1e7_1e2_0_0.csv"
x = cu.read_csv(src_grp, skiprows=1,
                names=['id1','id2','id3','id4','id5','id6','v1','v2','v3'],
                dtype=['str','str','str','int','int','int','int','int','float'])
ans = x.groupby(['id2','id4']).agg({'v1': 'max', 'v2': 'min'}).assign(range_v1_v2=lambda x: x['v1'] - x['v2'])[['range_v1_v2']]

Generate the data as described in https://github.com/rapidsai/cudf/issues/2494

shwina commented 5 years ago

Thanks for reporting, @jangorecki. It looks like the underlying issue here is that cudf's DataFrame.assign doesn't yet accept callable arguments such as the lambda passed for range_v1_v2.
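Until callables are supported, one possible workaround is to evaluate them before calling assign. The sketch below is not part of cudf: assign_with_callables is a hypothetical helper name, and it assumes only that the frame's assign method accepts column keyword arguments and returns a new frame. It resolves callables one at a time, in keyword order, so a later callable can refer to a column created by an earlier one, as pandas does.

```python
def assign_with_callables(df, **kwargs):
    """Hypothetical helper: emulate pandas-style callable support in assign.

    Each callable value is called with the frame built so far and replaced
    by its result; non-callable values are passed through unchanged.
    Assumes `df.assign(**columns)` returns a new frame.
    """
    out = df
    for name, value in kwargs.items():
        if callable(value):
            # Evaluate against the current frame so earlier new columns are visible.
            value = value(out)
        out = out.assign(**{name: value})
    return out
```

With such a helper, the expression from the report could be written as assign_with_callables(x.groupby(['id2','id4']).agg({'v1': 'max', 'v2': 'min'}), range_v1_v2=lambda d: d['v1'] - d['v2'])[['range_v1_v2']], assuming the aggregated frame supports column subtraction as usual.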

edwintyh commented 3 years ago

Hello, I was wondering if this feature has been implemented?

kkraus14 commented 3 years ago

I don't believe that assign supports lambdas yet. Contributions welcome!