rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.43k stars 904 forks source link

[FEA] Coalesce #3754

Closed revans2 closed 1 year ago

revans2 commented 4 years ago

Is your feature request related to a problem? Please describe. We would like to implement the coalesce SQL function for spark. There currently exists a binary op enum for this, but there is no backend for it so we end up with JIT compilation errors if we try to use it.

Describe the solution you'd like At a minimum we would like to see the binary op coalesce implemented. The spark version can take a variable number of arguments so if we could get something that takes an array of columns/scalars that would be even better because we would not have to materialize as many intermediate values.

Describe alternatives you've considered we could implement this with isNull and copyIfelse, but we would have to materialize more intermediate results.

devavret commented 4 years ago

Can you describe this a bit? I'm unable to understand how it is a binary op. From here it seems it's more like a drop nulls function except only the first value is required.

Do you mean given two columns, coalesce would return this:

column_view lhs;
column_view rhs;

for i in (0, lhs.size())
  if (lhs.isvalid(i))
    output[i] = lhs[i];
  else if (not lhs.isvalid(i) and rhs.isvalid(i))
   output[i] = rhs[i];
  else
    output.set_null(i);
revans2 commented 4 years ago

Sorry for the late reply. Yes that is 100% correct.

sriramch commented 4 years ago

@revans2 what is the expected behavior for coalesce(column, scalar)? is it to replace null values in column with scalar, if scalar is valid; else return column?

also, what about coalesce(scalar, column)? is it to create a column filled with scalar values if it is valid; else return column?

@devavret is this issue open to evaluate/consider supporting array of columns/scalars to binaryop, or to simply add binaryop support for colaesce? if it is the latter, wouldn't it still require callers to create intermediary columns, if there are a number of such columns to be coalesced? if so, how would this be different from copy_if_else?

revans2 commented 4 years ago

Sorry, we have an implementation based off of copy_if_else. What we have right now looks like the following psudo code.

auto lhs_not_null = is_not_null(lhs);
auto result = copy_if_else(lhs, rhs, lhs_not_null);

So a binary op would save us materializing one BOOL8 column. If we could get something with more than one column/scalar as input it would save us even more.

sriramch commented 4 years ago

thanks @revans2. can you also confirm the coalesce behavior with scalars, if it is indeed a valid use-case?

revans2 commented 4 years ago

Yes we do support scalars as if they were exploded out into a column

devavret commented 4 years ago

Wait a minute. I think we have this: https://github.com/rapidsai/cudf/blob/08d828032afe8a7ca99efee701af21dcdacbaf01/cpp/include/cudf/replace.hpp#L24-L39

sriramch commented 4 years ago

@devavret - great - that ought to do it. i was wondering whether this api would bring over the value for a given row from the input column to output, if both the input and replacement were non null, and from a cursory read of the code, it seems to do it.

so, do we assume now that binop support for an array of columns aren't going to be supported anytime soon?

devavret commented 4 years ago

For binop support for an array of columns, there's an issue of API design. Since the value priority seems to be dependent upon the index of the column, the API should look something like cudf::coalesce(std::vector<column_view> views) where values from views[0] have first priority right? I can't figure out how to intermingle scalars in there. So if your data was

c0 = [1, Null, Null]
s = 2
c1 = [3, 4, 5]

The result should be

r = [1, 2, 2]

But what format should arguments for cudf::coalesce be to put the scalar between the two cols.

In general, what should the API be to enable coalesce of c0, c1, s0, c2, s1, s2?

So a true generic API wouldn't be neat.

harrism commented 4 years ago

IIUC it's not a vector op, it's a binary op. There are always two inputs and one output. So there are 4 combinations:

  1. vector = coalesce(vector, vector);
  2. vector = coalesce(vector, scalar);
  3. vector = coalesce(scalar, vector);
  4. scalar = coalesce(scalar, scalar);

To me, this seems like yet another reason to eliminate cudf::scalar. :)

jlowe commented 4 years ago

IIUC it's not a vector op, it's a binary op.

As @revans2 mentioned above, Spark's coalesce operation is variadic, taking the first non-null value in the variable list of arguments. With only a binary op, we'd have to chain the operations as pairs which isn't as efficient (but still better than the copy_if_else method we have today).

Thinking out loud on the scalars front, we keep coming across cases where we want a scalar to proxy as a column_view full of the same value without needing to blow out the duplicate values in memory. It would be nice if libcudf had the ability for scalars to appear like a column_view without needing to manifest the data in memory. I realize that's not possible with how column_view works, but I'm thinking of operations that just want to know how many rows there are and iterators to the input values. copy_if_else used iterators to help reduce duplication between handling scalars and columns, and it would be nice if there was some common base-class between column_view and scalar where they both could be passed as arguments to the same method for cases where the scalar just needs to provide an iterator (and potentially row count). Something like column_value_iterable. Would that make sense? cc: @jrhemstad

sriramch commented 4 years ago

a const iterator to scalar does exist today (where the scalar iterator keeps returning the same value) which can be used to iterate over a column and the scalar value in tandem.

devavret commented 4 years ago

@sriramch, that’s the one used by copy if else. I think @jlowe meant that we should add a common base class that lets us pass both a column view and a scalar reference/view to a cudf function. And the iterator is what would be common between both. As in, they should both provide an iterator that just works and that is opaque to the user whether it’s coming from an underlying scalar or column.

So that would allow a generalised coalesce API like this: cudf::coalesce(std::vector<common_view>)

GregoryKimball commented 1 year ago

Spark-RAPIDS is now using the replace_nulls API to perform the transformation that was originally requested here. I'll close this for now, and let's please open a new issue for future requirements.