rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Parquet reader filter improvements #17142

Open wence- opened 1 month ago

wence- commented 1 month ago

Is your feature request related to a problem? Please describe.

In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics; however, they can still be applied in a post-filtering stage.

If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my transliteration exactly which AST expressions the parquet reader supports in its statistics filters, and only deliver the filter to the parquet reader if it is one that is understood.

For example, column_name_reference("a") < literal(...) is a supported expression, but literal(...) > column_name_reference("a") is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.

This is suboptimal in two ways:

  1. Exactly which filter expressions are supported is now encoded in two places, and I might have got it wrong.
  2. If parts of the whole filter expression are supported by the parquet reader, they are still applied in a post-filter stage, rather than being applied at the row group level.
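
As a concrete illustration of the flipped-comparison case above, here is a minimal Python sketch of normalizing literal OP column into the supported column OP' literal form. The tuple encoding and function names are hypothetical for illustration, not the cudf-polars translation code:

```python
# Hypothetical tuple encoding: ("col", name), ("lit", value), and
# (op, lhs, rhs) for comparisons; this is not the cudf-polars code.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "==": "==", "!=": "!="}

def normalize(expr):
    """Rewrite literal-vs-column comparisons so the column sits on the left."""
    op, lhs, rhs = expr
    if lhs[0] == "lit" and rhs[0] == "col":
        # literal OP column  is equivalent to  column FLIP[OP] literal
        return (FLIP[op], rhs, lhs)
    return expr

# literal(10) < column_name_reference("a")
# becomes column_name_reference("a") > literal(10)
print(normalize(("<", ("lit", 10), ("col", "a"))))
```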

Describe the solution you'd like

Describe alternatives you've considered

For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined to be unsupported.

For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.

Additional context

What I'm doing now: https://github.com/rapidsai/cudf/pull/17141

Additional feature req: support filtering row groups based on nulls, i.e. support is_null(column_name_reference(...)) in the statistics reader.

GregoryKimball commented 1 month ago

@wence- Thank you for mapping out this request.

wence- commented 1 month ago

Picking up on the additional feature request, I believe we added support for is_null in https://github.com/rapidsai/cudf/pull/13145. Is something not working correctly?

That's supported in ASTs, but not in the statistics filter (e.g. discarding a row group where the filter is column.is_not_null() because the row group's null count is equal to the number of rows in the group).
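
For reference, the requested check amounts to something like the following sketch, using the per-column null count that parquet row-group metadata already carries (the function names and stats layout are illustrative, not libcudf API):

```python
def discard_for_is_not_null(num_rows, null_count):
    # A row group in which every value of the column is null cannot
    # satisfy column.is_not_null(), so the group can be skipped entirely.
    return null_count == num_rows

def discard_for_is_null(null_count):
    # Dually, a row group with no nulls cannot satisfy is_null(column).
    return null_count == 0
```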

wence- commented 1 month ago

What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR; does test_evaluation contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.

Here are some examples in pseudo-code; I can write things out in C++ once we converge on something that makes sense:

expr = (10 < a)

This would be supported by the statistics reader if the user had instead written:

expr = (a > 10)

Something more complicated:

expr = (a < 10) or ((a + b < c) and (b < 1))

Here two parts of the expression can discard row groups, a < 10 and b < 1, but the middle part of the expression, a + b < c, is probably only worthwhile applying as a post-filter (we could support some row group filtering by discarding groups where a_min + b_min >= c_max, since then no row in the group can satisfy it), and it is currently unsupported by the statistics row group filtering.

That is, I can discard row-groups with a < 10 or b < 1 and then apply the full expression as a post-filter.

expr_stat = (a < 10) or (b < 1)
expr = (a < 10) or ((a + b < c) and (b < 1))
df = read_parquet(..., expr_stat).filter(expr)

For arbitrary expressions, determining which bits can be applied as statistics filters is programmatically achievable by converting the input expression to one of DNF or CNF and then taking those terms which the reader supports.
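
One way that term-picking could look, as a minimal Python sketch over a hypothetical DNF representation (a list of disjuncts, each a list of conjunct terms): dropping an unsupported conjunct only weakens its disjunct, so the resulting expression admits a superset of the row groups and remains sound as a pre-filter.

```python
def statistics_filter(dnf, supported):
    """dnf: list of disjuncts; each disjunct is a list of conjunct terms.

    Returns a weakened DNF usable as a row-group statistics filter, or
    None if some disjunct has no supported conjunct (in which case no
    row group can be ruled out and everything must be read).
    """
    out = []
    for disjunct in dnf:
        kept = [term for term in disjunct if supported(term)]
        if not kept:
            return None  # this disjunct gives no pruning power at all
        out.append(kept)
    return out

# expr = (a < 10) or ((a + b < c) and (b < 1))
dnf = [["a < 10"], ["a + b < c", "b < 1"]]
supported = lambda term: term in {"a < 10", "b < 1"}
print(statistics_filter(dnf, supported))  # [['a < 10'], ['b < 1']]
```

The printed result corresponds to expr_stat = (a < 10) or (b < 1) from the example above; the full expression is still applied as the post-filter.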

wence- commented 1 month ago

but the middle part of the expression a + b < c is probably only worthwhile applying as a post-filter (we could support some row group filtering by discarding groups where a_min + b_min >= c_max)

For inequality comparisons, one could imagine writing an AST visitor that can handle a larger number of such expressions. To take < as an example, lexpr < rexpr as a statistics filter needs to keep a row group whenever lexpr.min() < rexpr.max(); if lexpr.min() >= rexpr.max(), no row in the group can satisfy the comparison, so the group can be discarded. One can write distribution rules for max and min over binary operators (at least some of them), but the bounds get pessimistic quite quickly. For example, for addition, subtraction, multiplication, and division we have:

(a + b).max() <= a.max() + b.max()
(a - b).max() <= a.max() - b.min()
(a * b).max() <= a.max() * b.max()   (valid only for nonnegative values; in general, take the max over the four corner products of {a.min(), a.max()} with {b.min(), b.max()})
(a / b).max() <= a.max() / b.min()   (likewise sign-sensitive, and only sound when b's range excludes zero)

with the dual rules giving lower bounds on .min().

One can apply such rules recursively to push max/min references down to column and literal references. That then gives us expressions that can be used in tandem with row group statistics.
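
Sketched in Python over a hypothetical tuple tree (not libcudf's AST), such a bound-propagation pass might look like this; multiplication and division take all four corner products so the bounds stay sound for signed values, and division assumes the divisor's range excludes zero:

```python
def bounds(expr, stats):
    """Return (lo, hi) bounds for expr, given per-column (min, max) stats.

    expr is ("col", name), ("lit", v), or (op, lhs, rhs)
    with op in {"+", "-", "*", "/"}.
    """
    kind = expr[0]
    if kind == "col":
        return stats[expr[1]]
    if kind == "lit":
        return (expr[1], expr[1])
    op, lhs, rhs = expr
    (alo, ahi), (blo, bhi) = bounds(lhs, stats), bounds(rhs, stats)
    if op == "+":
        return (alo + blo, ahi + bhi)
    if op == "-":
        return (alo - bhi, ahi - blo)
    if op == "*":
        corners = [alo * blo, alo * bhi, ahi * blo, ahi * bhi]
        return (min(corners), max(corners))
    if op == "/":  # assumes 0 is not in [blo, bhi]
        corners = [alo / blo, alo / bhi, ahi / blo, ahi / bhi]
        return (min(corners), max(corners))
    raise ValueError(op)

def may_satisfy_lt(lexpr, rexpr, stats):
    # lexpr < rexpr can hold somewhere in the row group only if
    # lexpr.min() < rexpr.max(); otherwise the group can be discarded.
    return bounds(lexpr, stats)[0] < bounds(rexpr, stats)[1]
```

For the a + b < c example, may_satisfy_lt(("+", ("col", "a"), ("col", "b")), ("col", "c"), stats) reduces to a_min + b_min < c_max, the keep condition discussed above.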

I think that's much less likely to be important than just picking out all the pieces of an expression that compare a column with a literal, though.