Open · wence- opened this issue 1 month ago
@wence- Thank you for mapping out this request.
What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR, does `test_evaluation` contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.

Picking up on the additional feature request, I believe we added support for `is_null` in https://github.com/rapidsai/cudf/pull/13145. Is something not working correctly?
That's supported in ASTs, but not in the statistics filter (e.g. discarding a row group where the filter is `column.is_not_null()` because the row group's null count is equal to the number of rows in the group).
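For concreteness, a minimal sketch of how per-group null counts could serve an `is_not_null` filter; the names here (`RowGroupStats`, `may_contain_matches`) are hypothetical, not the libcudf implementation:

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    num_rows: int
    null_count: int  # null count of the filtered column in this group

def may_contain_matches(stats: RowGroupStats) -> bool:
    """True unless every value in the group is null, in which case no
    row can satisfy column.is_not_null() and the group can be skipped."""
    return stats.null_count < stats.num_rows

# The all-null group is discarded; the second group must still be read.
groups = [RowGroupStats(num_rows=100, null_count=100),
          RowGroupStats(num_rows=100, null_count=3)]
surviving = [g for g in groups if may_contain_matches(g)]
assert len(surviving) == 1
```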
> What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR, does `test_evaluation` contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.
Here are some examples, in pseudo-code; I can write things out in C++ once we converge on something that makes sense:
```
expr = (10 < a)
```

This would be supported by the statistics reader if the user had instead written:

```
expr = (a > 10)
```
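That rewrite is mechanical: swap the operands and mirror the comparison operator. A minimal sketch on toy stand-ins for the AST node types (all names here are hypothetical, not the libcudf API):

```python
from dataclasses import dataclass

@dataclass
class ColumnRef:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class Compare:
    op: str  # one of "<", "<=", ">", ">=", "==", "!="
    left: object
    right: object

# Each operator paired with its mirror under operand swap.
MIRROR = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "==": "==", "!=": "!="}

def normalise(expr: Compare) -> Compare:
    """Rewrite literal-on-left comparisons into column-on-left form."""
    if isinstance(expr.left, Literal) and isinstance(expr.right, ColumnRef):
        return Compare(MIRROR[expr.op], expr.right, expr.left)
    return expr

# (10 < a) becomes (a > 10)
expr = Compare("<", Literal(10), ColumnRef("a"))
assert normalise(expr) == Compare(">", ColumnRef("a"), Literal(10))
```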
Something more complicated:

```
expr = (a < 10) or ((a + b < c) and (b < 1))
```

Here two parts of the expression, `a < 10` and `b < 1`, can discard row groups, but the middle part, `a + b < c`, is probably only worthwhile applying as a post-filter (we could support some row group filtering by checking `a_min + b_min < c_max` and discarding the group when it fails), and is currently unsupported by the statistics row group filtering. That is, I can discard row groups using `a < 10` or `b < 1` and then apply the full expression as a post-filter:

```
expr_stat = (a < 10) or (b < 1)
expr = (a < 10) or ((a + b < c) and (b < 1))
df = read_parquet(..., expr_stat).filter(expr)
```
For arbitrary expressions, determining which bits can be applied as statistics filters is programmatically achievable by converting the input expression to one of DNF or CNF and then taking those terms which the reader supports.
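As one concrete shape this could take (a sketch, not an existing cudf API): assuming negations have already been pushed into the leaves, so the tree is pure and/or over atomic predicates, one can even avoid the full normal-form conversion by replacing each unsupported leaf with an always-true atom. Because and/or are monotone, the weakened expression is implied by the original filter, so it can only keep extra row groups, never wrongly discard one. The node types are toy stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Atom:
    text: str
    supported: bool  # does the statistics reader understand this leaf?

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

TRUE = Atom("true", True)  # sentinel for "always true"

def weaken(expr):
    """Return a weaker expression built only from supported leaves."""
    if isinstance(expr, Atom):
        return expr if expr.supported else TRUE
    lhs, rhs = weaken(expr.left), weaken(expr.right)
    if isinstance(expr, And):
        # "x and true" simplifies to "x"
        if lhs is TRUE:
            return rhs
        if rhs is TRUE:
            return lhs
        return And(lhs, rhs)
    # "x or true" simplifies to "true"
    if lhs is TRUE or rhs is TRUE:
        return TRUE
    return Or(lhs, rhs)

# (a < 10) or ((a + b < c) and (b < 1))  -->  (a < 10) or (b < 1)
expr = Or(Atom("a < 10", True),
          And(Atom("a + b < c", False), Atom("b < 1", True)))
assert weaken(expr) == Or(Atom("a < 10", True), Atom("b < 1", True))
```

Applied to the example above, this recovers exactly `expr_stat = (a < 10) or (b < 1)`.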
> but the middle part, `a + b < c`, is probably only worthwhile applying as a post-filter (we could support some row group filtering by checking `a_min + b_min < c_max` and discarding the group when it fails)
For inequality comparisons, one could imagine writing an AST visitor that can handle a larger number of such expressions. To take `<` as an example, `lexpr < rexpr` as a statistics filter needs to turn into `lexpr.min() < rexpr.max()`: a row group can only contain a matching row if that holds, so a group where it fails can be discarded. One can write distribution rules for `max` and `min` over binary operators (at least some of them), but the bounds get pessimistic quite quickly. For example, for addition, subtraction, multiplication, and division we have:
```
(a + b).max() <= a.max() + b.max()
(a - b).max() <= a.max() - b.min()
(a * b).max() <= a.max() * b.max()
(a / b).max() <= a.max() / b.min()
```

(As written, the rules for `*` and `/` assume non-negative values, with the divisor bounded away from zero; in general one has to take the maximum over the products of the interval endpoints.)
One can apply such rules recursively to push `max`/`min` references down to column/literal references. That then gives us expressions that can be used in tandem with row group statistics. I think that's much less likely to be important than just picking out all the pieces of an expression that compare a column with a literal, though.
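For concreteness, a minimal sketch of that recursive pushdown over a tiny tuple-based expression form; the `("col", name)`/`("lit", value)` encoding, the `stats` dict, and the function names are all hypothetical, and the `*`/`/` rules assume non-negative values (positive divisor) as noted above:

```python
def upper(expr, stats):
    """Pessimistic upper bound for expr over one row group."""
    kind = expr[0]
    if kind == "col":                 # ("col", name): use the column max
        return stats[expr[1]]["max"]
    if kind == "lit":                 # ("lit", value): the value itself
        return expr[1]
    op, a, b = expr                   # (op, left, right)
    if op == "+":
        return upper(a, stats) + upper(b, stats)
    if op == "-":
        return upper(a, stats) - lower(b, stats)
    if op == "*":                     # assumes non-negative a and b
        return upper(a, stats) * upper(b, stats)
    if op == "/":                     # assumes positive b
        return upper(a, stats) / lower(b, stats)
    raise NotImplementedError(op)

def lower(expr, stats):
    """Pessimistic lower bound for expr over one row group."""
    kind = expr[0]
    if kind == "col":
        return stats[expr[1]]["min"]
    if kind == "lit":
        return expr[1]
    op, a, b = expr
    if op == "+":
        return lower(a, stats) + lower(b, stats)
    if op == "-":
        return lower(a, stats) - upper(b, stats)
    if op == "*":                     # assumes non-negative a and b
        return lower(a, stats) * lower(b, stats)
    if op == "/":                     # assumes positive b
        return lower(a, stats) / upper(b, stats)
    raise NotImplementedError(op)

# a + b < c: keep the group only if lower(a + b) < upper(c) can hold.
stats = {"a": {"min": 0, "max": 4},
         "b": {"min": 1, "max": 5},
         "c": {"min": 20, "max": 30}}
lhs = ("+", ("col", "a"), ("col", "b"))
keep = lower(lhs, stats) < upper(("col", "c"), stats)  # 1 < 30: keep
```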
**Is your feature request related to a problem? Please describe.**
In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics; however, they can still be applied in a post-filtering stage.

If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my translation exactly which AST expressions the parquet reader supports in its statistics filters, and only deliver the filter to the parquet reader if it is one that is understood.
For example, `column_name_reference("a") < literal(...)` is a supported expression, but `literal(...) > column_name_reference("a")` is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both forms, I'd now be doing unnecessary work. This is suboptimal in two ways:

1. The knowledge of which expressions the statistics filter supports is encoded in two places (in libcudf and in my translation layer), and the two can drift apart.
2. An expression that is only partially supported cannot be used for row group filtering at all, even though its supported pieces could still discard row groups.
**Describe the solution you'd like**

A mechanism for transforming an arbitrary filter expression into the (possibly weaker) expression that the statistics reader supports, with the remainder applied as a post-filter, so that knowledge of what the reader supports lives in one place.
**Describe alternatives you've considered**
For point one, I can do what I'm doing right now and just bail when I hit a feature I've determined to be unsupported.
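Concretely, that amounts to a support predicate walked over the expression before handing it to the reader; a sketch (the supported shape below is an assumption for illustration, not libcudf's actual supported set):

```python
from dataclasses import dataclass

# Toy node types mirroring the earlier sketches.
@dataclass
class ColumnRef:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class Compare:
    op: str
    left: object
    right: object

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

def supported_by_stats_filter(expr) -> bool:
    """Decide whether to hand the filter to the parquet reader at all.

    This hard-codes, in the translation layer, knowledge that really
    lives inside libcudf's reader, which is exactly the duplication
    problem described above.
    """
    if isinstance(expr, (And, Or)):
        return (supported_by_stats_filter(expr.left)
                and supported_by_stats_filter(expr.right))
    if isinstance(expr, Compare):
        # Assumed supported shape: column on the left, literal on the right.
        return (isinstance(expr.left, ColumnRef)
                and isinstance(expr.right, Literal))
    return False  # anything else: bail and skip pushdown entirely
```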
For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.
**Additional context**
What I'm doing now: https://github.com/rapidsai/cudf/pull/17141
Additional feature request: support filtering row groups based on nulls, i.e. support `is_null(column_name_reference(...))` in the statistics reader.