rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Rename `filters=` argument to `row_group_filters=` in `read_parquet` and `read_orc` and provide examples that show its use #13370

Open shwina opened 1 year ago

shwina commented 1 year ago

The name of the `filters=` argument to `read_parquet` and `read_orc` is misleading. Because it filters by row group rather than by row, I think we should rename it to `row_group_filters=`.

Additionally, the description of this kwarg is complicated and laden with jargon. I think we should simplify it and provide some examples that clarify how the keyword is used. The current description reads:

    If not None, specifies a filter predicate used to filter out row groups
    using statistics stored for each row group as Parquet metadata. Row groups
    that do not match the given filter predicate are not read. The
    predicate is expressed in disjunctive normal form (DNF) like
    `[[('x', '=', 0), ...], ...]`. DNF allows arbitrary boolean logical
    combinations of single column predicates. The innermost tuples each
    describe a single column predicate. The list of inner predicates is
    interpreted as a conjunction (AND), forming a more selective and
    multiple column predicate. Finally, the outermost list combines
    these filters as a disjunction (OR). Predicates may also be passed
    as a list of tuples. This form is interpreted as a single conjunction.
    To express OR in predicates, one must use the (preferred) notation of
    list of lists of tuples
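
To illustrate the semantics described in that docstring, here is a minimal pure-Python sketch of how a DNF predicate could be checked against per-row-group min/max statistics. This is not cuDF's implementation: the helper names (`predicate_may_match`, `normalize`, `select_row_groups`), the statistics layout, and the operator subset are all assumptions made for the example.

```python
# Hypothetical illustration only; this is NOT how cuDF implements `filters=`.
# Each row group is described by assumed per-column (min, max) statistics.

def predicate_may_match(stats, predicate):
    """Return True if a row group with these stats could contain a matching row."""
    column, op, value = predicate
    lo, hi = stats[column]
    if op in ("=", "=="):
        return lo <= value <= hi
    if op == "!=":
        # Ruled out only when every value in the group equals `value`.
        return not (lo == hi == value)
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    raise ValueError(f"unsupported operator: {op!r}")


def normalize(filters):
    """Treat a flat list of tuples as a single conjunction (one inner list)."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def select_row_groups(row_group_stats, filters):
    """Return indices of row groups that the DNF filters cannot rule out."""
    dnf = normalize(filters)
    return [
        i
        for i, stats in enumerate(row_group_stats)
        # Outer list: OR over conjunctions. Inner list: AND over predicates.
        if any(all(predicate_may_match(stats, p) for p in conj) for conj in dnf)
    ]


# Made-up statistics for three row groups.
row_group_stats = [
    {"x": (0, 9), "y": (100, 199)},
    {"x": (10, 19), "y": (200, 299)},
    {"x": (20, 29), "y": (300, 399)},
]

# Flat list of tuples: x > 5 AND y < 250
and_selected = select_row_groups(row_group_stats, [("x", ">", 5), ("y", "<", 250)])

# List of lists of tuples: (x == 3) OR (x >= 25)
or_selected = select_row_groups(row_group_stats, [[("x", "==", 3)], [("x", ">=", 25)]])

print(and_selected)  # [0, 1]
print(or_selected)   # [0, 2]
```

Note that row group 0 is selected for `x > 5` even though it may contain rows with `x <= 5`: the whole row group is read or skipped, and individual rows are not filtered. That is the behavior a name like `row_group_filters=` would make explicit.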

shwina commented 1 year ago

Related: https://github.com/rapidsai/cudf/pull/13334

GregoryKimball commented 1 year ago

I would like to close this in favor of #12512