[FEA] Rename `filters=` argument to `row_group_filters=` in `read_parquet` and `read_orc` and provide examples that show its use

shwina commented 1 year ago

The `filters=` argument to `read_parquet` and `read_orc` has a misleading name. Because it filters by row group rather than by row, I think we should rename it to `row_group_filters=`.

Additionally, the description for this kwarg is complicated and laden with jargon. I think we should simplify it and provide some examples that clarify its use:

    If not None, specifies a filter predicate used to filter out row groups
    using statistics stored for each row group as Parquet metadata. Row groups
    that do not match the given filter predicate are not read. The
    predicate is expressed in disjunctive normal form (DNF) like
    `[[('x', '=', 0), ...], ...]`. DNF allows arbitrary boolean logical
    combinations of single column predicates. The innermost tuples each
    describe a single column predicate. The list of inner predicates is
    interpreted as a conjunction (AND), forming a more selective and
    multiple column predicate. Finally, the outermost list combines
    these filters as a disjunction (OR). Predicates may also be passed
    as a list of tuples. This form is interpreted as a single conjunction.
    To express OR in predicates, one must use the (preferred) notation of
    list of lists of tuples
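
To make the semantics in that docstring concrete, here is a rough pure-Python sketch of row-group pruning with a DNF filter. It is only a model for illustration, not cuDF's implementation: the helper names (`predicate_may_match`, `normalize`, `row_group_may_match`) and the `row_groups` statistics are hypothetical, and each row group is assumed to expose only per-column `(min, max)` statistics.

```python
# Illustrative model only; not cuDF code. All names below are hypothetical.
# Each row group is summarized by per-column (min, max) statistics.

def predicate_may_match(stats, predicate):
    """Return True if a row group with these stats could contain a matching row."""
    column, op, value = predicate
    lo, hi = stats[column]
    if op in ("=", "=="):
        return lo <= value <= hi
    if op == "!=":
        # Only prunable when every row in the group equals `value`.
        return not (lo == hi == value)
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    raise ValueError(f"unsupported operator: {op!r}")


def normalize(filters):
    """Treat a flat list of tuples as a single AND group (list of lists)."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def row_group_may_match(stats, filters):
    """Outer list is OR; each inner list is AND."""
    return any(
        all(predicate_may_match(stats, p) for p in conjunction)
        for conjunction in normalize(filters)
    )


# Hypothetical statistics for three row groups.
row_groups = [
    {"x": (0, 9), "y": (100, 199)},
    {"x": (10, 19), "y": (200, 299)},
    {"x": (20, 29), "y": (300, 399)},
]

# (x == 5) OR (x > 15 AND y < 250)
filters = [[("x", "==", 5)], [("x", ">", 15), ("y", "<", 250)]]

kept = [i for i, rg in enumerate(row_groups) if row_group_may_match(rg, filters)]
print(kept)  # [0, 1]
```

Row group 0 is kept in full because its `x` range contains 5, so rows with `x != 5` come back as well. That is the behavior the name `filters=` hides. In this model, passing the flat list `[("x", ">", 15), ("y", "<", 250)]` is treated as a single AND group and keeps only row group 1.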

GregoryKimball commented 1 year ago

I would like to close this in favor of #12512.

rapidsai / cudf

[FEA] Rename `filters=` argument to `row_group_filters=` in `read_parquet` and `read_orc` and provide examples that show its use #13370