The `filters=` argument to the `read_parquet` and `read_orc` functions has a misleading name. Because it filters by row group rather than by row, I think we should rename it to `row_group_filters=`.
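To make the row-group point concrete, here is a minimal pure-Python sketch. It does not call the real readers; `row_groups` and `read_with_row_group_filter` are hypothetical names, and the "statistic" is simply the maximum of each group, standing in for the per-row-group metadata the docstring mentions. It mimics a predicate like `('x', '>', 5)`:

```python
# Hypothetical stand-in for a file with three row groups of a column "x".
row_groups = [
    [1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10],
]


def read_with_row_group_filter(groups, threshold):
    """Keep whole row groups whose max statistic could satisfy x > threshold."""
    kept = []
    for group in groups:
        # Only the group-level statistic is consulted, never individual rows.
        if max(group) > threshold:
            kept.extend(group)
    return kept


result = read_with_row_group_filter(row_groups, 5)
print(result)
```

The first group is skipped entirely, but the second group is returned whole, including the values 4 and 5 that do not satisfy `x > 5`. A user expecting row-level filtering would be surprised by those rows, which is why the current name seems misleading.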
Additionally, the description for this kwarg is complicated and laden with jargon. I think we should simplify it and provide some examples that clarify how to use this keyword:
If not None, specifies a filter predicate used to filter out row groups
using statistics stored for each row group as Parquet metadata. Row groups
that do not match the given filter predicate are not read. The
predicate is expressed in disjunctive normal form (DNF) like
`[[('x', '=', 0), ...], ...]`. DNF allows arbitrary boolean logical
combinations of single column predicates. The innermost tuples each
describe a single column predicate. The list of inner predicates is
interpreted as a conjunction (AND), forming a more selective and
multiple column predicate. Finally, the outermost list combines
these filters as a disjunction (OR). Predicates may also be passed
as a list of tuples. This form is interpreted as a single conjunction.
To express OR in predicates, one must use the (preferred) notation of
list of lists of tuples.
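As a starting point for simpler documentation, the DNF rules described above could be illustrated along these lines. This is only a sketch of the boolean logic, not the library's implementation: `OPS`, `normalize`, and `matches` are hypothetical helpers, the operator set is an assumption, and the predicate is evaluated against a single dict of values for readability, whereas the real readers apply it to row-group statistics.

```python
import operator

# Assumed operator spellings; the real readers may accept a different set.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def normalize(filters):
    """Treat a flat list of tuples as a single conjunction (AND)."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def matches(filters, values):
    """Outer list is OR; each inner list is AND; each tuple is (column, op, value)."""
    return any(
        all(OPS[op](values[col], val) for col, op, val in conjunction)
        for conjunction in normalize(filters)
    )


# (x == 0 AND y > 5) OR (x == 1)
dnf = [[("x", "=", 0), ("y", ">", 5)], [("x", "=", 1)]]

# Flat form: x == 0 AND y > 5
flat = [("x", "=", 0), ("y", ">", 5)]

print(matches(dnf, {"x": 0, "y": 10}))
print(matches(dnf, {"x": 1, "y": 1}))
print(matches(flat, {"x": 1, "y": 10}))
```

Showing one nested example and one flat example side by side, with the equivalent boolean expression in a comment, may be enough to replace most of the DNF explanation in the docstring.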