Open isvoboda opened 3 weeks ago
same as https://github.com/pola-rs/polars/issues/15323
The statistics are written as a signed int and since it's bigger than the INT64 limit the statistics overflow
<pyarrow._parquet.Statistics object at 0x7f2abffd0bd0>
has_min_max: True
min: 9223539763183779054
max: 9222809258525037712
null_count: 0
distinct_count: None
num_values: 262615
physical_type: INT64
logical_type: Int(bitWidth=64, isSigned=false)
converted_type (legacy): UINT_64
note how the min is bigger than the max.
Essentially what's happening is that when you use eq
it skips all the row groups based on statistics but is_in
doesn't do partition pruning and that's why it returns results. You could also turn off optimizations in the collect under the eq
case.
@deanm0000, I apologize for the duplicate issue, and I appreciate you identifying the cause.
Will take a look
Will re-open this issue - the fix by https://github.com/pola-rs/polars/pull/16766 ensures we no longer write out parquet files with incorrect UInt64 min/max statistics, but the OP here gives an example that has more to do with reading an existing parquet file containing incorrect statistics. I've changed this from a bug to enhancement request as there isn't really a bug in the polars parquet reader, but rather the issue is in the parquet file itself.
Thanks @isvoboda for reporting this as well, I've edited your post to better highlight the underlying issue and use a smaller file.
Description
Some parquet files may contain incorrectly calculated statistics (e.g. some of the ones written by older versions of polars containing UInt64 statistics had incorrect min/max). Because we assume the statistics are correct, using some functions (e.g.
is_in
) withscan_parquet
would return incorrect results if the statistics were not correct.Reproducible example
For the below example we have a parquet file with incorrect min/max statistics (observe how
min (9223372036854775808) > max (0)
.Log output
Original post [Scanning parquet for a UInt64 value does not work for equality]
### Checks - [X] I have checked that this issue has not already been reported. - [X] I have confirmed this bug exists on the [latest version](https://pypi.org/project/polars/) of Polars. ### Reproducible example ```python import polars as pl # Does not find ( pl.scan_parquet("data.parquet") .filter( pl.col("value").eq(pl.lit(16287934034053521795, dtype=pl.UInt64())) ).collect() ) shape: (0, 1) ┌───────┐ │ value │ │ --- │ │ u64 │ ╞═══════╡ └───────┘ # Does find ( pl.scan_parquet("data.parquet") .filter( pl.col("value").is_in(pl.lit(16287934034053521795, dtype=pl.UInt64()), ) ) .collect() ) shape: (18, 1) ┌──────────────────────┐ │ value │ │ --- │ │ u64 │ ╞══════════════════════╡ │ 16287934034053521795 │ │ 16287934034053521795 │ │ 16287934034053521795 │ │ 16287934034053521795 │ │ 16287934034053521795 │ │ … │ │ 16287934034053521795 │ │ 16287934034053521795 │ │ 16287934034053521795 │ │ 16287934034053521795 │ │ 16287934034053521795 │ └──────────────────────┘ ``` ### Log output ```shell Multiple lines with: parquet file can be skipped, the statistics were sufficient to apply the predicate. ``` ### Issue description Scanning parquet file based on `eq` on a `UInt64` column won't find some value. Sample parquet file: https://1drv.ms/u/s!AiNNar540QGDhKI7u7VsGoHG0WV3CQ?e=7r9XbH ### Expected behavior Filtering parquet file based on `eq` on `UInt64` column works with same result as equivalent based on `is_in`. ### Installed versions