Closed isharamet closed 1 year ago
Hi @isharamet,
It looks like a bug. There is a GitHub issue created to replace the custom `in` predicate with one recently introduced to Parquet: #272.
And I believe the error is in the `inverseCanDrop` method of `InPredicate`:
    override def inverseCanDrop(statistics: Statistics[T]): Boolean = {
      val compare = statistics.getComparator.compare(_, _)
      val min = statistics.getMin
      val max = statistics.getMax
      val isInRange = (value: T) => compare(value, min) >= 0 && compare(value, max) <= 0
      values.exists(isInRange)
    }
While this range check works for `canDrop`, in the `not in` scenario it drops every block whose min/max range covers a value from the set, even though the block might also contain other values. In the example from my original post, all values are stored in a single block (`min = 1`, `max = 5`), so this block is skipped.
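To make the failure mode concrete, here is a minimal self-contained sketch of the range logic. `BlockStats`, `NotInDropSketch`, and the filter set `Set(1, 2)` are hypothetical stand-ins chosen for illustration; this is not the parquet4s code, only a model of the behaviour described above.

```scala
// Hypothetical, simplified stand-in for the min/max statistics of one block.
final case class BlockStats(min: Int, max: Int)

object NotInDropSketch {
  // Same idea as the quoted check: does any filter value lie within [min, max]?
  def anyValueInRange(values: Set[Int], stats: BlockStats): Boolean =
    values.exists(v => v >= stats.min && v <= stats.max)

  // For `in`, a block can be skipped when no filter value falls into its range.
  def canDropIn(values: Set[Int], stats: BlockStats): Boolean =
    !anyValueInRange(values, stats)

  // Behaviour described in this thread for `not in`: the block is skipped as
  // soon as any filter value falls into its range.
  def buggyCanDropNotIn(values: Set[Int], stats: BlockStats): Boolean =
    anyValueInRange(values, stats)

  // Conservative alternative: min/max alone cannot prove that every row is in
  // the excluded set, so never skip the block.
  def safeCanDropNotIn(values: Set[Int], stats: BlockStats): Boolean =
    false
}
```

With a block of `min = 1`, `max = 5` and an assumed filter `not in (1, 2)`, `buggyCanDropNotIn` returns `true` and the whole block is skipped, although the values 3, 4 and 5 should have been returned.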
Hi @isharamet,
We faced this bug too and ended up implementing our own `NinPredicate`, similar to https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/Filter.scala#L323, with both `canDrop` and `inverseCanDrop` returning `false`. I don't see a way to use the "drop whole block if values not in range" logic, since for `nin` it should rather be "accept whole block if values not in range".
This can be a temporary solution until the `nin` predicate is supported.
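For anyone who wants to see the shape of that workaround, here is a rough self-contained sketch. `Stats`, `BlockPredicate`, and `NinSketch.read` are hypothetical stand-ins that only mimic the `keep` / `canDrop` / `inverseCanDrop` methods of a Parquet user-defined predicate; this is not our actual code nor the parquet4s API, and a real implementation would extend Parquet's own predicate type instead.

```scala
// Hypothetical stand-ins; not the Parquet or parquet4s types.
final case class Stats(min: Int, max: Int)

trait BlockPredicate {
  def keep(value: Int): Boolean             // row-level check
  def canDrop(stats: Stats): Boolean        // may the whole block be skipped?
  def inverseCanDrop(stats: Stats): Boolean // same question for the negated predicate
}

// `not in` predicate that never skips a block based on min/max alone.
final class NinPredicate(excluded: Set[Int]) extends BlockPredicate {
  def keep(value: Int): Boolean = !excluded.contains(value)
  def canDrop(stats: Stats): Boolean = false
  def inverseCanDrop(stats: Stats): Boolean = false
}

object NinSketch {
  // Simulated read: skip blocks the predicate allows dropping, then filter rows.
  def read(blocks: Seq[Seq[Int]], predicate: BlockPredicate): Seq[Int] =
    blocks
      .filter(_.nonEmpty)
      .filterNot(block => predicate.canDrop(Stats(block.min, block.max)))
      .flatMap(_.filter(predicate.keep))
}
```

Calling `NinSketch.read(Seq(Seq(1, 2, 3, 4, 5)), new NinPredicate(Set(1, 2)))` keeps the block and filters row by row. The trade-off is that no block is ever skipped, so the filter is correct but gets no benefit from statistics.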
Fix to be released in 2.11.0
Hey folks,
Not sure if I'm doing something wrong here, but applying a `not in` filter to `ParquetReader.Builder` and then reading the data always results in 0 rows (while it shouldn't). Simple app to reproduce the issue:
Reproducible on the latest `master` and `v2.10.0`. I'll try to dig deeper, but I haven't been able to find the reason for this behaviour yet.