I found that slice pushdown is not applied when combined with filtering. In the example below, the slice is performed after the parquet scan whenever a filter is present; both the query plan and the running time reflect this.
With both slice and predicate pushdown optimization, the query should take roughly the same amount of time as `head(3)` alone. Instead it takes about 460 times longer.
Generate data:

```python
import numpy as np
import polars as pl

np.random.seed(43)
pl.DataFrame(
    {'a': np.random.randint(0, 1_000_000, size=500_000_000)}
).write_parquet('data.parquet')
```
Examples:

```python
import polars as pl
from codetiming import Timer

df = pl.scan_parquet('data.parquet')

with Timer(initial_text="\n## head(3)\n"):
    print(df.head(3).explain())
    _ = df.head(3).collect()

with Timer(initial_text="\n## filter\n"):
    df_filtered = df.filter(pl.col('a') > 2)
    print(df_filtered.explain())
    _ = df_filtered.collect()

with Timer(initial_text="\n## filter and head(3)\n"):
    df_filtered_head = df_filtered.head(3)
    print(df_filtered_head.explain())
    _ = df_filtered_head.collect()
```
What you're looking for is early stopping, which is something the new streaming engine will support.
Great. That is exactly what I'm looking for. Related question: will the new streaming engine make `filter().slice(offset, length)` skip ahead to `offset` without keeping all `offset` rows in memory? I.e., memory would scale with `length`.
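As a plain-Python analogy (not Polars code, just a sketch of the streaming behavior being asked about): chaining a lazy filter into `itertools.islice` never materializes the `offset` skipped rows, so memory scales with `length`, but the predicate still has to be evaluated on every row up to the slice location:

```python
from itertools import islice

def rows():
    # Lazily yield rows, as a streaming scan would.
    yield from range(500_000_000)

offset, length = 5, 3

# Both the generator expression and islice are lazy: no list of
# `offset` rows is ever built, but the predicate still runs on
# every row until `offset + length` matches have been seen.
matched = (r for r in rows() if r > 2)
window = list(islice(matched, offset, offset + length))
print(window)  # -> [8, 9, 10]
```

Only `length` rows are ever held in the result, which is the memory behavior the question is asking the streaming engine to provide.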
It's fundamentally impossible to push a head down past a filter.
Yeah, I see that you cannot do random access (slice) before the predicate has been evaluated, if not fully, then at least until the location of the slice.
This is also not possible, because we don't know how many items were filtered before.
The plan is to eventually split the IO-based slices into `pre-filter-slice` and `post-filter-slice`. That will make quite a few optimizations possible.
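A minimal sketch of the distinction (plain Python; `pre-filter-slice` / `post-filter-slice` are the names from the comment above, not existing Polars API): slicing before the filter bounds how many rows the predicate ever sees, while slicing after bounds how many matching rows are returned:

```python
data = list(range(10))
pred = lambda x: x % 2 == 0  # keep even values

# pre-filter-slice: slice the raw rows first, then filter.
# The predicate only ever runs on the first 5 rows.
pre = [x for x in data[:5] if pred(x)]   # -> [0, 2, 4]

# post-filter-slice: filter everything, then slice the result.
# The predicate runs on all rows; the slice limits the output.
post = [x for x in data if pred(x)][:5]  # -> [0, 2, 4, 6, 8]
```

The two forms generally give different results, which is why an optimizer must track which side of the filter a slice belongs to before pushing it down.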