Open cmdlineluser opened 1 month ago
Reproducible example
import polars as pl

df = pl.LazyFrame({"x": ["THE thesis", "sis"], "y": ["the", "thesis"]})

q = df.select(
    pl.col.x.str.extract_many(pl.col.y, ascii_case_insensitive=True, overlapping=True)
)

q.collect(streaming=True)
# shape: (2, 1)
# ┌────────────────┐
# │ x              │
# │ ---            │
# │ list[str]      │
# ╞════════════════╡
# │ ["THE", "the"] │
# │ []             │
# └────────────────┘
Log output
No response
Issue description
With streaming=False the output is as expected.
q.collect()
# shape: (2, 1)
# ┌──────────────────────────┐
# │ x                        │
# │ ---                      │
# │ list[str]                │
# ╞══════════════════════════╡
# │ ["THE", "the", "thesis"] │
# │ []                       │
# └──────────────────────────┘
Expected behavior
The same result as with streaming=False.
I fail to see why ["THE", "the", "thesis"] is the expected output when matching "THE thesis" with "the". Surely it's the streaming engine that's correct and the eager engine that's wrong?
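To make the disagreement concrete, here is a minimal pure-Python sketch (not Polars code) of the two ways the example could be read. The helper `overlapping_matches` is hypothetical, and ordering matches by end position is an assumption chosen only to mirror the outputs shown in the report. It makes no claim about which reading Polars intends.

```python
def overlapping_matches(text: str, patterns: list[str]) -> list[str]:
    """Return every case-insensitive occurrence of any pattern, overlaps included."""
    lowered = text.lower()
    hits = []  # (end, start) pairs, so sorting orders matches by end position
    for pattern in patterns:
        needle = pattern.lower()
        if not needle:
            continue  # skip empty patterns, which would match everywhere
        start = lowered.find(needle)
        while start != -1:
            hits.append((start + len(needle), start))
            # Advance by one character only, so overlapping hits are kept.
            start = lowered.find(needle, start + 1)
    hits.sort()
    return [text[start:end] for end, start in hits]


xs = ["THE thesis", "sis"]
ys = ["the", "thesis"]

# Reading 1: each row of x is matched only against the pattern in the same row of y.
row_wise = [overlapping_matches(x, [y]) for x, y in zip(xs, ys)]

# Reading 2: every row of x is matched against all values of y.
whole_column = [overlapping_matches(x, ys) for x in xs]

print(row_wise)      # [['THE', 'the'], []]
print(whole_column)  # [['THE', 'the', 'thesis'], []]
```

Under these assumptions, the row-wise reading reproduces the streaming=True output and the whole-column reading reproduces the streaming=False output, so the question is which of the two semantics is intended.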