pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`extract_many` + `overlapping=True` produces different result when `streaming=True` #19260

Open cmdlineluser opened 1 month ago

cmdlineluser commented 1 month ago

Reproducible example

```python
import polars as pl

df = pl.LazyFrame({"x": ["THE thesis", "sis"], "y": ["the", "thesis"]})

q = df.select(
    pl.col.x.str.extract_many(pl.col.y, ascii_case_insensitive=True, overlapping=True)
)

q.collect(streaming=True)
# shape: (2, 1)
# ┌────────────────┐
# │ x              │
# │ ---            │
# │ list[str]      │
# ╞════════════════╡
# │ ["THE", "the"] │
# │ []             │
# └────────────────┘
```

Log output

No response

Issue description

With `streaming=False`, the output is as expected.

```python
q.collect()
# shape: (2, 1)
# ┌──────────────────────────┐
# │ x                        │
# │ ---                      │
# │ list[str]                │
# ╞══════════════════════════╡
# │ ["THE", "the", "thesis"] │
# │ []                       │
# └──────────────────────────┘
```

Expected behavior

The same result as with `streaming=False`.

Installed versions

```
--------Version info---------
Polars: 1.9.0
Index type: UInt32
Platform: macOS-13.6.1-arm64-arm-64bit
Python: 3.12.2 (main, Feb 6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy: 1.26.4
openpyxl:
pandas: 2.2.1
pyarrow: 15.0.2
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

orlp commented 1 month ago

I fail to see why ["THE", "the", "thesis"] is the expected output when matching "THE thesis" with "the". Surely it's the streaming engine that's correct and the eager engine that's wrong?
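
The two outputs in the report correspond to two possible readings of the `pl.col.y` argument: either every value in `y` serves as a pattern for every row of `x`, or each row of `x` is matched only against the `y` value in the same row. The plain-Python sketch below illustrates both readings. It is not Polars' implementation and makes no claim about what either engine does internally; `overlapping_matches` is a hypothetical helper written only for this illustration.

```python
def overlapping_matches(text, patterns):
    """Return every case-insensitive occurrence of any pattern, overlaps included."""
    lowered = text.lower()
    found = []
    for start in range(len(lowered)):
        for pat in patterns:
            # Skip empty patterns; compare case-insensitively at this offset.
            if pat and lowered.startswith(pat.lower(), start):
                # Keep the original casing of the matched slice.
                found.append(text[start:start + len(pat)])
    return found


x = ["THE thesis", "sis"]
y = ["the", "thesis"]

# Reading 1 (assumption): the whole `y` column is the pattern set for every row.
whole_column = [overlapping_matches(s, y) for s in x]

# Reading 2 (assumption): each row of `x` is matched against its own `y` value.
row_by_row = [overlapping_matches(s, [p]) for s, p in zip(x, y)]

print(whole_column)  # [['THE', 'the', 'thesis'], []]
print(row_by_row)    # [['THE', 'the'], []]
```

Under the first reading the sketch yields the same lists as the `q.collect()` output above, and under the second the same lists as `q.collect(streaming=True)`.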