pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.24k stars 1.95k forks source link

SEGFAULT on enormous `when/then/otherwise` expressions due to stackoverflow #12268

Open alexander-beedie opened 1 year ago

alexander-beedie commented 1 year ago

Checks

Reproducible example

The below expansion (while not the original code) replicates the observed error:

import polars as pl

df = pl.DataFrame({
    "x": range(-1000,1000), 
    "y": range( -500,1500),
})

def create_giant_when_then(x: pl.Expr, size: int):
    expr = pl
    for match_, return_ in {i: str(i) for i in range(size)}.items():
        expr = expr.when( x == match_ ).then( pl.lit(return_) )
    return expr.otherwise( None )

df.with_columns(
    x_str = create_giant_when_then( pl.col("x"), size=5000 ),
    y_str = create_giant_when_then( pl.col("y"), size=5000 ),
)

Note; by varying the size of the when/then from merely ridiculously large (5,000) to ludicrously large (10,000) it seems possible to generate slightly different flavours of sefault.

Log output

When size=5000:

Process finished with exit code 138 (interrupted by signal 10: SIGBUS)

When size=10000:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Note, this is on an Apple Silicon M2 Max; may need to vary the size for different architectures.

Issue description

Extremely large expressions can blow up the stack.

Expected behavior

Shouldn't segfault; ideally would either succeed or (when getting into really ridiculous territory!) raise an error.

Installed versions

``` --------Version info--------- Polars: 0.19.12 Index type: UInt32 Platform: macOS-14.1-arm64-arm-64bit Python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:33:12) [Clang 15.0.7 ] ----Optional dependencies---- adbc_driver_sqlite: cloudpickle: connectorx: 0.3.2 deltalake: fsspec: 2023.9.1 gevent: matplotlib: 3.6.3 numpy: 1.23.5 openpyxl: 3.1.2 pandas: 1.5.3 pyarrow: 13.0.0 pydantic: 1.10.13 pyiceberg: pyxlsb: sqlalchemy: 1.4.50 xlsx2csv: 0.8.1 xlsxwriter: 3.1.9```
avimallu commented 1 year ago

Perhaps identical to https://github.com/pola-rs/polars/issues/10173, which was labelled as not-planned?

alexander-beedie commented 1 year ago

Perhaps identical to #10173, which was labelled as not-planned?

It doesn't have to work (expressions of this scale probably indicate some poor choices being made on the user side), but equally it shouldn't segfault ;)

ritchie46 commented 1 year ago

The segfault is what we get in rust when we overflow the stack. There is no runtime check like in python, so that's what gives this scary abort.

It isn't something we can trivially fix though and will remain if we overflow the stack in polars. At some number of expression/node nesting I would expect a stackoverflow. We can investigate this one and see if we can increase the number, but I think there should be a number where we deem it acceptable for maintenance reasons.

evbo commented 12 months ago

Note, on linux by setting RUST_MIN_STACK or by setting .cargo/config.toml with, e.g.:

[build] rustflags = [ "-C", "link-arg=-zstack-size=104857600", ]

You can avoid this issue somewhat, so long as your stacksize is at least bounded. On OSX and other versions of clang the arguments are different but can also be set consistently with env vars, e.g.:

RUST_MIN_STACK=104857600 cargo test  -- --nocapture # no capture is just to execute println! during cargo test