pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Stack overflow on join on high-CPU-count machine #14637

Open · DylanZA opened this issue 7 months ago

Reproducible example

```python
import polars as pl

NIDS = 500
NAXISA = 1000
NVALS = 100
NAXISB = 5000

ids = pl.Series("id", range(NIDS), pl.Int64)
vals = pl.Series("val", range(NVALS), pl.Int64)
extraa = pl.Series("axis", range(NAXISA), pl.Int64)
extrab = pl.Series("axis", range(NAXISB), pl.Int64)

# a: 500 * 1000 * 100 = 50,000,000 rows; b: 500 * 5000 = 2,500,000 rows
a = pl.DataFrame(ids).join(pl.DataFrame(extraa), how="cross").join(pl.DataFrame(vals), how="cross")
b = pl.DataFrame(ids).join(pl.DataFrame(extrab), how="cross")

print("a={:,} b={:,}".format(len(a), len(b)))

# Repeatedly inner join on (id, axis); the crash appears after a few iterations.
for i in range(30):
    print(i)
    a.join(b, on=["id", "axis"])
```

Log output

```
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
'a=50,000,000 b=2,500,000'
0
join parallel: true
INNER join dataframes finished
1
join parallel: true
INNER join dataframes finished
2
join parallel: true
INNER join dataframes finished
3
join parallel: true
INNER join dataframes finished
4
join parallel: true
INNER join dataframes finished
5
join parallel: true
INNER join dataframes finished
6
join parallel: true
INNER join dataframes finished
7
join parallel: true
INNER join dataframes finished
8
join parallel: true
Segmentation fault (core dumped)
```

Issue description

Joining two fairly large dataframes (50 million and 2.5 million rows) occasionally hits a stack overflow somewhere deep in rayon.

Note this seems to happen only on my machine with 128 logical cores, not on my smaller machine.
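
The core-count dependence can be probed by pinning the thread pool size down before import; a minimal sketch, assuming the pool size (rather than the hardware itself) is the trigger, with 16 chosen arbitrarily to mimic the smaller machine:

```python
import os

# Must be set before importing polars; the rayon thread pool is sized
# once at startup and cannot be resized afterwards.
os.environ["POLARS_MAX_THREADS"] = "16"

import polars as pl

print(pl.thread_pool_size())  # confirm the cap actually took effect
```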

Expected behavior

It does not crash

Installed versions

```
--------Version info---------
Polars:              0.20.10
Index type:          UInt32
Platform:            Linux-6.5.0-14-generic-x86_64-with-glibc2.35
Python:              3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:          3.8.1
numpy:               1.26.1
openpyxl:
pandas:              2.1.2
pyarrow:             14.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 7 months ago

Hmm.. Could you see in GDB where you are at that point? Maybe make a debug compilation. (I am on vacation and have poor internet, let alone a 128 core machine. :))

DylanZA commented 7 months ago

> Hmm.. Could you see in GDB where you are at that point? Maybe make a debug compilation. (I am on vacation and have poor internet, let alone a 128 core machine. :))

Here is a full GDB thread backtrace, compressed since it's 15 MB uncompressed:

gdb.txt.gz
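
A GDB session along these lines captures that kind of per-thread dump (a sketch; the `<pid>` placeholder is whatever Python process is running the repro, and `gdb.txt` is an arbitrary log path):

```
$ gdb -p <pid>
(gdb) set pagination off
(gdb) set logging file gdb.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off
```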

DylanZA commented 6 months ago

FWIW, you can replicate the behaviour by setting RUST_MIN_STACK. On my 16-core machine I need to use 100000.
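
A minimal sketch of that reproduction, assuming RUST_MIN_STACK must be in the environment before polars spawns its worker threads (i.e. before the import):

```python
import os

# Shrink the stacks of the Rust worker threads to ~100 kB (the value
# quoted above). Must be set before polars creates its thread pool.
os.environ["RUST_MIN_STACK"] = "100000"

import polars as pl

# ...then run the reproducible example from the top of this issue; the
# overflow should now be reachable with far fewer cores.
```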