pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Large hike in memory for .unique on array fields #16857

Open arogozhnikov opened 4 months ago

arogozhnikov commented 4 months ago

Reproducible example

```python
import resource

import numpy as np
import polars as pl


def max_usage_mb():
    # ru_maxrss is the process' peak resident set size; on Linux it is reported in KB.
    maximal = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'{maximal / 1000: 7.1f}', 'MB')


data = np.random.randint(0, 255, [128_000, 256], dtype='uint8')
## uncomment the following line instead to compare with plain bytes
# data = [row.tobytes() for row in np.random.randint(0, 255, [128_000, 256], dtype='uint8')]

max_usage_mb()
df = pl.DataFrame(dict(x=data))
max_usage_mb()
df.unique()
max_usage_mb()
```

Log output

bytes field:

```
  340.6 MB
  340.6 MB
  340.6 MB
```

array of uint8 field:

```
  202.0 MB
  225.7 MB
 3536.0 MB  # note this unreasonable 10x hike in memory
```

Issue description

For no obvious reason, array fields demand far too much memory. I guess the underlying issue is that deduplication does not rely on hashing (as it should).

In the repro I compare against plain bytes, which hold the same data and show the flat memory footprint that `.unique()` on arrays should also have.
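
To illustrate what I mean, here is a minimal sketch (not Polars' actual code path) of hash-based deduplication: hash each fixed-size row to a single 64-bit value and deduplicate on the hashes, so the overhead is roughly one u64 per row (~1 MB here) instead of gigabytes.

```python
import numpy as np

data = np.random.randint(0, 255, [128_000, 256], dtype='uint8')

# One 64-bit hash per row (~1 MB total); Python's hash() over the raw
# bytes stands in for a proper vectorized hasher.
row_hashes = np.fromiter(
    (hash(row.tobytes()) for row in data), dtype=np.int64, count=len(data)
)

# Keep the first occurrence of each hash. A real implementation would
# also verify row equality to guard against hash collisions.
_, first_idx = np.unique(row_hashes, return_index=True)
unique_rows = data[np.sort(first_idx)]
print(unique_rows.shape)
```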

Expected behavior

Memory consumption of the two cases is identical (as is speed).
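
As a possible stopgap, one can deduplicate on a per-row hash explicitly. This is a sketch only: it assumes `pl.Expr.hash` produces one u64 per row for an Array column in this Polars version, and a hash collision could silently drop a distinct row.

```python
import polars as pl

# df is the DataFrame from the repro above; dedup on the hash column
# instead of on the array column itself, then drop the helper column.
deduped = (
    df.with_columns(h=pl.col("x").hash())
    .unique(subset="h")
    .drop("h")
)
```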

Installed versions

```
--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             2.6.4
pyiceberg:
pyxlsb:
sqlalchemy:
torch:                2.3.0+cu121
xlsx2csv:
xlsxwriter:
```
Chuck321123 commented 4 months ago

Can confirm the spike in the unique() function. [screenshot: "Unique spike"]

orlp commented 1 week ago

The underlying issue is that polars eagerly hashes the whole array, and that this array of hashes is 8x bigger than the input u8 array.
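
For scale, plugging in the numbers from the repro (assuming one 64-bit hash per u8 element):

```python
rows, width = 128_000, 256
input_mb = rows * width * 1 / 1e6   # u8 input:   ~32.8 MB
hashes_mb = rows * width * 8 / 1e6  # u64 hashes: ~262.1 MB, i.e. 8x the input
print(input_mb, hashes_mb)          # 32.768 262.144
```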