pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

unique() and n_unique() cause large memory spikes in Polars #19411

Open Chuck321123 opened 1 month ago

Chuck321123 commented 1 month ago


Reproducible example

Polars memory usage: [screenshot: "Polars unique"]

Pandas memory usage: [screenshot: "Pandas unique"]
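
Since the original screenshots are not preserved here, the following is a minimal sketch of a comparable reproduction. The column name, data sizes, and the use of psutil for RSS measurement are my own assumptions, not the reporter's script:

```python
# Hypothetical reproduction: build a large integer column with heavy
# duplication and watch process RSS around a unique() call.
# Requires numpy and psutil alongside polars.
import numpy as np
import polars as pl
import psutil


def rss_mib() -> float:
    """Resident set size of the current process, in MiB."""
    return psutil.Process().memory_info().rss / 2**20


df = pl.DataFrame({"a": np.random.randint(0, 1_000_000, size=50_000_000)})

print(f"RSS before unique(): {rss_mib():.0f} MiB")
uniques = df.unique()
print(f"RSS after unique():  {rss_mib():.0f} MiB")
print(f"n_unique: {df['a'].n_unique()}")
```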

Log output

No response

Issue description

This has been mentioned before, as described in #16857, but I am writing it up once more to bring it into the spotlight. There are memory problems with unique() and n_unique(), especially compared to pandas. If you search "unique" in the open issues, there are a lot of crashes, freezes, runs taking far too long, and out-of-memory problems, most of them related to the unique() function. Finding a way to reduce the RAM usage of unique() and n_unique() could therefore fix many of those cases. If you feel the need to close this in favor of #16857, that is also alright.

Expected behavior

That the unique() and n_unique() functions don't cause any memory spikes.

Installed versions

```
--------Version info---------
Polars:              1.8.2
Index type:          UInt32
Platform:            Windows-11-10.0.22631-SP0
Python:              3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle          3.0.0
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.0.2
openpyxl             3.1.5
pandas               2.2.3
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```
orlp commented 1 month ago

Polars pre-computes the hashes for the input data, which is why the memory usage roughly doubles. But this is just temporary; you'll notice that if you call unique multiple times, memory doesn't increase further.
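
A quick way to check that claim (a sketch, reusing the same kind of synthetic data as the reproduction above; the DataFrame and sizes are illustrative):

```python
# Repeated unique() calls should plateau: the pre-computed hashes are a
# temporary allocation that gets reused or freed, not a per-call leak.
import numpy as np
import polars as pl
import psutil

df = pl.DataFrame({"a": np.random.randint(0, 1_000_000, size=50_000_000)})
for i in range(3):
    df.unique()
    rss = psutil.Process().memory_info().rss / 2**20
    print(f"RSS after unique() call {i + 1}: {rss:.0f} MiB")
```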

Using streaming should also reduce the amount of extra memory used as it doesn't pre-compute everything in one batch.
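
For example (a sketch of the streaming route as of Polars 1.x; the input file and column name are placeholders):

```python
# Going through the lazy API with collect(streaming=True) runs the (old)
# streaming engine, which processes the data in batches instead of
# pre-computing hashes for the full column in one go.
import polars as pl

lf = pl.scan_parquet("data.parquet")  # placeholder input
uniques = lf.unique().collect(streaming=True)
counts = lf.select(pl.col("a").n_unique()).collect(streaming=True)
```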

ritchie46 commented 3 weeks ago

> That the unique() and n_unique() functions don't cause any memory spikes.

You need space to compute an algorithm; pre-computing the hashes is a reasonable time-memory trade-off.

I think we can close this, as I don't see an actionable item here. Try the old streaming engine, or wait until the new one is released and supports unique.

ritchie46 commented 3 weeks ago

> Polars pre-computes the hashes for the input data, which is why the memory usage roughly doubles. But this is just temporary; you'll notice that if you call unique multiple times, memory doesn't increase further.

Ah, this might not be the case on Windows. The mimalloc allocator we release there suffers from a lot of fragmentation. (Maybe we should just use the default allocator on Windows.)