Chuck321123 opened this issue 1 month ago (status: Open)
Polars pre-computes the hashes for the input data, which is why the memory usage roughly doubles. But this is only temporary: you'll notice that if you call unique multiple times, it doesn't increase further.
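One way to see this in practice (a minimal sketch, not from the issue; it assumes psutil is installed and uses an illustrative column name and size) is to watch the process RSS across repeated calls:

```python
# Sketch: watch resident memory across repeated unique() calls.
# psutil, the column name "a", and the data size are illustrative assumptions.
import os

import numpy as np
import polars as pl
import psutil

proc = psutil.Process(os.getpid())
rss_mb = lambda: proc.memory_info().rss / 1e6

df = pl.DataFrame({"a": np.random.randint(0, 1_000_000, size=20_000_000)})
print(f"before: {rss_mb():.0f} MB")
for i in range(3):
    _ = df.unique()
    print(f"after unique() call {i + 1}: {rss_mb():.0f} MB")
# Expectation per the comment above: a jump around the first call
# (hash pre-computation), but no further growth on later calls.
```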
Using streaming should also reduce the amount of extra memory used, as it doesn't pre-compute everything in one batch.
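For example (a sketch: the file path is hypothetical, and `collect(streaming=True)` is the flag for the old streaming engine, which may be deprecated in newer releases):

```python
# Sketch: run unique() through the (old) streaming engine so the input
# is processed in batches instead of being hashed all at once.
import polars as pl

lf = pl.scan_parquet("data.parquet")  # hypothetical input file
result = lf.unique().collect(streaming=True)
```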
> That the unique() and n_unique() functions don't cause any memory spikes.
Any algorithm needs working space to run; pre-computing a hash is a reasonable space-time trade-off.
I think we can close this, as I don't think we have any action to take here. Try the old streaming engine, or wait until the new one is released and supports unique.
> Polars pre-computes the hashes for the input data, which is why the memory usage roughly doubles. But this is only temporary: you'll notice that if you call unique multiple times, it doesn't increase further.
Ah, this might not be the case on Windows: the mimalloc allocator we ship in the Windows release suffers from a lot of fragmentation. (Maybe we should just use the default allocator on Windows.)
Checks
Reproducible example
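The original snippet is not preserved here; the following is a hedged reconstruction of the kind of comparison the issue describes. The sizes, column name, and psutil dependency are illustrative assumptions, not taken from the original report.

```python
# Reconstruction (not the original code): compare RSS growth around
# unique() in polars vs. pandas on the same data.
import os

import numpy as np
import pandas as pd
import polars as pl
import psutil

proc = psutil.Process(os.getpid())
rss_mb = lambda: proc.memory_info().rss / 1e6

values = np.random.randint(0, 1_000_000, size=50_000_000)

df_pl = pl.DataFrame({"a": values})
before = rss_mb()
_ = df_pl.unique()
print(f"polars unique():  RSS {before:.0f} -> {rss_mb():.0f} MB")

df_pd = pd.DataFrame({"a": values})
before = rss_mb()
_ = df_pd["a"].unique()
print(f"pandas unique():  RSS {before:.0f} -> {rss_mb():.0f} MB")
# Note: RSS after the call understates the transient peak during hashing;
# an external memory profiler gives a clearer picture of the spike.
```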
Polars memory usage:
Pandas memory usage:
Log output
No response
Issue description
So this has been mentioned before, as described in #16857, but I'm writing it here once more to bring it into the spotlight. There are memory problems with unique() and n_unique(), especially compared to pandas. If you search "unique" in the open issues, there are a lot of crashes, freezes, runs that take far too long, and out-of-memory problems, most of them related to the unique() function. Finding a way to reduce the RAM usage of unique() and n_unique() could therefore fix many of those issues. If you feel the need to close this in favor of #16857, that is also alright.
Expected behavior
That the unique() and n_unique() functions don't cause any memory spikes.
Installed versions