Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import numpy as np


def max_usage_mb():
    # report the peak RSS of the process so far (ru_maxrss is in KB on Linux)
    import resource
    maximal = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'{maximal / 1000: 7.1f}', 'MB')


data = np.random.randint(0, 255, [128_000, 256], dtype='uint8')
# uncomment the following line to compare with the bytes case
# data = [row.tobytes() for row in np.random.randint(0, 255, [128_000, 256], dtype='uint8')]

max_usage_mb()
df = pl.DataFrame(dict(x=data))
max_usage_mb()
df.unique()
max_usage_mb()
Log output
bytes field:
340.6 MB
340.6 MB
340.6 MB
array of uint8 field:
202.0 MB
225.7 MB
3536.0 MB # note this unreasonable 10x hike in memory
Issue description
For no obvious reason, Array fields demand far too much memory: the raw payload is only 128,000 × 256 bytes ≈ 33 MB, yet peak usage jumps by roughly 3.3 GB during .unique(). I guess the underlying issue is that deduplication does not rely on hashing (as it should).
In the repro I compare against plain bytes, which keep a similar footprint through .unique().
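For illustration only, here is a minimal workaround sketch (not a proposed fix, and assuming Expr.hash handles the Array dtype here): deduplicating on a 64-bit hash of the column instead of the column itself, accepting a small hash-collision risk.

import numpy as np
import polars as pl

data = np.random.randint(0, 255, [128_000, 256], dtype='uint8')
df = pl.DataFrame(dict(x=data))

# Deduplicate on a 64-bit hash of the array column rather than on the
# column itself; assumes Expr.hash supports the Array dtype and that
# rare hash collisions are acceptable for the use case.
deduped = (
    df.with_columns(pl.col('x').hash().alias('x_hash'))
      .unique(subset='x_hash')
      .drop('x_hash')
)
print(deduped.shape)

This is only meant to show what hash-based deduplication looks like from the user side; the report itself is about .unique() on Array columns.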
Expected behavior
Memory consumption of the two cases should be comparable (as should speed).
Installed versions