pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

memory_map creates copies during certain operations since 0.19.4 #11546

Open kevfly16 opened 11 months ago

kevfly16 commented 11 months ago

Reproducible example

```python
import polars as pl
df = pl.read_ipc("data.feather", memory_map=True)
df[[1,2]]
```

The line `df[[1,2]]` copies the entire DataFrame into in-process memory.
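One way to watch the copy happen from inside the process, without `ps`, is to compare peak RSS around the operation. This is a stdlib-only sketch for Unix-like systems; since the snippet can't assume a `data.feather` is present, a plain 64 MiB allocation stands in for the `df[[1,2]]` materialization:

```python
import resource
import sys

def peak_rss_bytes() -> int:
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == "darwin" else rss * 1024

before = peak_rss_bytes()
# In the report this would be the indexing call:  df[[1, 2]]
# A plain 64 MiB allocation stands in for it here.
buf = bytearray(64 * 1024 * 1024)
after = peak_rss_bytes()
print(f"peak RSS grew by ~{(after - before) / 2**20:.0f} MiB")
```

A zero-copy operation would leave the reading roughly flat; a materializing one shows a jump on the order of the data size.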

Log output

No response

Issue description

When reading an Arrow-backed DataFrame (e.g., via read_ipc) from a RAM disk with the option memory_map=True, Polars creates a zero-copy DataFrame. In version 0.19.3, indexing the DataFrame at multiple integer positions, or joining on an already-sorted DataFrame, was also zero-copy. Since version 0.19.4, these operations are no longer zero-copy and instead copy the DataFrame into in-process memory. With a large enough file this becomes obvious: the operations take much longer and, in some cases, the process runs out of memory.

Expected behavior

Similar to the behavior in version 0.19.3, I would expect that these operations remain zero-copy in newer versions.

Installed versions

```
--------Version info---------
Polars:              0.19.7
Index type:          UInt32
Platform:            macOS-13.5.1-arm64-arm-64bit
Python:              3.11.2 (v3.11.2:878ead1ac1, Feb 7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:         2.2.1
connectorx:
deltalake:
fsspec:              2023.3.0
gevent:
matplotlib:
numpy:               1.25.0
openpyxl:
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            2.4.2
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 11 months ago

What do you mean zero-copy? When you access memory mapped data, you will hit a page fault and the OS will load that data from disk into memory.

kevfly16 commented 11 months ago

My data is all in memory already on a mounted tmpfs volume. There should never be a page fault.

When I say zero-copy, I mean that the data stored in the file never needs to be copied to in-process memory. Polars will create some objects, but the underlying data will be backed by the file. If that file is already in memory, we shouldn't see any copies. This was true in 0.19.3 but stopped being true in all later versions. Since 0.19.4, some operations duplicate the data: one copy lives on tmpfs and another is duplicated in the Python process.

As another example, this won’t copy any data and performs as it should:

```python
df[1]
df[2]
df[1000]
```

But this will cause the latest polars to make a copy of the entire DataFrame in process:

```python
df[[1,2,1000]]
```

ritchie46 commented 11 months ago

> When I say zero-copy, I mean that the data stored in the file never needs to be copied to in-process memory.

But that's not how memory mapping works. The OS will read pages from disk once you access that data.
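Both points can be pictured with a small standard-library sketch (nothing polars-specific; a temp file stands in for the tmpfs-backed Feather file). Accessing mapped memory lets the OS fault pages in, and a `memoryview` keeps aliasing those pages, whereas slicing the `mmap` materializes a private, process-owned `bytes` copy:

```python
import mmap
import os
import tempfile

# Stand-in for the tmpfs-backed file: any regular file works for the sketch.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    view = memoryview(mm)  # zero-copy: backed by the mapping itself
    assert view.obj is mm  # the view aliases the mapped pages

    copy = mm[:]           # slicing an mmap returns bytes, i.e. a private copy

    view.release()         # exported views must be released before close()
    mm.close()
os.unlink(path)
```

Whether the page fault hits disk or (as with tmpfs) memory that is already resident is exactly the distinction being discussed here.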

Can you maybe create a minimal example that shows the problem?

kevfly16 commented 11 months ago

It seems like the issue stems from using pyarrow types. I couldn't figure out how to get an Arrow V2 file stored with pyarrow types using polars alone, so I used pandas instead. I tested with pyarrow integers, doubles, and large_strings and found the same issue. A full, reproducible example using doubles is below.

I'm using a Linux machine with 1 vCPU and 1 GB of RAM (AWS t2.micro):

1. Mount RAM disk and update current directory

```shell
mkdir ramdisk
mount -t tmpfs -o size=1G tmpfs ramdisk
cd ramdisk
```

2. Install dependencies (note using version 0.19.3 of polars)

```shell
python3 -m ensurepip
pip3 install numpy==1.26.0 pandas==2.1.1 pyarrow==13.0.0 polars==0.19.3
```

3. Open a python shell

```shell
# python3
```

4. Create some random data and store it in an Arrow V2 file [~153 MB]

```python
import numpy as np
import pandas as pd

size = 10_000_000
df = pd.DataFrame({"a": np.random.rand(size), "b": np.random.rand(size)}, dtype="double[pyarrow]")
df.to_feather("example.feather", compression="uncompressed")
```

5. Close the python shell and reopen

```
>>> exit()
# python3
```

6. Read example.feather as a memory mapped file from the RAM disk

```python
import polars as pl
df = pl.read_ipc("example.feather", memory_map=True)
```

7. Observe memory usage in a separate shell / window

```
ps aux | head -n1; ps aux | grep python3
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        5304  2.5  4.3 469172 42716 pts/0    S+   15:13   0:00 python3
```

8. Access dataframe

```python
df[100], df[101]
```

9. Observe memory usage

```
ps aux | head -n1; ps aux | grep python3
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        5304  0.3  4.3 469172 42716 pts/0    S+   15:13   0:00 python3
```

10. Access dataframe at multiple positions

```python
df[[100, 101]]
```

11. Observe memory usage

```
ps aux | head -n1; ps aux | grep python3
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        5304  0.2  4.5 537016 45208 pts/0    Sl+  15:13   0:00 python3
```

12. Now install the latest version of polars as of October 7, 2023

```shell
pip3 install -v polars==0.19.7
```

13. Repeating steps 6-11 above with the latest version of polars, memory usage spikes after step 10

```
ps aux | head -n1; ps aux | grep python3
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        6078  0.7 19.4 653888 192812 pts/0   Sl+  15:15   0:00 python3
```

When a file is already stored in memory (e.g., on a RAM disk), the OS never needs to read the memory-mapped file from disk into in-process memory. Since we're using the Arrow memory format to store our files, we can leverage zero-copy reads without any serialization overhead beyond creating the high-level Python objects (e.g., the dataframe library's wrappers). Across all versions, accessing rows independently (step 8) is zero-copy. In version 0.19.3 we were also able to access multiple positions at once (step 10) with zero copy, but a regression seems to have been introduced in version 0.19.4. This affects other polars methods, such as join, too.
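The contiguous-vs-gather distinction can be sketched with the standard library's `mmap` (nothing polars-specific; the file contents and positions below are arbitrary stand-ins). A contiguous window can remain a view over the mapping, while gathering scattered positions, the analogue of `df[[100, 101]]`, writes them into a fresh buffer; that is one plausible shape for where a copy sneaks in:

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, bytes(range(256)) * 16)  # 4 KiB of known data
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mv = memoryview(mm)

    # Contiguous access: the slice is a sub-view that still shares the
    # mapped pages; nothing is copied until we materialize it.
    window = mv[100:102]
    contiguous = bytes(window)  # the copy only happens here, on demand

    # Gather: picking scattered positions has to write them into a fresh
    # buffer; there is no single mapped region it could alias.
    gathered = bytes(mm[i] for i in (100, 101, 1000))

    window.release()
    mv.release()
    mm.close()
os.unlink(path)
```

An engine can still serve a gather as per-element views into the mapping (which is what the 0.19.3 behavior looked like from the outside); the sketch only illustrates why the naive implementation materializes.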

kevfly16 commented 2 months ago

@stinodego @ritchie46 Really appreciate all the work that you put into this package. Polars has been working really well for us, but we can't upgrade past 0.19.3 because this issue is a blocker for us. It still occurs on polars==0.20.31. Any ideas on a fix, or a timeline for looking into it?