Open kevfly16 opened 11 months ago
What do you mean zero-copy? When you access memory mapped data, you will hit a page fault and the OS will load that data from disk into memory.
My data is all in memory already on a mounted tmpfs volume. There should never be a page fault.
When I say zero-copy, I mean that the data stored in the file never needs to be copied to in-process memory. Polars will create some objects, but the underlying data will be backed by the file. If that file is already in memory, we shouldn't see any copies. This was true in 0.19.3, but stopped being true in all later versions. Since 0.19.4, some operations will duplicate the data: one copy stored on tmpfs and another in the Python process.
As another example, this won't copy any data and performs as it should:

```python
df[1]
df[2]
df[1000]
```

But this will cause the latest polars to make a copy of the entire DataFrame in process:

```python
df[[1, 2, 1000]]
```
> When I say zero-copy, I mean that the data stored on the file never needs to be copied to in-process memory.
But that's not how memory mapping works. The OS will read pages from disk once you access that data.
Can you maybe create a minimal example that shows the problem?
It seems like the issue stems from using pyarrow types. I couldn't figure out how to get an Arrow V2 file stored with pyarrow types using polars alone, so I used pandas instead. I tested with pyarrow integers, doubles, and large_strings and found the same issue. A full, reproducible example using doubles is below.
I'm using a linux machine with 1 vCPU and 1 GB RAM (AWS t2.micro):
1. Mount RAM disk and update current directory

```shell
mkdir ramdisk
mount -t tmpfs -o size=1G tmpfs ramdisk
cd ramdisk
```
2. Install dependencies (note: using version 0.19.3 of polars)

```shell
python3 -m ensurepip
pip3 install numpy==1.26.0 pandas==2.1.1 pyarrow==13.0.0 polars==0.19.3
```
3. Open a python shell

```shell
# python3
```
4. Create some random data and store it in an Arrow V2 file [~153 MB]

```python
import numpy as np
import pandas as pd

size = 10_000_000
df = pd.DataFrame({"a": np.random.rand(size), "b": np.random.rand(size)}, dtype="double[pyarrow]")
df.to_feather("example.feather", compression="uncompressed")
```
5. Close the python shell and reopen

```shell
>>> exit()
# python3
```
6. Read example.feather as a memory mapped file from the RAM disk

```python
import polars as pl

df = pl.read_ipc("example.feather", memory_map=True)
```
7. Observe memory usage in a separate shell / window

```shell
ps aux | head -n1; ps aux | grep python3
USER PID  %CPU %MEM VSZ    RSS   TTY   STAT START TIME COMMAND
root 5304 2.5  4.3  469172 42716 pts/0 S+   15:13 0:00 python3
```
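As a side note, the RSS column that `ps` reports can also be read from inside the process itself. A minimal Linux-only sketch (not part of the original reproduction) that parses `VmRSS` from `/proc/self/status`:

```python
def vm_rss_kib() -> int:
    # Parse the resident set size (VmRSS, reported in kB) from
    # /proc/self/status; Linux-only. Mirrors the RSS column of `ps`.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")

print(vm_rss_kib())
```

Calling this before and after a suspect operation gives the same before/after comparison as the `ps` snapshots below, without a second shell.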
8. Access dataframe

```python
df[100], df[101]
```
9. Observe memory usage

```shell
ps aux | head -n1; ps aux | grep python3
USER PID  %CPU %MEM VSZ    RSS   TTY   STAT START TIME COMMAND
root 5304 0.3  4.3  469172 42716 pts/0 S+   15:13 0:00 python3
```
10. Access dataframe at multiple positions

```python
df[[100, 101]]
```
11. Observe memory usage

```shell
ps aux | head -n1; ps aux | grep python3
USER PID  %CPU %MEM VSZ    RSS   TTY   STAT START TIME COMMAND
root 5304 0.2  4.5  537016 45208 pts/0 Sl+  15:13 0:00 python3
```
12. Now install the latest version of polars as of October 7, 2023

```shell
pip3 install -v polars==0.19.7
```
13. Repeating steps 6-10 above, memory usage spikes after step 10 using the latest version of polars

```shell
ps aux | head -n1; ps aux | grep python3
USER PID  %CPU %MEM VSZ    RSS    TTY   STAT START TIME COMMAND
root 6078 0.7  19.4 653888 192812 pts/0 Sl+  15:15 0:00 python3
```
When a file is already stored in memory (e.g., on a RAM disk), the OS never needs to read the memory-mapped file from disk into in-process memory. Since we're using the Arrow memory format to store our files, we can leverage zero-copy reads without any serialization overhead beyond creating the high-level Python objects (e.g., dataframe library objects). Across all versions, accessing the rows independently (step 8) is zero-copy. In version 0.19.3 we were also able to access multiple positions at the same time (step 10) with zero-copy, but a regression seems to have been introduced in version 0.19.4. This affects other polars methods like join too.
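The view-versus-copy distinction can be demonstrated with the standard library alone. A hypothetical sketch (file name and sizes are illustrative, not from the reproduction above) comparing peak RSS after taking a zero-copy view of a memory-mapped file versus materializing a full copy:

```python
import mmap
import os
import resource

path = "example.bin"      # illustrative file name
size = 50 * 1024 * 1024   # 50 MiB

# Write the file in small chunks so the write buffer barely affects peak RSS.
chunk = b"\x00" * (1024 * 1024)
with open(path, "wb") as f:
    for _ in range(size // len(chunk)):
        f.write(chunk)

def peak_rss() -> int:
    # Peak resident set size so far (KiB on Linux, bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    base = peak_rss()
    view = memoryview(mm)[:10]   # zero-copy: no pages faulted, no heap copy
    after_view = peak_rss()
    full = bytes(mm)             # faults in every page and copies to the heap
    after_copy = peak_rss()
    view.release()
    mm.close()

os.remove(path)
print(after_view - base, after_copy - base)
```

The view leaves peak RSS essentially unchanged, while the full copy grows it by roughly the file size; this mirrors the difference between step 8 and step 10 in the `ps` measurements above.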
@stinodego @ritchie46 Really appreciate all the work that you put into this package. Polars has been working really well for us, but we can't upgrade to versions >0.19.3 because this issue is breaking for us. This still seems to be happening on polars==0.20.31. Any ideas on a fix or timeline for looking into it?
Checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
The line `df[[1, 2]]` will start copying the entire DataFrame to in-process memory.

Log output
No response
Issue description
When reading an arrow-backed DataFrame (e.g., via `read_ipc`) from a RAM disk using the option `memory_map=True`, polars creates a zero-copy DataFrame. In version 0.19.3, indexing the DataFrame by multiple integer positions or performing a join on an already sorted DataFrame would also be zero-copy. Since version 0.19.4, these operations are no longer zero-copy and instead copy the DataFrame to in-process memory. With a large enough file this becomes obvious because the operations take a lot longer and, in some cases, the process runs out of memory.

Expected behavior
Similar to the behavior in version 0.19.3, I would expect that these operations remain zero-copy in newer versions.
Installed versions