pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: concat zero-copy breaks #54991

Open mlkui opened 10 months ago

mlkui commented 10 months ago

Pandas version checks

Reproducible Example

import os

import pandas as pd
import psutil
import pyarrow as pa

if __name__ == "__main__":
    pid = os.getpid()
    process = psutil.Process(pid)

    with pa.memory_map("date1.arrow", 'r') as source1, pa.memory_map("v1.arrow", 'r') as source2, \
            pa.memory_map("date2.arrow", 'r') as source3, pa.memory_map("v2.arrow", 'r') as source4:
        c1 = pa.ipc.RecordBatchFileReader(source1).read_all().column("date")
        c2 = pa.ipc.RecordBatchFileReader(source2).read_all().column("v")
        c3 = pa.ipc.RecordBatchFileReader(source3).read_all().column("date")
        c4 = pa.ipc.RecordBatchFileReader(source4).read_all().column("v")
        print("RSS of PyArrow: {}MB".format(pa.total_allocated_bytes() >> 20))  # RSS of PyArrow: 0MB

        s1 = c1.to_pandas(zero_copy_only=True)
        s2 = c2.to_pandas(zero_copy_only=True)
        s3 = c3.to_pandas(zero_copy_only=True)
        s4 = c4.to_pandas(zero_copy_only=True)
        print("RSS of PyArrow: {}MB".format(pa.total_allocated_bytes() >> 20))  # RSS of PyArrow: 0MB

        dfs = {"date": s1, "v": s2}
        df1 = pd.concat(dfs, axis=1, copy=False)  # zero-copy
        print("RSS of PyArrow: {}MB".format(pa.total_allocated_bytes() >> 20))  # RSS of PyArrow: 0MB

        dfs2 = {"date": s3, "v": s4}
        df2 = pd.concat(dfs2, axis=1, copy=False)  # zero-copy
        print("RSS of PyArrow: {}MB".format(pa.total_allocated_bytes() >> 20))  # RSS of PyArrow: 0MB

        print(df1)
        print(df2)

        memory_info = process.memory_info()
        # rss and vms are reported in bytes; dividing by 1024 twice gives MB
        print(f"RSS (Resident Set Size): {memory_info.rss / 1024 / 1024} MB")
        print(f"VMS (Virtual Memory Size): {memory_info.vms / 1024 / 1024} MB")

        # NOT zero-copy
        result_df = pd.concat([df1, df2], axis=0, copy=False)
        # PyArrow still reports 0MB allocated, but the copy happens on the pandas side.
        # Using a larger Arrow file makes the issue more apparent.
        print("RSS of PyArrow: {}MB".format(pa.total_allocated_bytes() >> 20))
        print(result_df)

        memory_info = process.memory_info()
        # rss and vms are reported in bytes; dividing by 1024 twice gives MB
        print(f"RSS (Resident Set Size): {memory_info.rss / 1024 / 1024} MB")
        print(f"VMS (Virtual Memory Size): {memory_info.vms / 1024 / 1024} MB")

Issue Description

Arrow table concatenation and pandas.concat should both be zero-copy, but concatenating two zero-copy DataFrames (converted from Arrow tables) along axis=0 triggers a copy, even when pandas Copy-on-Write (CoW) is turned on.
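
A way to check for the copy directly, rather than inferring it from RSS, is to ask NumPy whether the result shares a buffer with the inputs. The sketch below is not the reproducer above: it uses small in-memory frames (the names df1, df2 and the column "v" are reused only for illustration) and assumes NumPy-backed float64 columns. Since such a column is stored as one contiguous buffer, stacking rows from two separately allocated buffers would be expected to write into a new one.

```python
import numpy as np
import pandas as pd

# Two small frames standing in for the Arrow-backed ones in the report
# (illustrative data, not the original files).
df1 = pd.DataFrame({"v": np.arange(3, dtype="float64")})
df2 = pd.DataFrame({"v": np.arange(3, 6, dtype="float64")})

# Row-wise concatenation, as in the last step of the reproducer.
result = pd.concat([df1, df2], axis=0)

# np.shares_memory tells us whether two arrays overlap in memory.
shares_first = np.shares_memory(result["v"].to_numpy(), df1["v"].to_numpy())
shares_second = np.shares_memory(result["v"].to_numpy(), df2["v"].to_numpy())

print("result shares memory with df1:", shares_first)
print("result shares memory with df2:", shares_second)
```

If both checks come back False, the data was copied into a new buffer, which matches the behavior described in this report.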

Also, concatenating the two Arrow tables first and then converting the result to a DataFrame with zero_copy_only=True is currently not allowed either, because the resulting columns have more than one chunk.

Expected Behavior

When using pandas.concat to concatenate two zero-copy dataframes (converted from Arrow tables), it should not involve any copying.

Installed Versions

pandas 2.0.3 and pyarrow 12.0
pandas 2.1.0 and pyarrow 13.0

phofl commented 10 months ago

Hi, thanks for your report. Could you include data so that your example is copy-pasteable?

mlkui commented 10 months ago

@phofl Sure, I have modified the previous code to make it copy-pasteable. You can download the small test data from https://github.com/mlkui/pandas_test_data. Using a larger Arrow file makes the issue more apparent.

mlkui commented 9 months ago

@phofl Hi, does the problem exist?