pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.27k stars 17.8k forks source link

PERF: Difference in using zipped pickle files #59279

Open buhtz opened 2 months ago

buhtz commented 2 months ago

Pandas version checks

Reproducible Example

The following code has 4 functions. One of them create a sample data frame and zip-pickle it to the current working directory. The other three are different variants to unpickle that file again.

#!/usr/bin/env python3
import io
import zipfile
from datetime import datetime
import pandas as pd
import numpy as np

FN = 'df.pickle.zip'

def create_and_zip_pickle_data():
    num_rows = 1_000_000
    num_cols = 10

    print('Create data frame')

    int_data = np.random.randint(0, 100, size=(num_rows, num_cols // 2))
    str_choices = np.array(['Troi', 'Crusher', 'Yar', 'Guinan'])
    str_data = np.random.choice(str_choices, size=(num_rows, num_cols // 2))
    columns = [f'col_{i}' for i in range(num_cols)]

    df = pd.DataFrame(np.hstack((int_data, str_data)), columns=columns)
    df_one = df.copy()

    for _ in range(20):
        df = pd.concat([df, df_one])

    df = df.reset_index()
    df['col_2'] = df['col_2'].astype('Int16')
    df['col_4'] = df['col_4'].astype('Int16')
    df['col_5'] = df['col_5'].astype('category')
    df['col_7'] = df['col_7'].astype('category')
    df['col_9'] = df['col_9'].astype('category')

    print(df.head())

    print(f'Pickle {len(df):n} rows')
    df.to_pickle(FN)

def unpickle_via_pandas():
    timestamp = datetime.now()
    print('Unpickle with pandas')

    df = pd.read_pickle(FN)
    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_from_memory():
    timestamp = datetime.now()
    print('Unpickle after unzipped into RAM')

    # Unzip into RAM
    print('Unzip into RAM')
    with zipfile.ZipFile(FN) as zf:
        stream = io.BytesIO(zf.read(zf.namelist()[0]))

    # Unpickle from RAM
    print('Unpickle from RAM')
    df = pd.read_pickle(stream)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_zip_filehandle():
    timestamp = datetime.now()
    print('Unpickle with zip filehandle')

    with zipfile.ZipFile(FN) as zf:
        with zf.open('df.pickle') as handle:
            print('Unpickle from filehandle')
            df = pd.read_pickle(handle)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

if __name__ == '__main__':
    print(f'{pd.__version__=}')
    # create_and_zip_pickle_data()
    print('-'*20)
    unpickle_from_memory()
    print('-'*20)
    unpickle_via_pandas()
    print('-'*20)
    unpickle_zip_filehandle()
    print('-'*20)
    print('FIN')

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.11.5.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : de_DE.cp1252 pandas : 2.2.2 numpy : 1.26.4 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 65.5.0 pip : 24.0 Cython : None pytest : 7.4.2 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : None pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.8.0 numba : None numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 15.0.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.11.3 sqlalchemy : None tables : None tabulate : 0.9.0 xarray : 2023.10.1 xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

Prior Performance

I am assuming this is not a bug because pandas is "old" in its best meaning and well developed. There must be a good reason for this behavior.

Unpickle a data frame from a zip-file is very slow (1min 51sec in my example) compared to unzip the pickle file into memory using an io.BytesIO() object and using this with pandas.read_pickle() (6sec in my example).

In the example code below the function unpickle_from_memory() demonstrate the fast way. The slower one is unpickle_via_pandas() and unpickle_zip_filehandle(). The later might be an example about how pandas work internally with that zip file.

Here is the output from the script:

pd.__version__='2.2.2'
--------------------
Unpickle after unzipped into RAM
Unzip into RAM
Unpickle from RAM
21000000 rows. Duration 0:00:06.289123.
--------------------
Unpickle with pandas
21000000 rows. Duration 0:01:51.749488.
--------------------
Unpickle with zip filehandle
Unpickle from filehandle
21000000 rows. Duration 0:01:50.909909.
--------------------
FIN

My question is why is it that way? Wouldn't pandas be more faster and efficient if it would use my method demonstrated in unpickle_from_memory()?

phofl commented 1 month ago

Investigations are welcome, I think you will have to dig into this yourself. We focus on other file formats like parquet these days and would generally recommend this for users.

buhtz commented 1 month ago

I wouldn't ask that question if I hadn't tried to dig into the pandas code myself before.

phofl commented 1 month ago

I just wanted to warn you that this is probably not a priority for us

buhtz commented 1 month ago

I just wanted to warn you that this is probably not a priority for us

Thank you for clarifying that Patrick. As a maintainer I have always had good experiences with communicating such things (the priorities of the project and the maintainers) unsolicited and transparently. This ease up communication and increase empathy on both sides.