PERF: Difference in using zipped pickle files

buhtz commented 2 months ago

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The following code has four functions. One of them creates a sample data frame and zip-pickles it to the current working directory. The other three are different ways to unpickle that file again.

#!/usr/bin/env python3
import io
import zipfile
from datetime import datetime
import pandas as pd
import numpy as np

FN = 'df.pickle.zip'

def create_and_zip_pickle_data():
    num_rows = 1_000_000
    num_cols = 10

    print('Create data frame')

    int_data = np.random.randint(0, 100, size=(num_rows, num_cols // 2))
    str_choices = np.array(['Troi', 'Crusher', 'Yar', 'Guinan'])
    str_data = np.random.choice(str_choices, size=(num_rows, num_cols // 2))
    columns = [f'col_{i}' for i in range(num_cols)]

    df = pd.DataFrame(np.hstack((int_data, str_data)), columns=columns)
    df_one = df.copy()

    for _ in range(20):
        df = pd.concat([df, df_one])

    df = df.reset_index()
    df['col_2'] = df['col_2'].astype('Int16')
    df['col_4'] = df['col_4'].astype('Int16')
    df['col_5'] = df['col_5'].astype('category')
    df['col_7'] = df['col_7'].astype('category')
    df['col_9'] = df['col_9'].astype('category')

    print(df.head())

    print(f'Pickle {len(df):n} rows')
    df.to_pickle(FN)

def unpickle_via_pandas():
    timestamp = datetime.now()
    print('Unpickle with pandas')

    df = pd.read_pickle(FN)
    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_from_memory():
    timestamp = datetime.now()
    print('Unpickle after unzipped into RAM')

    # Unzip into RAM
    print('Unzip into RAM')
    with zipfile.ZipFile(FN) as zf:
        stream = io.BytesIO(zf.read(zf.namelist()[0]))

    # Unpickle from RAM
    print('Unpickle from RAM')
    df = pd.read_pickle(stream)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_zip_filehandle():
    timestamp = datetime.now()
    print('Unpickle with zip filehandle')

    with zipfile.ZipFile(FN) as zf:
        with zf.open('df.pickle') as handle:
            print('Unpickle from filehandle')
            df = pd.read_pickle(handle)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

if __name__ == '__main__':
    print(f'{pd.__version__=}')
    # create_and_zip_pickle_data()  # uncomment on the first run to create FN
    print('-'*20)
    unpickle_from_memory()
    print('-'*20)
    unpickle_via_pandas()
    print('-'*20)
    unpickle_zip_filehandle()
    print('-'*20)
    print('FIN')

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

I am assuming this is not a bug, because pandas is "old" in the best sense of the word and well developed. There must be a good reason for this behavior.

Unpickling a data frame from a zip file is very slow (1 min 51 s in my example) compared to first unzipping the pickle file into memory as an io.BytesIO() object and then passing that object to pandas.read_pickle() (6 s in my example).

In the example code above, the function unpickle_from_memory() demonstrates the fast way. The slower ones are unpickle_via_pandas() and unpickle_zip_filehandle(). The latter might illustrate how pandas works with the zip file internally.

Here is the output from the script:

pd.__version__='2.2.2'
--------------------
Unpickle after unzipped into RAM
Unzip into RAM
Unpickle from RAM
21000000 rows. Duration 0:00:06.289123.
--------------------
Unpickle with pandas
21000000 rows. Duration 0:01:51.749488.
--------------------
Unpickle with zip filehandle
Unpickle from filehandle
21000000 rows. Duration 0:01:50.909909.
--------------------
FIN
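One way to narrow this down is to check whether the same gap shows up without pandas at all. The following is a minimal sketch using only the standard library. The payload, the member name data.pickle and all helper names are made up for illustration, and I make no claim about the timings it prints on other machines or for other payloads.

```python
import io
import pickle
import time
import zipfile


def make_zip_bytes(obj, member='data.pickle'):
    """Pickle obj and store it as one deflated member of an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return buffer.getvalue()


def load_via_handle(zip_bytes, member='data.pickle'):
    """Unpickle straight from the zip member's file handle."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as handle:
            return pickle.load(handle)


def load_via_memory(zip_bytes, member='data.pickle'):
    """Decompress the member fully into RAM first, then unpickle."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        stream = io.BytesIO(zf.read(member))
    return pickle.load(stream)


def timed(func, *args):
    """Return (result, seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    # Hypothetical payload: many short strings, loosely like the str columns.
    names = ('Troi', 'Crusher', 'Yar', 'Guinan')
    payload = [names[i % 4] for i in range(200_000)]
    zip_bytes = make_zip_bytes(payload)
    for loader in (load_via_handle, load_via_memory):
        result, seconds = timed(loader, zip_bytes)
        print(f'{loader.__name__}: {len(result)} items in {seconds:.3f}s')
```

If the two loaders already differ noticeably here, the cause would more likely sit in how pickle reads from a zip file handle than in pandas itself. If they do not, the pandas-specific object layout would be the next thing to look at.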

My question is: why is it that way? Wouldn't pandas be faster and more efficient if it used the method demonstrated in unpickle_from_memory()?
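To look into the "why", it could help to see which calls pickle actually makes on the zip file handle. The sketch below wraps the handle in a small counting proxy. CountingReader and count_reads are hypothetical helper names, not part of pandas or the standard library, and the sketch only reports call counts; it does not show that the number of calls is the cause of the slowdown.

```python
import io
import pickle
import zipfile
from collections import Counter


class CountingReader:
    """Hypothetical proxy that counts method calls made on a file object."""

    def __init__(self, raw):
        self._raw = raw
        self.calls = Counter()

    def __getattr__(self, name):
        # Only reached for names not set in __init__, e.g. read or peek.
        # A method missing on the wrapped object still raises AttributeError.
        attr = getattr(self._raw, name)
        if not callable(attr):
            return attr

        def counted(*args, **kwargs):
            self.calls[name] += 1
            return attr(*args, **kwargs)

        return counted


def count_reads(zip_bytes, member):
    """Unpickle one zip member through the proxy; return (object, call counts)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as handle:
            proxy = CountingReader(handle)
            obj = pickle.load(proxy)
    return obj, proxy.calls


if __name__ == '__main__':
    # Made-up payload and member name, for illustration only.
    payload = [str(i) for i in range(100_000)]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('data.pickle', pickle.dumps(payload))
    obj, calls = count_reads(buffer.getvalue(), 'data.pickle')
    print(len(obj), dict(calls))
```

The same proxy can be put around the handle in unpickle_zip_filehandle() before it is passed to pd.read_pickle(), to compare the call pattern for the real data frame.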

phofl commented 1 month ago

Investigations are welcome, but I think you will have to dig into this yourself. We focus on other file formats like Parquet these days and would generally recommend those to users.

buhtz commented 1 month ago

I wouldn't ask that question if I hadn't tried to dig into the pandas code myself before.

phofl commented 1 month ago

I just wanted to warn you that this is probably not a priority for us

buhtz commented 1 month ago

> I just wanted to warn you that this is probably not a priority for us

Thank you for clarifying that, Patrick. As a maintainer myself, I have always had good experiences with communicating such things (the priorities of the project and the maintainers) unsolicited and transparently. This eases communication and increases empathy on both sides.
