pandas-dev / pandas


PERF: high memory consumption for unstack #54373

Closed: hkad98 closed this issue 3 months ago

hkad98 commented 1 year ago

Pandas version checks

Reproducible Example

I use memory_profiler. Run mprof run file_name to measure memory consumption for each line of code.

import pandas as pd
import random
import string
from memory_profiler import profile

def random_string():
    return ''.join(random.choices(string.ascii_letters, k=7))

@profile
def main():
    records_count = 63531
    # columns A-F: random string keys of widely varying cardinality (B and D are
    # high-cardinality); M holds the numeric values that are aggregated below
    df = pd.DataFrame(
        {
            "A": random.choices([random_string() for _ in range(24)], k=records_count),
            "B": random.choices([random_string() for _ in range(14580)], k=records_count),
            "C": random.choices([random_string() for _ in range(9)], k=records_count),
            "D": random.choices([random_string() for _ in range(2311)], k=records_count),
            "E": random.choices([random_string() for _ in range(2)], k=records_count),
            "F": random.choices([random_string() for _ in range(280)], k=records_count),
            "M": random.sample(range(0, records_count), records_count)
        }
    )

    grouped_df = df.groupby(["A", "B", "C", "D", "E", "F"], dropna=False)[["M"]].sum(min_count=1, numeric_only=False)
    grouped_df.unstack("F")

if __name__ == "__main__":
    main()

Memory usage for unstack:

    27    264.1 MiB    171.8 MiB           1       grouped_df.unstack("F")
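
For reference, the increment can also be checked without the decorator by calling memory_profiler's memory_usage helper directly. A minimal sketch (not part of the original report), assuming grouped_df has been built as in the script above:

from memory_profiler import memory_usage

def run_unstack(frame):
    return frame.unstack("F")

# sample process memory (in MiB) while run_unstack executes
samples = memory_usage((run_unstack, (grouped_df,), {}), interval=0.01)
print(f"approximate unstack increment: {max(samples) - samples[0]:.1f} MiB")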

I tried several changes to reduce memory consumption, but none of them worked.

I also tried turning on Copy-on-Write (CoW), but it made things worse.

  28    390.6 MiB    306.1 MiB           1       grouped_df.unstack("F")

I am aware that CoW does not support unstack yet (https://github.com/pandas-dev/pandas/issues/49473), but I would not expect turning it on to make unstack worse.
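
For context, Copy-on-Write can be switched on globally in pandas 2.x with a single option. A minimal sketch of how it was presumably enabled for the run above (the exact line used is not shown in the thread):

import pandas as pd

# enable Copy-on-Write globally (pandas 2.x)
# equivalently: pd.options.mode.copy_on_write = True
pd.set_option("mode.copy_on_write", True)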

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0f437949513225922d851e9581723d82120684a6
python : 3.10.8.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

phofl commented 1 year ago

Hm, I can't reproduce this, also on an ARM Mac:

 29    141.8 MiB     34.1 MiB           1       grouped_df.unstack("F")

Can you provide a simpler reproducer, with fewer columns and without groupby?
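
For concreteness, a sketch of the kind of stripped-down example meant here (the sizes are illustrative, not taken from the report): a MultiIndexed frame unstacked directly, with no groupby involved.

import numpy as np
import pandas as pd

n = 60_000
# draw unique (B, F) pairs so unstack does not raise on duplicate index entries
codes = np.random.default_rng(0).choice(15_000 * 300, size=n, replace=False)
idx = pd.MultiIndex.from_arrays([codes // 300, codes % 300], names=["B", "F"])
df = pd.DataFrame({"M": np.arange(n)}, index=idx)
df.unstack("F")  # reshapes into a 15_000 x 300 grid that is mostly NaN

Unstacking a level with many distinct values materializes every (B, F) combination, which is roughly where a large allocation comes from.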

hkad98 commented 1 year ago

@phofl I will try to find a simpler reproducer. May I ask how grouped_df.unstack("F") ended up on line 29 in your run? Did you add any lines? Which version of pandas did you use? And what is wrong with groupby?

phofl commented 1 year ago

I added one line to activate/deactivate CoW; otherwise this was copied as-is. Examples should always be as simple as possible, e.g. no unnecessary operations. If your problem occurs in unstack, then groupby is unnecessary.

hkad98 commented 1 year ago

@phofl what about the following code?

import pandas as pd
from memory_profiler import profile

@profile
def main():
    df = pd.read_parquet("reproducer.parquet")
    df.unstack("F")

if __name__ == "__main__":
    main()

I put reproducer.parquet in archiv.zip.
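
(How reproducer.parquet was produced is not shown in the thread; presumably it is the grouped, MultiIndexed frame from the first script written to disk, e.g. with something like the following.)

# assumption, not stated in the thread: persist the grouped frame from the
# first example so it can be re-read as a standalone reproducer
grouped_df.to_parquet("reproducer.parquet")  # requires a parquet engine (pyarrow or fastparquet)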

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    11     77.0 MiB     77.0 MiB           1   @profile
    12                                         def main():
    13    115.3 MiB     38.3 MiB           1       df = pd.read_parquet("reproducer.parquet")
    14    290.0 MiB    174.8 MiB           1       df.unstack("F")

Still getting huge memory consumption for unstack.

phofl commented 1 year ago

Nope, still can't reproduce:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4    146.7 MiB    146.7 MiB           1   @profile
     5                                         def main():
     6    178.8 MiB     32.1 MiB           1       df = pd.read_parquet("reproducer.parquet")
     7    215.9 MiB     37.0 MiB           1       df.unstack("F")

hkad98 commented 1 year ago

That is strange. What version of pandas are you using?

phofl commented 1 year ago

Tried on main and 2.0.3.

hkad98 commented 1 year ago

Hi @phofl, I tried to reproduce the issue independently, and I think I succeeded. See the following runs in my public repo.

Ubuntu: https://github.com/hkad98/pandas-reproducer/actions/runs/5784679774/job/15675843719
macOS: https://github.com/hkad98/pandas-reproducer/actions/runs/5784759695/job/15676084594

Note that both runners use x86.

The unstack increments for each run can be seen in the linked logs.

Unfortunately, GitHub does not provide runners with ARM architecture. I tried running the same script locally on Ubuntu + ARM, and the results were the same as on Ubuntu with x86, so I think the issue lies in the combination of macOS 13 and ARM.

mroeschke commented 3 months ago

Seems like the original issue wasn't directly reproducible, so closing.