Closed hkad98 closed 3 months ago
Hm I can't reproduce this. Also on an ARM Mac
29 141.8 MiB 34.1 MiB 1 grouped_df.unstack("F")
Can you provide a simple reproducer? Fewer columns and without groupby
@phofl I will try to find a simpler reproducer. May I ask how grouped_df.unstack("F") in your reproducer was executed on line 29? Did you add any lines? What version of pandas did you use? What is wrong with groupby?
I added one line to activate/deactivate CoW. Otherwise this was copied as is. Examples should always be as simple as possible, e.g. no unnecessary operations. If your problem occurs in unstack, then groupby is unnecessary
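For anyone following along, a stripped-down reproducer with a single CoW toggle might look like the sketch below. It is not the script used in this thread: the toy data, the column names "A", "F", "V" and the USE_COW flag are assumptions for illustration, and it assumes pandas 2.x, where the "mode.copy_on_write" option is available.

```python
import pandas as pd

# Hypothetical flag: flip to False to compare runs with and without CoW.
USE_COW = True

# The one line that activates/deactivates Copy-on-Write (pandas 2.x option).
pd.set_option("mode.copy_on_write", USE_COW)

# Toy data with a two-level index; "A", "F" and "V" are made-up names.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["f1", "f2"]], names=["A", "F"]
)
df = pd.DataFrame({"V": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Only the operation under suspicion: no groupby, no extra columns.
wide = df.unstack("F")
print(wide)
```

Wrapping the unstack call in a function decorated with memory_profiler's @profile would then give the per-line numbers discussed here.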
@phofl what about the following code?
import pandas as pd
from memory_profiler import profile


@profile
def main():
    df = pd.read_parquet("reproducer.parquet")
    df.unstack("F")


if __name__ == "__main__":
    main()
I put reproducer.parquet in archiv.zip.
Line # Mem usage Increment Occurrences Line Contents
=============================================================
11 77.0 MiB 77.0 MiB 1 @profile
12 def main():
13 115.3 MiB 38.3 MiB 1 df = pd.read_parquet("reproducer.parquet")
14 290.0 MiB 174.8 MiB 1 df.unstack("F")
Still getting huge memory consumption for unstack.
Nope, still can't reproduce
Line # Mem usage Increment Occurrences Line Contents
=============================================================
4 146.7 MiB 146.7 MiB 1 @profile
5 def main():
6 178.8 MiB 32.1 MiB 1 df = pd.read_parquet("reproducer.parquet")
7 215.9 MiB 37.0 MiB 1 df.unstack("F")
That is strange. What version of Pandas do you use?
tried on main and 2.0.3
Hi @phofl, I tried to reproduce the issue independently, and I think I succeeded. See the following runs in my public repo.
Ubuntu: https://github.com/hkad98/pandas-reproducer/actions/runs/5784679774/job/15675843719 MacOS: https://github.com/hkad98/pandas-reproducer/actions/runs/5784759695/job/15676084594
Note that both runners use x86
Unstack increment for:
Unfortunately, GitHub does not provide runners with ARM architecture. I tried running the same script locally on Ubuntu + ARM, and the results were the same as on Ubuntu with x86. I think the issue lies in the combination of macOS 13 with ARM.
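Since the results seem to depend on the OS and CPU architecture, it may help to print those details next to each profiler run. A minimal sketch using only the standard library follows; describe_environment is a hypothetical helper name, not part of pandas or memory_profiler.

```python
import platform
from importlib import metadata


def describe_environment() -> dict:
    """Collect the platform details that appear to matter in this thread."""
    try:
        pandas_version = metadata.version("pandas")
    except metadata.PackageNotFoundError:
        # pandas is not installed in this interpreter.
        pandas_version = "not installed"
    return {
        "system": platform.system(),    # e.g. "Darwin" or "Linux"
        "release": platform.release(),  # OS release string
        "machine": platform.machine(),  # e.g. "arm64" or "x86_64"
        "python": platform.python_version(),
        "pandas": pandas_version,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

pd.show_versions() gives a fuller report when pandas is importable; the helper above just keeps the output short enough to paste alongside the memory numbers.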
Seems like the original issue wasn't directly reproducible so closing.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I use memory_profiler. Run
mprof run file_name
to measure memory consumption on each line of code.
Memory usage for unstack:
I tried to improve memory consumption with the following changes:
- category instead of string
- pivot_table instead of groupby + unstack
- reset_index + pivot instead of unstack
None of the things above worked.
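For reference, the reshaping alternatives listed above can be sketched on toy data as follows. This is only an illustration of how the three approaches relate, not the original code: the frame, the column names "A", "F", "V" and the sum aggregation are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the real frame in this report is much larger.
df = pd.DataFrame(
    {
        "A": ["x", "x", "y", "y"],
        "F": ["f1", "f2", "f1", "f2"],
        "V": [1.0, 2.0, 3.0, 4.0],
    }
)

# Baseline: groupby followed by unstack.
grouped = df.groupby(["A", "F"])["V"].sum()
via_unstack = grouped.unstack("F")

# Alternative 1: pivot_table instead of groupby + unstack.
via_pivot_table = df.pivot_table(
    index="A", columns="F", values="V", aggfunc="sum"
)

# Alternative 2: reset_index + pivot instead of unstack.
via_pivot = grouped.reset_index().pivot(index="A", columns="F", values="V")

print(via_unstack)
```

All three produce the same wide table on this input, so on a real dataset each can be profiled separately with the same input to compare memory use.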
I tried to turn on CoW, but it made it worse.
I am aware that CoW does not support unstack yet (https://github.com/pandas-dev/pandas/issues/49473), but I would not expect that turning it on would make unstack worse.
Installed Versions
Prior Performance
No response