mar-ses opened this issue 1 year ago
Can you give this a try on pandas 1.5.1 or main?
Yes, the values are quite different, but the issue is the same. Part of the reason for the different values could be that I ran the main example in Jupyter, while this one I ran directly as a script. I didn't fully understand why I get this big 2 GB offset in the main example.
Here is the output:
Pandas version: 1.5.1
Initial memory usage: 62.513152 MB
Memory usage after iteration: 478.826496 MB
Memory usage after iteration: 1270.603776 MB
Memory usage after iteration: 1547.554816 MB
Memory usage after iteration: 1823.973376 MB
Memory usage after iteration: 2101.18656 MB
Memory usage after iteration: 2424.328192 MB
Memory usage after iteration: 3216.216064 MB
Memory usage after iteration: 3605.475328 MB
Memory usage after iteration: 3882.16832 MB
Memory usage after iteration: 4383.531008 MB
Memory usage of df: 1124.821968
Final memory usage: 4154.92096 MB
Overall, when constructing and testing the example I posted, as well as dealing with my real use case, I got the impression that the exact memory usage was very unstable and hard to explain.
For example, I'm sure that sometimes one of the iterations wouldn't increase the memory at all, and other times it would increase it by double the normal step.
In the real example, the memory sometimes goes down and sometimes goes up. I've been doing a deep dive, and the jumps in memory usage are sometimes larger than any single DataFrame that I query or create anywhere; other times usage goes down (but overall it's slowly inflating).
I don't know if this is a common problem when profiling memory usage in Python.
Also, isn't it a bit weird that the increments in memory usage are not regular? Sometimes it's 600 MB, sometimes 800 MB, sometimes 200 MB, even though each step should be identical. Do you know why this could be?
Could it just be random timing of the garbage collector?
Also, if you run this on your system, what values do you get?
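One way to rule out GC timing when profiling is to force a collection right before each measurement. A minimal stdlib-only sketch (the workload here is made up, and `tracemalloc` tracks Python-level allocations rather than the process RSS figures above):

```python
import gc
import tracemalloc

tracemalloc.start()

def measure():
    # Force a full collection so deferred frees don't show up as "leaks".
    gc.collect()
    current, _peak = tracemalloc.get_traced_memory()
    return current / 1e6  # MB of currently traced allocations

baseline = measure()
junk = [{"k": i} for i in range(100_000)]  # throwaway workload
grown = measure()
del junk
freed = measure()
print(f"baseline={baseline:.1f} MB  grown={grown:.1f} MB  freed={freed:.1f} MB")
```

Note that even after objects are freed, RSS can stay high because the allocator doesn't necessarily return pages to the OS, which could account for part of the unexplained gap.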
@mar-ses the memory leak happens because you are doing the most common pandas bad practice: continuously mutating a DataFrame in a loop, one cell at a time.
If you rewrite these lines in `get_df(N)`:

```python
for i in df.index:
    df.at[i, "c"] = {f"blabla_{j}": j for j in range(i)}
```

into the following, the memory leak will disappear:

```python
df["c"] = pd.Series([[{f"blabla_{j}": j for j in range(i)}] for i in df.index])
```
I am not sure how the pandas team can fix this issue, but the underlying problem is that users treat DataFrames as mutable variables just like any other Python variable. In reality, the best practice is to treat a DataFrame as an immutable object: chain your transformations by copying into new DataFrames and letting the GC collect the old ones.
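As a sketch of that style (the body of `get_df` here is my assumption based on the snippet above), building the column once instead of mutating cell by cell:

```python
import pandas as pd

def get_df(n: int) -> pd.DataFrame:
    # Build the whole column in one shot; no per-cell writes into blocks.
    df = pd.DataFrame({"a": range(n)})
    df["c"] = pd.Series(
        [[{f"blabla_{j}": j for j in range(i)}] for i in df.index],
        index=df.index,
    )
    return df

df = get_df(100)
print(df["c"].iloc[2])  # [{'blabla_0': 0, 'blabla_1': 1}]
```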
Very surprised to see that, actually; my impression was that the recommendation is to first create the DataFrame with its full size allocated but empty, and then modify its elements, instead of adding/appending rows, because appending rows results in constant array creation and is less efficient.
At least that's what I thought I'd heard in places like Stack Overflow; perhaps I was mistaken.
@mar-ses modifying the df is definitely not recommended, because when you modify one cell, it invalidates the entire memory block behind it, which stores the values of other nearby cells. Do that in a loop for every cell, and you can see how much waste is created by allocating and invalidating memory blocks at each iteration.
For details about the BlockManager, you can read this blog post: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html
This blog post also contains a few other really good recommendations about pandas anti-patterns: https://www.aidancooper.co.uk/pandas-anti-patterns/
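A quick way to see the block replacement is to check whether a column's backing array survives an assignment (a sketch; the exact copy behaviour varies across pandas versions and copy-on-write settings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3.0)})
backing = df["a"].to_numpy()  # view of the current block on most versions

# Whole-column assignment swaps in a freshly allocated block;
# the old one lingers until the GC and the allocator release it.
df["a"] = df["a"] + 1
print(np.shares_memory(backing, df["a"].to_numpy()))  # False
```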
I'm now seeing
Initial memory usage: 79.9744 MB
Memory usage after iteration: 442.945536 MB
Memory usage after iteration: 247.074816 MB
Memory usage after iteration: 292.450304 MB
Memory usage after iteration: 345.153536 MB
Memory usage after iteration: 343.887872 MB
Memory usage after iteration: 369.041408 MB
Memory usage after iteration: 362.8032 MB
Memory usage after iteration: 409.530368 MB
Memory usage after iteration: 554.610688 MB
Memory usage after iteration: 431.632384 MB
@somurzakov is right that setting `df.at[i, "c"]` is not encouraged, but that's mostly for speed reasons, not memory usage. In fact, replacing the `.at[i, "c"]` loop with

```python
df["c"] = pd.Series([[{f"blabla_{j}": j for j in range(i)}] for i in df.index])
```

increases the memory footprint:
Memory usage after iteration: 445.48096 MB
Memory usage after iteration: 637.48096 MB
Memory usage after iteration: 807.624704 MB
Memory usage after iteration: 815.869952 MB
Memory usage after iteration: 846.483456 MB
Memory usage after iteration: 876.25728 MB
Memory usage after iteration: 974.516224 MB
Memory usage after iteration: 1005.826048 MB
Memory usage after iteration: 908.234752 MB
Memory usage after iteration: 946.753536 MB
@mar-ses can you confirm either of these results on main?
Also, any chance there is a typo in what you're trying to set? Each entry in `df["c"]` is a single-element list containing a decent-sized dict. Nested data is not encouraged.
Regarding the first question, I don't remember anymore, and I don't think I still have the example that was causing this at hand. I did try other ways of doing this than the example I gave, but I don't think I tried creating the series with such an inner list comprehension.
Regarding the second point, it's no typo, though I know it's discouraged. In this case, I was dealing with data from a database that includes a lot of "metadata" stored as JSON, and I actually need almost all of its contents. The JSON blobs are quite large and mostly of a fixed structure, but not exactly, so it would have been very awkward to try to expand them out first.
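For what it's worth, if the records ever do need expanding, `pd.json_normalize` copes reasonably well with mostly-fixed-but-not-exact structures (a sketch with made-up record shapes; missing keys simply become NaN):

```python
import pandas as pd

records = [
    {"id": 1, "meta": {"source": "db", "tags": {"x": 1}}},
    {"id": 2, "meta": {"source": "api"}},  # no "tags" here
]
flat = pd.json_normalize(records, sep=".")
print(sorted(flat.columns))  # ['id', 'meta.source', 'meta.tags.x']
```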
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I think the simplest way is for me to just post this bare example which shows the memory leak (I had a similar use case in my code with some data that I was pulling from a database):
The output of this is:
To me, this looks like a memory leak. The extra 3.5 GB or so that the process is using cannot be accounted for. I tried to look into it further by counting the sizes of everything in `globals()` with this hacky idea (taken and modified from this page), and I get the same picture:
Am I doing something wrong, or is this a true memory leak? Could it be related to the fact that I have all these dicts in the DataFrame? Does the space for their hash tables not get deallocated, or something? I don't really know how to debug this further.
I know some might say it's bad practice to have dicts in a DataFrame, but in a real-life example, I'm getting them because I query this data from a database, and some of the elements are json records and stuff like that.
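For reference, the "sum the sizes of everything in `globals()`" idea can be sketched as a recursive `getsizeof` (my own minimal version, not the exact snippet I used; it only follows plain containers, so it undercounts arbitrary objects):

```python
import sys

def deep_sizeof(obj, seen=None):
    """Approximate total size of obj in bytes, following plain containers."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:  # avoid double-counting shared objects / cycles
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(x, seen) for x in obj)
    return size

nested = {"a": [1, 2, 3], "b": {"c": "hello"}}
print(deep_sizeof(nested) > sys.getsizeof(nested))  # True
```

Even this kind of accounting can't see memory the allocator holds on to after objects are freed, which is one reason process-level numbers and object-level sums disagree.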
Installed Versions
Prior Performance
No response