@raulchen did some debugging and identified some odd behavior in pandas:
```python
import pandas as pd
import numpy as np
import pickle

df = pd.DataFrame({
    "data": [
        np.random.randint(size=1024, low=0, high=100, dtype=np.int8)
        for _ in range(1_000_000)
    ]
})
print(df["data"].size, df["data"].dtype, df.memory_usage(index=True, deep=True).sum())
# 1000000 object 1144000132

df2 = pickle.loads(pickle.dumps(df))
print(df2["data"].size, df2["data"].dtype, df2.memory_usage(index=True, deep=True).sum())
# 1000000 object 120000132
```
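For object columns, `memory_usage(deep=True)` sizes each element individually (via each object's `__sizeof__`), so the discrepancy should already show up on a single element. A quick check along these lines (a sketch, continuing from the snippet above; not from the original report):

```python
import sys

# An ndarray that owns its buffer reports header + payload; one that
# doesn't (OWNDATA=False, as after unpickling) reports only the header.
print(sys.getsizeof(df["data"].iloc[0]))   # roughly the 1024-byte payload + header
print(sys.getsizeof(df2["data"].iloc[0]))  # roughly the ~120-byte header only
```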
Posted this on StackOverflow as well. https://stackoverflow.com/questions/79149716/pandas-memory-usage-inconsistent-for-in-line-numpy
The only known clue right now is that the `OWNDATA` flag for the numpy array is different:

Before pickle:

```
- dtype: int8
- flags:
    C_CONTIGUOUS : True
    F_CONTIGUOUS : True
    OWNDATA : True
    WRITEABLE : True
    ALIGNED : True
    WRITEBACKIFCOPY : False
- strides: (1,)
- data pointer: 54063872
```

After pickle:

```
- dtype: int8
- flags:
    C_CONTIGUOUS : True
    F_CONTIGUOUS : True
    OWNDATA : False
    WRITEABLE : True
    ALIGNED : True
    WRITEBACKIFCOPY : False
- strides: (1,)
- data pointer: 1022687168
```
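For reference, the listings above can be reproduced with an inspection loop along these lines (a sketch; the original inspection script isn't shown):

```python
# Compare the first element of the "data" column before and after the
# pickle round trip; `flags` includes the OWNDATA bit noted above.
for label, frame in [("Before pickle", df), ("After pickle", df2)]:
    arr = frame["data"].iloc[0]
    print(label)
    print("- dtype:", arr.dtype)
    print("- flags:", arr.flags)
    print("- strides:", arr.strides)
    print("- data pointer:", arr.ctypes.data)
```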
While it's true that this is no longer an issue if the blocks are Arrow tables, you'll still run into it if the blocks are pandas tables. This can happen if you use the `"pandas"` batch format, or if you use APIs like `drop_columns` that use the `"pandas"` batch format under the hood. Here's a simple repro:
Output with pandas:
Output with PyArrow:
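The repro code and its outputs aren't captured in this excerpt. A minimal sketch of that kind of comparison, assuming Ray Data's `range`, `map_batches(batch_format=...)`, `materialize`, and `Dataset.size_bytes` APIs, with a column of 1 KiB `int8` arrays mirroring the pandas example at the top:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import ray

def add_payload(batch):
    # Attach a 1 KiB int8 array per row, handling both batch formats.
    payload = [
        np.random.randint(size=1024, low=0, high=100, dtype=np.int8)
        for _ in range(len(batch))
    ]
    if isinstance(batch, pd.DataFrame):  # "pandas" batch format
        batch["data"] = payload
        return batch
    return batch.append_column("data", pa.array(payload))  # "pyarrow"

for fmt in ("pandas", "pyarrow"):
    ds = ray.data.range(100_000).map_batches(add_payload, batch_format=fmt)
    print(fmt, ds.materialize().size_bytes())
```

If pandas blocks report their size through `memory_usage(deep=True)`-style accounting, the `"pandas"` run would inherit the same skewed numbers shown above, while the `"pyarrow"` run stores the payload as a `list<int8>` column and sizes it directly.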
Originally posted by @bveeramani in https://github.com/ray-project/ray/issues/44577#issuecomment-2428035432