In [1]: import pandas as pd; from pandas.util.testing import rands_array
In [2]: df = pd.DataFrame({'A': rands_array(10, 10000), 'B': rands_array(100, 10000)})
In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
A 10000 non-null object
B 10000 non-null object
dtypes: object(2)
memory usage: 156.3+ KB
In [5]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
A 10000 non-null object
B 10000 non-null object
dtypes: object(2)
memory usage: 2.1 MB
@Stigjb you are misunderstanding the measure of memory. The '+' is accurate for numerical types, since that memory is allocated and managed by numpy; for object dtypes, however, it does not include the storage of the actual Python objects themselves. Passing memory_usage='deep' calculates this as well.
In [6]: df.memory_usage()
Out[6]:
Index 80
A 80000
B 80000
dtype: int64
In [8]: df.memory_usage(deep=True)
Out[8]:
Index 80
A 670000
B 1570000
dtype: int64
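A rough way to see where the deep figures come from: each object column stores one 8-byte pointer per row, and memory_usage(deep=True) additionally counts sys.getsizeof() of every Python string those pointers reference. The sketch below reproduces the numbers above approximately; the exact per-string overhead depends on the Python version and build.

import sys
import pandas as pd
from pandas.util.testing import rands_array

df = pd.DataFrame({'A': rands_array(10, 10000), 'B': rands_array(100, 10000)})

# Shallow: one 8-byte object pointer per row, per column.
pointers = df['A'].size * 8                      # 80000

# Deep: the pointers plus sys.getsizeof() of every Python string.
deep_a = pointers + sum(sys.getsizeof(s) for s in df['A'])
deep_b = pointers + sum(sys.getsizeof(s) for s in df['B'])

print(deep_a, deep_b)                 # roughly 670000 and 1570000 on CPython 3.x
print(df.memory_usage(deep=True))     # pandas' own deep accounting for comparison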
Thank you! I tried the same query as above again, and the reported memory usage of the dataframe with the memory_usage='deep' parameter was 408.8 MB this time. The RAM usage of the process is still about 10 times higher than this value; is that just something to be expected going forward as well?
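To put those two numbers side by side concretely, one option is to compare the DataFrame's own accounting with the resident set size of the Python process, for example via the third-party psutil package. This is a minimal sketch under that assumption, not something read_gbq itself does:

import os
import pandas as pd
import psutil  # third-party; pip install psutil

def report(df):
    # Resident set size of the whole interpreter process.
    rss = psutil.Process(os.getpid()).memory_info().rss
    # pandas' deep accounting of the DataFrame alone.
    deep = df.memory_usage(deep=True).sum()
    print('process RSS:  %.1f MB' % (rss / 2**20))
    print('df deep size: %.1f MB' % (deep / 2**20))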
I have noticed that reading large tables uses far more RAM than I think it should. For example, the following snippet that queries a public table uses around 4 GiB of memory while loading the pages (34 in total), as I watch the process in the system monitor. The final DataFrame, however, only uses 88.5+ MB, as told by df.info(). I know that read_gbq keeps the partial pages in a list before concatenating them all into a result DataFrame, but it seems to me that this should not require 50 times the memory, but rather an order of magnitude less. Here is the actual output from df.info():

Pandas version: 0.20.3
Python version: 3.5.2
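For reference, the concatenation pattern described above can be simulated without BigQuery at all; the page count and page sizes below are made-up stand-ins, not values from the real query. Holding every page plus the concatenated result should only roughly double the DataFrame's own footprint, which is why the 4 GiB figure is surprising:

import numpy as np
import pandas as pd
from pandas.util.testing import rands_array

# Pretend each "page" returned by the API is a small DataFrame.
pages = [pd.DataFrame({'A': rands_array(10, 1000),
                       'B': np.random.randn(1000)})
         for _ in range(34)]

pages_mb = sum(p.memory_usage(deep=True).sum() for p in pages) / 2**20
result = pd.concat(pages, ignore_index=True)
result_mb = result.memory_usage(deep=True).sum() / 2**20

# While both the list of pages and the result are alive, the total is
# roughly pages_mb + result_mb, i.e. about 2x the final frame, not 50x.
print('pages: %.1f MB, result: %.1f MB' % (pages_mb, result_mb))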