googleapis / python-bigquery-pandas

Google BigQuery connector for pandas
https://googleapis.dev/python/pandas-gbq/latest/index.html
BSD 3-Clause "New" or "Revised" License

read_gbq uses excessive amounts of memory #92

Closed: Stigjb closed this issue 6 years ago

Stigjb commented 6 years ago

I have noticed that reading large tables uses far more RAM than I think it should. For example, the following snippet, which queries a public table, uses around 4 GiB of memory while loading the pages (34 in total), as observed in the system monitor. The final DataFrame, however, uses only 88.5+ MB as reported by df.info().

import pandas
query = ('SELECT * FROM `bigquery-public-data.noaa_gsod.gsod2016` WHERE mo = "01"')
df = pandas.read_gbq(query, project_id=<ID>, dialect='standard')
df.info()

I know that read_gbq keeps the partial pages in a list before concatenating them all into the result DataFrame, but it seems to me that this should not require 50 times the final DataFrame's memory; an order of magnitude less would seem more reasonable.
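For illustration, the accumulation pattern described above looks roughly like the following simplified sketch. This is not the actual pandas-gbq code; fake_pages is a made-up stand-in for the paged API responses, with column names borrowed from the table above.

import pandas as pd

def fake_pages(n_pages=5, rows_per_page=10_000):
    """Stand-in for the paged BigQuery responses: each page is a dict of columns."""
    for _ in range(n_pages):
        yield {
            "stn": [f"station-{j:06d}" for j in range(rows_per_page)],
            "temp": [float(j % 100) for j in range(rows_per_page)],
        }

# Each page becomes its own DataFrame, and every page stays alive in the
# list until the final concatenation.
pages = [pd.DataFrame(page) for page in fake_pages()]

# pd.concat copies all pages into the result, so at peak both the page
# list and the concatenated frame are resident in memory at the same time.
df = pd.concat(pages, ignore_index=True)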

Here is the actual output from df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362568 entries, 0 to 362567
Data columns (total 32 columns):
stn                     362568 non-null object
wban                    362568 non-null object
year                    362568 non-null object
mo                      362568 non-null object
da                      362568 non-null object
temp                    362568 non-null float64
count_temp              362568 non-null int64
dewp                    362568 non-null float64
count_dewp              362568 non-null int64
slp                     362568 non-null float64
count_slp               362568 non-null int64
stp                     362568 non-null float64
count_stp               362568 non-null int64
visib                   362568 non-null float64
count_visib             362568 non-null int64
wdsp                    362568 non-null object
count_wdsp              362568 non-null object
mxpsd                   362568 non-null object
gust                    362568 non-null float64
max                     362568 non-null float64
flag_max                362568 non-null object
min                     362568 non-null float64
flag_min                362568 non-null object
prcp                    362568 non-null float64
flag_prcp               362568 non-null object
sndp                    362568 non-null float64
fog                     362568 non-null object
rain_drizzle            362568 non-null object
snow_ice_pellets        362568 non-null object
hail                    362568 non-null object
thunder                 362568 non-null object
tornado_funnel_cloud    362568 non-null object
dtypes: float64(10), int64(5), object(17)
memory usage: 88.5+ MB

Pandas version: 0.20.3
Python version: 3.5.2

jreback commented 6 years ago
In [1]: from pandas.util.testing import rands_array

In [2]: df = pd.DataFrame({'A': rands_array(10, 10000), 'B': rands_array(100, 10000)})

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
A    10000 non-null object
B    10000 non-null object
dtypes: object(2)
memory usage: 156.3+ KB

In [5]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
A    10000 non-null object
B    10000 non-null object
dtypes: object(2)
memory usage: 2.1 MB

@Stigjb you are misunderstanding the memory measure. The number before the '+' is accurate for numerical dtypes, whose memory is allocated and managed by NumPy; for object dtypes, however, it does not include the storage of the Python objects themselves. Passing memory_usage='deep' calculates that as well.
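To make the gap concrete: for an object column the shallow figure only counts the 8-byte pointers, while each short string carries a few dozen bytes of per-object overhead on top of its characters. A minimal illustration, assuming a 64-bit CPython (this is just an aside, not from the thread):

import sys
import pandas as pd

# A small object-dtype column of distinct short strings.
s = pd.Series([str(i) for i in range(10_000)])

# Shallow count: 8 bytes per row on a 64-bit build, i.e. just the pointers
# to the Python string objects.
print(s.memory_usage(index=False))

# Deep count: also adds sys.getsizeof() of every string, which for short
# strings is dominated by per-object overhead.
print(s.memory_usage(index=False, deep=True))

# Size of one short string object on its own:
print(sys.getsizeof("01"))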

jreback commented 6 years ago
In [6]: df.memory_usage()
Out[6]: 
Index       80
A        80000
B        80000
dtype: int64

In [8]: df.memory_usage(deep=True)
Out[8]: 
Index         80
A         670000
B        1570000
dtype: int64

Stigjb commented 6 years ago

Thank you! I tried the same query as above again, and with the memory_usage='deep' parameter the reported memory usage of the DataFrame was 408.8 MB this time. The process's RAM usage is still about 10 times higher than that value; is that just to be expected going forward as well?
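For anyone wanting to compare the two numbers directly, here is a rough sketch using only the standard library's resource module (Unix-only; note that ru_maxrss is reported in KiB on Linux but in bytes on macOS). report_memory is just an illustrative helper, not part of pandas-gbq.

import resource

def report_memory(df):
    """Compare the process's peak RSS with the DataFrame's own deep footprint."""
    # ru_maxrss is in KiB on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    df_mib = df.memory_usage(deep=True).sum() / 2**20
    print(f"peak RSS (Linux units): {peak / 1024:.1f} MiB")
    print(f"DataFrame (deep):       {df_mib:.1f} MiB")

# e.g. after the read_gbq call from the first comment:
# df = pandas.read_gbq(query, project_id=<ID>, dialect='standard')
# report_memory(df)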