googleapis / google-api-python-client

🐍 The official Python client library for Google's discovery based APIs.
https://googleapis.github.io/google-api-python-client/docs/
Apache License 2.0

Possible memory leak when using Google Sheets API #535

Closed: mrc0mmand closed this issue 4 years ago

mrc0mmand commented 6 years ago

There seems to be a memory leak when using google-api-python-client with the Google Sheets API.

Environment:

$ python --version
Python 3.6.4
$ pip show google-api-python-client
Name: google-api-python-client
Version: 1.7.3

Here's a simple reproducer (you'll need to supply your own .client_secret.json):

#!/usr/bin/env python3

import httplib2
import os
from googleapiclient import discovery  # "apiclient" is a legacy alias
from memory_profiler import profile
from oauth2client import client, tools
from oauth2client.file import Storage
from time import sleep

SCOPES = "https://www.googleapis.com/auth/spreadsheets.readonly"
# See https://cloud.google.com/docs/authentication/getting-started
CLIENT_SECRET_FILE = ".client_secret.json"
APPLICATION_NAME = "ClientDebug"
DISCOVERY_URL = "https://sheets.googleapis.com/$discovery/rest?version=v4"

def get_credentials():
    home_dir = os.path.expanduser("~")
    credential_dir = os.path.join(home_dir, ".credentials")
    flags = None
    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)
    credential_path = os.path.join(credential_dir,
                                "sheets.googleapis.com-clientdebug.json")

    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        credentials = tools.run_flow(flow, store, flags)

    return credentials

@profile(precision=4)
def get_responses(creds):
    """Fetch spreadsheet data."""
    sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"

    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
                discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    print("Got {} rows".format(len(values)))

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 50):
        get_responses(creds)
        sleep(2)

For measurements I used the memory_profiler module, with the following results:

First and second iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  26.5195 MiB  26.5195 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  26.5195 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  26.5195 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  26.5195 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  29.2891 MiB   2.7695 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  49.5742 MiB  20.2852 MiB       result = service.spreadsheets().values().get(
    44  49.5742 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  49.5742 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  49.5742 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  49.5742 MiB  49.5742 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  49.5742 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  49.5742 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  49.5742 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  49.5742 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  67.9922 MiB  18.4180 MiB       result = service.spreadsheets().values().get(
    44  67.9922 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  67.9922 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  67.9922 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

Last iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35 229.6055 MiB 229.6055 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38 229.6055 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40 229.6055 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41 229.6055 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42 229.6055 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43 229.6055 MiB   0.0000 MiB       result = service.spreadsheets().values().get(
    44 229.6055 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45 229.6055 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47 229.6055 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

There's clearly a memory leak, as the reproducer fetches the same data over and over again, yet the memory consumption keeps rising. Full log can be found here.

As a temporary workaround for one of my long-running applications I use an explicit garbage collector call, which mitigates this issue, at least for now:

...
import gc
...
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    gc.collect()
...

I went a little deeper, and the main culprit seems to be in the createMethod function when it creates the dynamic method batchUpdate:

Method 'batchUpdate', approx. __doc__ size: 2886834
<class 'function'>                                                                                                                                                            
Filename: /home/fsumsal/venv/googleapiclient/lib64/python3.6/site-packages/googleapiclient/discovery.py                                                                       

Line #    Mem usage    Increment   Line Contents                                                                                                                              
================================================                                                                                                                              
  1064     48.7 MiB     48.7 MiB     @profile                                                                                                                                 
  1065                               def _add_basic_methods(self, resourceDesc, rootDesc, schema):                                                                            
  1066                                 # If this is the root Resource, add a new_batch_http_request() method.                                                                 
  1067     48.7 MiB      0.0 MiB       if resourceDesc == rootDesc:                                                                                                           
...                          
  1086                                                                                                                                                                        
  1087                                 # Add basic methods to Resource                                                                                                        
  1088     48.7 MiB      0.0 MiB       if 'methods' in resourceDesc:                                                                                                          
  1089     66.8 MiB      0.0 MiB         for methodName, methodDesc in six.iteritems(resourceDesc['methods']):                                                                
  1090     56.0 MiB      0.0 MiB           fixedMethodName, method = createMethod(                                                                                            
  1091     66.8 MiB     18.1 MiB               methodName, methodDesc, rootDesc, schema)                                                                                      
  1092     66.8 MiB      0.0 MiB           print(type(method))                                                                                                                
  1093     66.8 MiB      0.0 MiB           self._set_dynamic_attr(fixedMethodName,             
  1094     66.8 MiB      0.0 MiB                                  method.__get__(self, self.__class__))                                                                                        
  1095                                     # Add in _media methods. The functionality of the attached method will                                                                              
  1096                                     # change when it sees that the method name ends in _media.                                                                                          
  1097     66.8 MiB      0.0 MiB           if methodDesc.get('supportsMediaDownload', False):  
  1098                                       fixedMethodName, method = createMethod(           
  1099                                           methodName + '_media', methodDesc, rootDesc, schema)                                                                                          
  1100                                       self._set_dynamic_attr(fixedMethodName,           
  1101                                                              method.__get__(self, self.__class__))

(This method has a huge docstring.)

Nevertheless, there is probably a reference loop somewhere, as the gc.collect() call manages to collect all those unreachable objects.
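To confirm these are cyclic-but-unreachable objects rather than live references, here is a small probe (a sketch that reuses get_responses() and creds from the reproducer above):

import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # keep whatever the collector frees in gc.garbage
get_responses(creds)            # one iteration of the reproducer
unreachable = gc.collect()
print("objects only the cycle collector could free:", unreachable)
print("sample:", [type(o).__name__ for o in gc.garbage[:5]])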

mcdonc commented 6 years ago

I've been able to replicate this with a variant of your script, thanks!

I think it may be possible to cache the result of createMethod so that we don't create a new closure for every call, which would make the memory leak less dramatic (probably not noticeable) and the build faster. But before I do that, I'll try to identify the actual cycle.
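Roughly the caching shape I have in mind (just a sketch, not final code; discovery documents carry stable "id" fields for the API and for each method, which make a safe cache key across repeated builds):

_method_cache = {}

def cached_create_method(methodName, methodDesc, rootDesc, schema):
    # memoize on the discovery doc's stable identifiers so repeated
    # service builds reuse the same generated function object
    key = (rootDesc.get("id"), methodDesc.get("id"), methodName)
    if key not in _method_cache:
        _method_cache[key] = createMethod(methodName, methodDesc,
                                          rootDesc, schema)
    return _method_cache[key]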

mcdonc commented 6 years ago

Running debugging commentary for future reference:

Code:

creds = get_credentials()
http = creds.authorize(httplib2.Http())
service = discovery.build("sheets", "v4", http=http,
            discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
sheets = service.spreadsheets()
for i in range(0, 50):
    get_responses(sheets)
    sleep(2) # no leak

Code:

def __del__(self):
  for dynamic_attr in self._dynamic_attrs:
    del self.__dict__[dynamic_attr]
  del self._dynamic_attrs

Code:

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 100):
        get_responses(creds)
        objgraph.show_growth() # prints nothing
        sleep(1)

Code:

def close(self):
  for dynamic_attr in self._dynamic_attrs:
    del self.__dict__[dynamic_attr]
  self._dynamic_attrs = []

Code:

def _bind_weakmethod(self, attr_name, func):
  instance_ref = weakref.ref(self)
  cls = self.__class__
  def themethod(*arg, **kw):
    instance = instance_ref()
    if instance is None:
      raise ValueError('Cannot call function; instance was purged')
    function = func.__get__(instance, cls)
    return function(*arg, **kw)
  self._set_dynamic_attr(attr_name, themethod)

And then eg.:

# Add basic methods to Resource
if 'methods' in resourceDesc:
  for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
    fixedMethodName, method = createMethod(
        methodName, methodDesc, rootDesc, schema)
    self._bind_weakmethod(fixedMethodName, method)

mcdonc commented 6 years ago

This is the "leak", a very simple circref. It cannot be avoided without a redesign.

[image "gapi-circref": object graph showing the cycle between a Resource instance and its stored bound methods]

It is caused by implementing "dynamic methods" using the descriptor protocol; see https://github.com/google/google-api-python-client/blob/master/googleapiclient/discovery.py#L1088

If there is a way around this circref, I don't know it.
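For illustration, the same kind of cycle reproduces in a few lines (a toy example, not the library's code):

import gc

class Resource:
    def __init__(self):
        def dynamic_method(self):
            return "hello"
        # the generated function is bound to the instance and the bound
        # method is stored on the instance itself; that bound method holds
        # a reference back to the instance, which closes the cycle
        self.get = dynamic_method.__get__(self, Resource)

gc.collect()         # start from a clean slate
Resource()           # unreachable immediately, but only via the cycle
print(gc.collect())  # non-zero: only the cycle collector could free it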

mcdonc commented 6 years ago

Now that we've figured out why there appears to be a leak: a) is it a problem? and b) if so, what can we do about it?

Is it a problem?

Yes and no.

Theoretically, no, it's not a problem. Reference cycles in programs are normal, and Python's garbage collector will eventually trash the objects involved in cycles. Every time a Resource object is created (by calling methods on other Resource objects), we get some cycles between that Resource and its dynamic methods, and, in theory, this is fine.

Practically, yes, it's a problem. Repeated Resource creation causes the process RSS to bloat, and, on Linux at least, the memory consumed by these references is not given back to the OS due to memory fragmentation, even after the cycles are broken.

What can we do about it?

I've put in some work on a branch (https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535) trying to make the symptoms slightly better.

Try #1 on that branch, which is now a few commits back and isn't represented by the current state of the branch, was caching the functions that become methods on Resource objects, creating only one function per input instead of one per call. This is not a reasonable fix, however, because refs involved in cycles still grow; every time a Resource is instantiated, it binds itself to some number of methods, and even if the functions representing those methods are not repeatedly created, the act of binding cached methods to each instance still creates cycles.

Try #2, which represents the current state of the branch, dynamically creates and caches one Resource class per set of inputs, instead of just caching the result of dynamic method creation. This stops using the descriptor protocol to bind dynamic methods to instances, so the only circrefs left are the same ones you'd get if each resource type had its own class in sys.modules['googleapiclient.discovery']. The number of circrefs is dramatically reduced, and RSS growth is bounded after the first call of the replication script (unlike master, where it grows without bound on each call, although every so often gc kicks in and brings it down a little). According to gc.set_debug(gc.DEBUG_LEAK) under Python 3.6, the length of gc.garbage is 2214 after 40 iterations of the reproducer script's for-loop, instead of master's gargantuan 45218. I believe we could bring that down further by fixing an unrelated leak. However, the resulting instances cannot be pickled, which is, I believe, part of their API.
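Roughly the Try #2 shape (a sketch of the idea, not the branch's actual code):

_resource_class_cache = {}

def resource_class(api_id, method_funcs):
    # one dynamically created class per discovery input; functions set on
    # the class are bound per-lookup by normal attribute access, so no
    # bound method is ever stored on an instance
    cls = _resource_class_cache.get(api_id)
    if cls is None:
        cls = type("Resource_" + api_id.replace(":", "_"),
                   (object,), dict(method_funcs))
        _resource_class_cache[api_id] = cls
    return cls

# instances create a fresh bound method on each attribute lookup, so
# there is no instance <-> method cycle left for gc to collect
SheetsValues = resource_class("sheets:v4", {"get": lambda self: "stub"})
print(SheetsValues().get())  # -> "stub"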

So I think we have a few options; later comments refer to them (e.g. adding a .close() method, or only creating the service object once).

mrc0mmand commented 6 years ago

Thank you very much for the thorough analysis. As you said, a complete fix would require some time to implement, so the temporary fix of using a cache along with gc.collect() is good enough for now to keep memory consumption at bay.

Again, thanks!

mcdonc commented 6 years ago

@mrc0mmand Yes, in your particular case, creating "sheets" only once would make it leak so little that you won't need gc.collect().

mcdonc commented 6 years ago

@theacodes can you advise about which one of the options in https://github.com/google/google-api-python-client/issues/535#issuecomment-404994715 is most appropriate?

hx2A commented 6 years ago

@mcdonc @theacodes If I get a vote, I'd like to see the second option, adding a .close() method. I've spent the past week or so tracking down this same memory leak and found my way to this page. As it happens, I specifically looked for close() methods on the Resource objects because I knew something somewhere wasn't being released.

Adding a .close() method seems cleaner than having to call gc.collect(). Either way I have to do something to clean up resources, and calling .close() is analogous to what we already do for files and other things.
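Usage could then mirror what we already do for files (a hypothetical sketch; the library does not expose a close() today):

service = discovery.build("sheets", "v4", http=http, cache_discovery=False)
try:
    result = service.spreadsheets().values().get(
        spreadsheetId=sheet_id, range="A1:O").execute()
finally:
    service.close()  # hypothetical: eagerly break the Resource cycles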

In any case, this issue should be mentioned in the documentation and sample code, please!

willjhenry commented 4 years ago

I have a cron job on Google App Engine that reads data in from a Google Sheet. I am noticing the same memory leak (or maybe a different one?). I tried the recommended workarounds: 1) creating the "sheets" object only once, and 2) using gc.collect(). Neither worked in my case. As a test, I changed the few lines of code that read data from a Google Sheet to read from a database table instead, and the memory leak went away.

dcoleyoung commented 4 years ago

> I have a cron job on Google App Engine that reads data in from a Google Sheet. I am noticing the same memory leak (or maybe a different one?). I tried the recommended workarounds: 1) creating the "sheets" object only once, and 2) using gc.collect(). Neither worked in my case. As a test, I changed the few lines of code that read data from a Google Sheet to read from a database table instead, and the memory leak went away.

Can you help me clarify your last sentence here? Did you ever fix the code, or just confirm the memory leak? I'm in the same situation: an App Engine job that reads a Google Sheet, and I started getting "Exceeded soft memory limit" errors. And like you, the garbage-collection suggestions did not help.

willjhenry commented 4 years ago

I never fixed it... in the short term, I used a high-memory App Engine instance so it would take longer to hit the memory threshold, and then, as a long-term solution, I switched to Airtable instead of Google Sheets.

dcoleyoung commented 4 years ago

> I never fixed it... in the short term, I used a high-memory App Engine instance so it would take longer to hit the memory threshold, and then, as a long-term solution, I switched to Airtable instead of Google Sheets.

Got it - I'll look into using Airtable instead. Great suggestion. Appreciate the help.

AmosDinh commented 4 years ago

My solution in the end was just to use "pure" HTTP requests.

hx2A commented 4 years ago

@AmosDinh Can you elaborate? In my project the memory leak issue has reared its ugly head again and I am looking for new approaches to deal with it.

AmosDinh commented 4 years ago

@hx2A Well, I don't use the API at all, except for creating credentials. Here is an example from my class.

def buildCredentials(self, refresh=False):
    # requires: import httplib2, and
    # from oauth2client.service_account import ServiceAccountCredentials
    if refresh:
        self.credentials.refresh(httplib2.Http())
    self.credentials = ServiceAccountCredentials.from_json_keyfile_name(
        "myCreds.json",
        scopes=[
            'https://www.googleapis.com/auth/drive',
            'https://www.googleapis.com/auth/spreadsheets',
        ])
    # delegate to the target account (service_account_email is defined
    # elsewhere in the class) and grab a bearer token for raw requests
    delegated_credentials = self.credentials.create_delegated(
        service_account_email)
    access_token = delegated_credentials.get_access_token().access_token
    self.headers = {
        "Authorization": "Bearer " + access_token,
    }

and

def spreadsheetAppend(self, spreadsheetId, list2d, secondcycle=False):
    # requires: import json, requests
    valueInputOption = "RAW"
    cell_range = "!A:A"
    url = (f"https://sheets.googleapis.com/v4/spreadsheets/{spreadsheetId}"
           f"/values/{cell_range}:append?valueInputOption={valueInputOption}")
    data = json.dumps({
        "range": cell_range,
        "majorDimension": "ROWS",
        "values": list2d,
    })
    r = requests.post(url, headers=self.headers, data=data)
    if r.status_code > 399:
        print(r, r.headers)
        # on an expired token, refresh the credentials once and retry
        if r.status_code == 401 and not secondcycle:
            self.buildCredentials(refresh=True)
            self.spreadsheetAppend(spreadsheetId, list2d, secondcycle=True)
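
Usage looks roughly like this (SheetsClient is a stand-in name for my class):

client = SheetsClient()
client.buildCredentials()
client.spreadsheetAppend("your-spreadsheet-id",
                         [["col1", "col2"], ["val1", "val2"]])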

hx2A commented 4 years ago

@AmosDinh Thank you, I understand now. This is helpful.

AmosDinh commented 4 years ago

@hx2A glad I could help you

AmosDinh commented 4 years ago

Alternatively, you could run the Sheets API code in another process, which you could terminate after execution or once RAM usage hits a certain threshold. Just wanted to include that option; a sketch follows.
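Something like this (assuming a fetch_rows() helper that builds the service and returns plain data):

from multiprocessing import Pool

def fetch_rows(sheet_id):
    # build the service and fetch values here; return only plain data so
    # anything the Sheets client leaks dies with the worker process
    ...

if __name__ == "__main__":
    # maxtasksperchild=1 recycles the worker after every task, so RSS
    # growth inside the client is bounded to a single call
    with Pool(processes=1, maxtasksperchild=1) as pool:
        rows = pool.apply_async(fetch_rows, ("your-sheet-id",)).get()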

hx2A commented 4 years ago

@AmosDinh I'm trying to get this working now but I keep getting 403 'Forbidden' responses. I believe it has something to do with my service account's roles and permissions. Can you tell me about how your service account is configured? My current memory leaking code doesn't seem to be using the service account so I need to be sure it is configured correctly.

AmosDinh commented 4 years ago

Which drive are you accessing? Your personal one, which you can reach by going to https://drive.google.com? In that case you have to add the service account by its email to a folder as editor/owner, then you can edit or create files in that folder using the service account credentials.

hx2A commented 4 years ago

I just got it to work. I had multiple problems, but a big part of it was that I needed to share the files on my Google Drive with the service account's email address. Thanks!

marvic2409 commented 4 years ago

I had the same memory leak, so I just sprinkled gc.collect() everywhere and bam, now it's manageable. I doubt that this would count as a fix, though.

hx2A commented 4 years ago

@marvic2409 you might have a slow memory leak that leaks a small number of MB every hour. For a decent-sized system, it will take some time to become a problem.

busunkim96 commented 4 years ago

Circling back to this: the recommendation from https://github.com/googleapis/google-api-python-client/issues/535#issuecomment-404994715, quoted below, is the best way to avoid the leak.

  • Only creating "sheets" once prevents the leak too (if it's created more than once, the leak remains):

    creds = get_credentials()
    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
            discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    sheets = service.spreadsheets()
    for i in range(0, 50):
        get_responses(sheets)
        sleep(2)  # no leak

Creating multiple service objects (1) can cause memory problems and (2) wastes time refreshing credentials. If you're creating a service object inside a loop, or in a function that's called more than once, move it outside the loop or function so it can be reused.
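One way to make the create-once pattern hard to get wrong (a sketch reusing the helpers from the reproducer above):

import functools

@functools.lru_cache(maxsize=None)
def sheets_service():
    # built exactly once per process, then reused by every caller
    creds = get_credentials()
    http = creds.authorize(httplib2.Http())
    return discovery.build("sheets", "v4", http=http,
                           discoveryServiceUrl=DISCOVERY_URL,
                           cache_discovery=False).spreadsheets()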