I've been able to replicate this with a variant of your script, thanks!
I think it may be possible to cache the result of createMethod such that we don't create a new closure for every call, which would make the memory leak less dramatic (probably not noticeable), and faster. But before I do that I'll try to identify the actual cycle.
Running debugging commentary for future reference:
Not a function of the docstring being large; if I change createMethod to not set the docstring onto the dynamically generated method, it still leaks.
Causing the generated method to be a noop doesn't fix the leak.
Only creating "sheets" once prevents the leak too (if it's created more than once, the leak remains):
Code:
creds = get_credentials()
http = creds.authorize(httplib2.Http())
service = discovery.build("sheets", "v4", http=http,
                          discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
sheets = service.spreadsheets()
for i in range(0, 50):
    get_responses(sheets)
    sleep(2)  # no leak
Code:
def __del__(self):
    for dynamic_attr in self._dynamic_attrs:
        del self.__dict__[dynamic_attr]
    del self._dynamic_attrs
The objgraph library fails to show growing references after the first iteration. A further frustration is that importing objgraph causes the leak to go away, creating a Heisenbug. This turns out to be because objgraph imports graphviz, and something in graphviz does... something. But even if graphviz doesn't get imported by objgraph, calling show_growth() also causes a Heisenbug; calling show_growth() prevents the leak.
Code:
if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 100):
        get_responses(creds)
        objgraph.show_growth()  # prints nothing
        sleep(1)
Code:
def close(self):
    for dynamic_attr in self._dynamic_attrs:
        del self.__dict__[dynamic_attr]
    self._dynamic_attrs = []
Calling service.spreadsheets().values() after that raises the ValueError('Cannot call function; instance was purged') error.
Code:
def _bind_weakmethod(self, attr_name, func):
    instance_ref = weakref.ref(self)
    cls = self.__class__
    def themethod(*arg, **kw):
        instance = instance_ref()
        if instance is None:
            raise ValueError('Cannot call function; instance was purged')
        function = func.__get__(instance, cls)
        return function(*arg, **kw)
    self._set_dynamic_attr(attr_name, themethod)
And then e.g.:
# Add basic methods to Resource
if 'methods' in resourceDesc:
    for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
        fixedMethodName, method = createMethod(
            methodName, methodDesc, rootDesc, schema)
        self._bind_weakmethod(fixedMethodName, method)
This is the "leak", a very simple circref. It cannot be avoided without a redesign.
It is caused by implementing "dynamic methods" using the descriptor protocol, a la https://github.com/google/google-api-python-client/blob/master/googleapiclient/discovery.py#L1088
If there is a way around this circref, I don't know it.
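As a stripped-down illustration of that cycle (this is not the library's code, just the shape of the problem):

import gc

class Resource(object):
    def _set_dynamic_attr(self, attr_name, func):
        # bind func to this instance via the descriptor protocol and store the
        # resulting bound method directly in the instance's __dict__
        self.__dict__[attr_name] = func.__get__(self, self.__class__)

def list_method(self):
    return 'listed'

r = Resource()
r._set_dynamic_attr('list', list_method)
assert r.list() == 'listed'
# r.__dict__['list'].__self__ is r, so the instance and its bound method form a
# reference cycle; plain reference counting can never free r, only gc.collect() can.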
Now that we've figured why we perceive there is a leak, a) is it a problem? and b), if so, what can we do about it?
Yes and no.
Theoretically, no, it's not a problem. Reference cycles in programs are normal, and Python's garbage collector will eventually trash the objects involved in cycles. Every time a Resource object is created (by calling methods on other Resource objects), we get some cycles between that Resource and its dynamic methods, and, in theory, this is fine.
Practically, yes, it's a problem. Repeated Resource creation causes the process RSS to bloat, and, on Linux at least, the memory consumed by these references is not given back to the OS due to memory fragmentation, even after the cycles are broken.
I've put in some work on a branch (https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535) trying to make the symptoms slightly better.
Try #1 on that branch, which is now a few commits back, and isn't represented by the current state of the branch, was caching the functions that become methods on Resource objects, only creating one function per input instead of one per call. This is not a reasonable fix, however, because the number of refs involved in cycles still grows; every time a Resource is instantiated, it binds itself to some number of methods, and even if the functions representing these methods are not repeatedly created, the act of binding cached methods to each instance still creates cycles.
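A minimal sketch of that kind of memoization (the names and the hashable cache key are hypothetical; createMethod's real methodDesc argument is a dict, so a key would have to be derived from it):

_method_cache = {}

def cached_create_method(method_name, method_key, build):
    # build() stands in for the existing createMethod call; it runs once per
    # distinct (method_name, method_key) pair instead of once per Resource.
    key = (method_name, method_key)
    if key not in _method_cache:
        _method_cache[key] = build()
    return _method_cache[key]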
Try #2, which represents the current state of the branch, dynamically creates and caches one Resource class per set of inputs, instead of just caching the result of dynamic method creation. This disuses the descriptor protocol to bind dynamic methods to instances, so the only circrefs are those that would exist if each resource type had its own class in sys.modules['googleapiclient.discovery']. The number of circrefs is dramatically reduced, and RSS growth is bounded after the first call of the replication script (unlike master, where it grows without bound on each call, although every so often gc kicks in and brings it down a little). According to gc.set_debug(gc.DEBUG_LEAK) under py 3.6, the length of gc.garbage is 2214 after 40 iterations of the reproducer script's for-loop, instead of master's gargantuan 45218. And I believe we could bring that down more by fixing a different, unrelated leak. However, the resulting instances cannot be pickled, which is, I believe, part of their API.
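A simplified sketch of the Try #2 idea, with hypothetical names (the branch's real code differs): cache one dynamically built class per distinct resource description and put the generated functions on the class, so binding happens through ordinary attribute lookup instead of per-instance closures:

_resource_class_cache = {}

def class_for_resource(resource_key, method_functions, base=object):
    # method_functions maps fixed method names to plain functions created once;
    # instances of the returned class never store bound methods in their __dict__.
    cls = _resource_class_cache.get(resource_key)
    if cls is None:
        cls = type('Resource_' + resource_key, (base,), dict(method_functions))
        _resource_class_cache[resource_key] = cls
    return cls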
So I think we have these options:
Cause pickling to no longer be part of the API and use the code on the https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535 branch. If pickling was absolutely necessary, we could create (and instruct people to use) a custom unpickler that generated the dynamic classes for us, by subclassing Unpickler ala https://docs.python.org/3/library/pickle.html#pickle.Unpickler
Cause Resource instances to refer to any subresource they create and expose a "close()" method on resources, which would clear a resource's dynamic methods and the dynamic methods of any subresource recursively, breaking the refcycles (see the sketch after this list). However, this method would need to be explicitly called by library users; there isn't a natural place for us to call it.
Do nothing. We punt and tell people to a) create as few resources as possible (don't do more work than is necessary) and to b) call gc.collect() after sections of their code that have a side effect of creating lots of resources.
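A hypothetical sketch of option 2, building on the close() snippet further up and assuming resources also kept a _subresources list (they currently do not):

def close(self):
    # recursively close subresources first, then drop our own dynamic methods,
    # which breaks the instance <-> bound-method reference cycles
    for subresource in getattr(self, '_subresources', []):
        subresource.close()
    self._subresources = []
    for dynamic_attr in self._dynamic_attrs:
        del self.__dict__[dynamic_attr]
    self._dynamic_attrs = []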
Thank you very much for the thorough analysis. As you already said, a complete fix would require some time to implement, so the temporary fix using a cache along with gc.collect() is good enough for now to keep the memory consumption at bay.
Again, thanks!
@mrc0mmand yes, in your particular case, creating "sheets" only once would make it leak so little that you won't need gc.collect()
@theacodes can you advise about which one of the options in https://github.com/google/google-api-python-client/issues/535#issuecomment-404994715 is most appropriate?
@mcdonc @theacodes If I get a vote I'd like to see the second option, adding a .close() method. I've spent the past week or so tracking down this same memory error and found my way to this page. It happens I specifically looked for close() methods in the Resource objects because I knew something somewhere wasn't being released.
Adding a .close() method seems cleaner than my having to call gc.collect(). Either way I have to do something to cleanup resources and calling .close() is analogous to what we do already for files and other things.
In any case, this issue should be mentioned in the documentation and sample code, please!
I have a cron job, on google app engine, that reads data in from a google sheet. I am noticing the same memory leak (or maybe a different memory leak?). I tried the recommended workarounds: 1) creating the "sheets" object only once, and 2) using gc.collect(). Neither worked in my case. As a test, I changed the few lines of code that read data from a google sheet to read data from a database table, and the memory leak went away.
Can you help me to clarify your last sentence here? Did you ever fix the code, or just confirm the memory leak? I'm in the same situation, app engine job that reads google sheet and started getting "Exceeded soft memory limit" errors. And like you the garbage collection suggestions did not help my situation.
I never fixed it... in the short term, I used a high mem appengine instance so that would take longer to hit the memory threshold and then, as a long term solution, I switched to airtable instead of google sheets.
Got it - I'll look into using airtable instead. Great suggestion. Appreciate the help.
My solution in the end was just to use "pure" HTTP requests.
@AmosDinh Can you elaborate? In my project the memory leak issue has reared its ugly head again and I am looking for new approaches to deal with it.
@hx2A Well, I don't use the API client at all, except for creating credentials. Here is an example from my class.
def buildCredentials(self, refresh=False):
    if refresh:
        self.credentials.refresh(httplib2.Http())
    self.credentials = ServiceAccountCredentials.from_json_keyfile_name(
        "myCreds.json",
        scopes=[
            'https://www.googleapis.com/auth/drive',
            'https://www.googleapis.com/auth/spreadsheets',
        ])
    # service_account_email is defined elsewhere in the original class
    delegated_credentials = self.credentials.create_delegated(
        service_account_email)
    access_token = delegated_credentials.get_access_token().access_token
    self.headers = {
        "Authorization": "Bearer " + access_token,
    }
and
def spreadsheetAppend(self, spreadsheetId, list2d, secondcycle=False):
    valueInputOption = "RAW"
    range = '!A:A'
    url = f"https://sheets.googleapis.com/v4/spreadsheets/{spreadsheetId}/values/{range}:append?valueInputOption={valueInputOption}"
    data = json.dumps({
        "range": range,
        "majorDimension": "ROWS",
        "values": list2d,
    })
    r = requests.post(url, headers=self.headers, data=data)
    if r.status_code > 399:
        print(r, r.headers)
        if r.status_code == 401 and not secondcycle:
            # refresh the token once and retry the append
            self.buildCredentials(refresh=True)
            self.spreadsheetAppend(spreadsheetId, list2d, secondcycle=True)
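For context, a hypothetical way those two methods might be used, assuming they live on a small helper class (SheetsHelper and the spreadsheet ID below are made up):

helper = SheetsHelper()
helper.buildCredentials()
helper.spreadsheetAppend("your-spreadsheet-id", [["name", "score"], ["alice", "10"]])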
@AmosDinh Thank you, I understand now. This is helpful.
@hx2A glad I could help you
Alternatively, you could have the sheets api code in another process which you could terminate after execution / RAM usage hitting a certain threshold. - Just wanted to include that option
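A minimal sketch of that separate-process idea, using the standard library's multiprocessing module; do_sheets_work is a placeholder for whatever builds the service and talks to the API:

import multiprocessing

def do_sheets_work(spreadsheet_id):
    # build the service, read/write the sheet, etc.
    pass

if __name__ == "__main__":
    p = multiprocessing.Process(target=do_sheets_work, args=("your-spreadsheet-id",))
    p.start()
    p.join()  # when the child exits, all of its memory (cycles included) goes back to the OS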
@AmosDinh I'm trying to get this working now but I keep getting 403 'Forbidden' responses. I believe it has something to do with my service account's roles and permissions. Can you tell me about how your service account is configured? My current memory leaking code doesn't seem to be using the service account so I need to be sure it is configured correctly.
Which drive are you accessing? Your personal one, which you can reach by going to https://drive.google.com? In that case you have to add the service account by its email to a folder as editor/owner, then you can edit or create files in that folder using the service account credentials.
I just got it to work. I had multiple problems, but a big part of it was that I needed to share the files on my gdrive with the service account's email address. Thanks!
I had the same memory leak so I just sprinkled gc.collect() everywhere and bam, now it's manageable. I doubt that this would count as a fix though.
@marvic2409 you might have a slow memory leak that leaks a small number of MB every hour. For a decent-sized system, it will take some time to become a problem.
Circling back to this, this recommendation from https://github.com/googleapis/google-api-python-client/issues/535#issuecomment-404994715 is the best way to avoid this.
- Only creating "sheets" once prevents the leak too (if it's created more than once, the leak remains):
creds = get_credentials()
http = creds.authorize(httplib2.Http())
service = discovery.build("sheets", "v4", http=http,
                          discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
sheets = service.spreadsheets()
for i in range(0, 50):
    get_responses(sheets)
    sleep(2)  # no leak
Creating multiple service objects results in (1) potential memory problems and (2) extra time spent refreshing credentials. If you're creating a service object inside a loop, or a function that's called more than once, move it outside the loop/function so it can be reused.
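A before/after sketch of that advice, reusing the placeholder helpers from the snippet above:

# leaks: a new service object (and its dynamic methods) is created on every iteration
for i in range(0, 50):
    service = discovery.build("sheets", "v4", http=http, cache_discovery=False)
    get_responses(service.spreadsheets())

# better: build once, reuse everywhere
service = discovery.build("sheets", "v4", http=http, cache_discovery=False)
sheets = service.spreadsheets()
for i in range(0, 50):
    get_responses(sheets)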
There seems to be a memory leak when using the google-api-client with GSheets.
Environment:
Here's a simple reproducer (without a .client_secret.json):
For measurements I used the memory_profiler module, with the following results:
First and second iteration
Last iteration
There's clearly a memory leak, as the reproducer fetches the same data over and over again, yet the memory consumption keeps rising. Full log can be found here.
As a temporary workaround for one of my long-running applications I use an explicit garbage collector call, which mitigates this issue, at least for now:
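The workaround code itself isn't included above; the idea is simply to force a collection after each batch of calls, e.g. (a sketch using the reproducer's placeholder helpers):

import gc

for i in range(0, 50):
    get_responses(sheets)
    gc.collect()  # reclaim the reference cycles left behind by the call
    sleep(2)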
I went a little deeper, and the main culprit seems to be in the createMethod function when creating the dynamic method batchUpdate:
(This method has a huge docstring.)
Nevertheless, there is probably a reference loop somewhere, as the gc.collect() call manages to collect all those unreachable objects.
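One quick way (not part of the original report) to confirm these are collectable cycles rather than live references is gc.collect()'s return value, which is the number of unreachable objects it found:

import gc

for i in range(0, 5):
    get_responses(sheets)
    print("unreachable objects found:", gc.collect())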