python / cpython

The Python programming language
https://www.python.org

`_abc._abc_subclasscheck` has very poor performance and (I think) a memory leak #92810

Open · samuelcolvin opened this issue 2 years ago

samuelcolvin commented 2 years ago

Bug report

I've been hunting a memory leak in pydantic and I keep seeing _abc_subclasscheck popping up in the memray flamegraph.

Locally I've seen memray report that _abc_subclasscheck is using 2.8GB of memory in some cases!

I can't get anything that bad in a minimal example, but I have the following:

from abc import ABCMeta
from datetime import datetime

class MyMetaclass(ABCMeta):
    pass

class MyClass(metaclass=MyMetaclass):
    pass

def main():
    class Foobar(MyClass):
        pass

    assert issubclass(Foobar, MyClass)
    assert not issubclass(int, MyClass)
    assert not issubclass(str, MyClass)
    assert not issubclass(datetime, MyClass)
    t = type('A', (), {})
    assert not issubclass(t, MyClass)

if __name__ == '__main__':
    import os, psutil
    process = psutil.Process(os.getpid())
    mb = 1024 * 1024
    last = 0
    for i in range(5_000):
        main()
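        # The three commented-out lines below can be uncommented to print the
        # process RSS after each iteration and watch it climb.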
        # mem = process.memory_info().rss
        # print(f'{i + 1:>4d} {mem / mb:8.2f}MB {(mem - last) / mb:+8.2f}MB | {"━" * int(mem / 8_000_000)}')
        # last = mem

A few things to note:

Your environment

sweeneyde commented 2 years ago

From what I can gather, there is no actual leak strictly speaking, but it seems like it might be a corner case where the garbage collector isn't running as often as it perhaps should for this scenario.

Doing something like the script below shows some strange artifacts: the "before collection" case has a widely variable number of weakref objects lingering around, and widely variable performance. However, calling gc.collect() consistently takes the number back down. Does manually adding collect() calls fix your use case?

from abc import ABCMeta
from gc import collect, get_objects
from collections import Counter
from pprint import pprint

class MyMetaclass(ABCMeta):
    pass

class MyClass(metaclass=MyMetaclass):
    pass

def main():
    class Sub(MyClass):
        pass
    assert issubclass(Sub, MyClass)

    class Other:
        pass
    assert not issubclass(Other, MyClass)

def print_object_types(n=5):
    types = Counter(map(type, get_objects())).most_common(n)
    pprint(types, underscore_numbers=True)

N = 1000

for _ in range(1000):
    N += N//10
    for j in range(N):
        main()
    print(N)
    print("before collection:")
    print_object_types()
    while collect():
        pass
    print("after collection:")
    print_object_types()
    print("\n\n")

@pablogsal any ideas?

pablogsal commented 2 years ago

I will investigate later, but in the meantime you can try adjusting the GC thresholds down to force collections more often. Another thing to try is setting the GC debug flags to gc.DEBUG_SAVEALL so that collected cycles are saved and we can inspect which objects are responsible for the memory and why they are forming cycles.
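
A minimal sketch of both suggestions, assuming the reproducer from the first comment (the threshold values below are arbitrary, just lower than the defaults):

import gc
from collections import Counter

# Keep everything the collector would free in gc.garbage instead of freeing it,
# so the objects forming the cycles can be inspected afterwards.
gc.set_debug(gc.DEBUG_SAVEALL)

# Lower the collection thresholds so collections run more often than with the defaults.
gc.set_threshold(100, 5, 5)

main()  # the reproducer from the first comment

gc.collect()
print(Counter(type(o).__name__ for o in gc.garbage).most_common(10))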

samuelcolvin commented 2 years ago

Rather than the GC, and since the problem seems to be more severe when the checks return False, could this be _abc_negative_cache growing indefinitely?
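
One quick way to check that hypothesis is ABCMeta's private debug helper, which prints the caches of a class (private API, so the exact output may differ between versions):

# MyClass and main() are the ones from the reproducer above.
for i in range(3):
    main()
    # Prints _abc_registry, _abc_cache, _abc_negative_cache and
    # _abc_negative_cache_version for MyClass.
    MyClass._dump_registry()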

pablogsal commented 2 years ago

Rather than the GC, and since the problem seems to be more severe when the checks return False, could this be _abc_negative_cache growing indefinitely?

It's possible, but we need some analysis, and unfortunately I am writing this from my phone while hiking, so I cannot test until I am back home :sweat_smile:

I would be surprised if this is a GC problem, to be honest, because that would involve some circular references, and unless I am missing something there should not be growing isolated cycles here.

pablogsal commented 2 years ago

My educated guess is that it is likely the combination of these:

https://github.com/python/cpython/blob/36d42e701f8a4040e7daf70138217038bfde3c78/Modules/_abc.c#L201

https://github.com/python/cpython/blob/36d42e701f8a4040e7daf70138217038bfde3c78/Modules/_abc.c#L195

https://github.com/python/cpython/blob/36d42e701f8a4040e7daf70138217038bfde3c78/Modules/_abc.c#L190

sweeneyde commented 2 years ago

I notice that the pure-Python version uses a WeakSet, whereas the C version uses a plain set of weakrefs.

https://github.com/python/cpython/blob/953ab0795243900ccccaaca069d932730a86fc20/Lib/_py_abc.py#L50
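
For illustration, the difference matters because a WeakSet drops an entry as soon as its referent dies, while a plain set keeps the dead weakref object around until something explicitly clears it (a minimal sketch, unrelated to the actual _abc.c code):

import weakref

class C:
    pass

# WeakSet: a callback removes the entry as soon as the referent is collected.
ws = weakref.WeakSet()
obj = C()
ws.add(obj)
del obj
print(len(ws))  # 0 -- nothing lingers

# Plain set of weakrefs: the now-dead weakref object stays in the set.
s = set()
obj = C()
s.add(weakref.ref(obj))
del obj
print(len(s))           # 1 -- a dead weakref is still stored
print(next(iter(s))())  # None -- its referent is gone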

pablogsal commented 2 years ago

I asked a colleague to run the memray flamegraphs with native introspection while I head back home. They clearly show that my guess was right and that the ABC cache is the one responsible for this problem.

Note: This uses the latest development branch of memray so you may see something slightly different.

Leaks

This shows the memory that is still held at the end of execution.

(memray "leaks" flamegraph image)

Peak

This shows where the chunks of memory that make up the heap at its peak were allocated.

(memray "peak" flamegraph image)

Resident size over time

(memray resident-size-over-time plot image)

You can see in the last plot how the cache is periodically cleaned up (maybe because of GC passes?), but there is something that survives.

zrothberg commented 2 years ago

So I messed around with the Python version of this a bit. Just removing the negative cache from ABCMeta made it around 30x faster and it no longer had a memory leak, taking 2 seconds to do 10,000 iterations. Even manually calling gc.collect() seems unable to free up all the memory.

It grows from 11 MB at startup to 68 MB by the end.

I went further and swapped to a regular set(); this still had a memory leak, but it was about half the size of the original and only went up to 33 MB.
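
(For anyone who wants to repeat this kind of experiment: the pure-Python implementation lives in the private _py_abc module and can be used directly as a metaclass, so its caches are plain WeakSet attributes that are easy to inspect or swap out. A rough sketch:)

# Sketch only: _py_abc is a private module, so details may change between versions.
from _py_abc import ABCMeta as PyABCMeta

class Meta(PyABCMeta):
    pass

class Base(metaclass=Meta):
    pass

class Sub(Base):
    pass

issubclass(int, Base)                  # a miss: walks Sub, then caches the result
print(len(Base._abc_negative_cache))   # 1 -- int is now in Base's negative cache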

VERY VERY VERY strangely, adding the snippet below to __subclasscheck__ right before setting the cache actually speeds up the program by about 3x:

if cls.__name__ == "Foobar":
    pass
else:
    print(f"{cls.__name__} has {len(cls.__subclasses__())} subclasses")

My best guess is the issue is just one of pure volume.

https://github.com/python/cpython/blob/733e15f1707ddec502a69c8c324c77e02ca11fa9/Lib/_py_abc.py#L141-L147

Because you are recursively running the subclass check and recursively caching the values, it's just creating a ton of entries in the weak-reference caches. Bumping the original code up to two child classes of MyClass grinds to an absolute halt.

The real issue is that if there is ever a parent class with many children (like every single item in pydantic), you are going to perform subclass checking on 100+ different child classes and are going to be caching values in all 100+ of them. Because the GC isn't able to catch up during this, MyClass ends up with a few thousand subclasses still attached to it, each of which is recursively gone through. Manually calling the GC cleared the weak references from the __subclasses__ method, but the leaked memory does not recover.
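
A self-contained illustration of that fan-out (the class names are made up for the demo):

from abc import ABCMeta

calls = 0

class CountingMeta(ABCMeta):
    def __subclasscheck__(cls, subclass):
        global calls
        calls += 1
        return super().__subclasscheck__(subclass)

class Base(metaclass=CountingMeta):
    pass

# Create many subclasses, roughly what a large pydantic code base ends up doing.
subclasses = [CountingMeta(f"Sub{i}", (Base,), {}) for i in range(100)]

calls = 0
issubclass(int, Base)  # a miss: every existing subclass is visited recursively
print(calls)           # roughly one __subclasscheck__ call per subclass (~101 here)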

Further, it is actually bugged and doesn't work properly, which is what is stopping this from melting down all the time. If a class was already checked with isinstance, then when it is checked a second time it will never run through all the subclasses, because the cached result is never invalidated and the check is short-circuited at the top. Frankly, that bug probably prevented a lot of major issues.

https://github.com/python/cpython/blob/733e15f1707ddec502a69c8c324c77e02ca11fa9/Lib/_py_abc.py#L115-L121

The cache is never invalidated when new subclasses are created, only when new classes are registered.

@samuelcolvin For pydantic I would just override __subclasscheck__ on the main pydantic metaclass. This should clear up the bulk of the memory leak and likely also increase the actual speed pretty dramatically, since I doubt it needs to check every single child class on a miss. I would presume it in fact should never need to, because all child classes descend directly from the base class.
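
A minimal sketch of what such an override might look like (the class names are stand-ins, and bypassing ABCMeta here also bypasses register() support):

from abc import ABCMeta

class ModelMetaclass(ABCMeta):  # hypothetical pydantic-style metaclass
    def __subclasscheck__(cls, subclass):
        # Skip ABCMeta's caching and its recursive walk over __subclasses__();
        # fall back to the ordinary MRO-based check. Note this ignores classes
        # added via ABCMeta.register().
        return type.__subclasscheck__(cls, subclass)

class BaseModel(metaclass=ModelMetaclass):  # stand-in for pydantic's BaseModel
    pass

class User(BaseModel):
    pass

assert issubclass(User, BaseModel)
assert not issubclass(int, BaseModel)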

corona10 commented 2 years ago

The cache hit/miss ratio looks super low for this scenario; can we switch the caches to an LRU cache?
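
For illustration, a bounded cache with least-recently-used eviction might look roughly like this (a sketch only; a real replacement would also need to hold the classes via weak references, as the current WeakSet-based caches do):

from collections import OrderedDict

class LRUCache:
    """Rough sketch of a bounded set-like cache with LRU eviction."""

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def add(self, key):
        self._data[key] = None
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __contains__(self, key):
        found = key in self._data
        if found:
            self._data.move_to_end(key)     # refresh on lookup
        return found

cache = LRUCache(maxsize=2)
cache.add(int); cache.add(str); cache.add(float)
print(int in cache, str in cache, float in cache)  # False True True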

zrothberg commented 2 years ago

I honestly don't think it will fix the core issue. The main problem is just the way it walks down the class tree. This makes any sufficiently large class hierarchy impossibly slow and causes massive memory consumption. Switching to an LRU cache only solves the memory part, but the above example of 10k comparisons still takes 2 minutes to run on my MacBook, because it causes 10k lookups per issubclass call.

If you are generating classes, the bulk of which never get compared in any natural way, having them pulled into their parent's issubclass call is unintuitive at best. Frankly, I am not sure I fully understand the intent of it. I presume it's there for register() calls, but if it is not, and the intent is to allow subclasses to specialize how they check, then it should just be considered bugged.

iritkatriel commented 1 year ago

See also https://github.com/python/cpython/issues/94284.