Memory leak somewhere in drf-spectacular #597

sergei-maertens commented 3 years ago

Describe the bug

Using the SpectacularJSONAPIView (https://github.com/open-formulieren/open-forms/blob/master/src/openforms/api/urls.py#L59), we are observing a memory leak. Memory of the process keeps growing every time the schema endpoint is hit.

We noticed this because the API schema endpoint was configured as the Kubernetes pod health check, and the container was getting OOM-killed in a very regular pattern. This happens with two of our apps, both on drf-spectacular 0.17.2, and we confirmed that 0.20.2 does not fix it (yet).

To Reproduce

1. Set up SpectacularJSONAPIView.
2. Start the server (manage.py runserver works, but this has also been reproduced with uwsgi).
3. Find the PID of the server process.
4. In a separate shell/tab, monitor the memory usage of that process with top: run top -p <PID> and hit e to show the VIRT/RES memory in megabytes.
5. Fire requests to the schema endpoint, e.g. curl http://localhost:8000/api/v1/
6. Observe that the memory usage of the process increases.
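
As an in-process complement to watching top, here is a minimal sketch that uses the standard-library tracemalloc module to record how many traced bytes are still allocated after each call. The function leaky_handler and the list _leak are hypothetical stand-ins for one schema request and whatever state survives it; they are not drf-spectacular code.

```python
import gc
import tracemalloc

_leak = []  # hypothetical stand-in for state that survives between requests


def leaky_handler():
    """Stand-in for one schema request; keeps roughly 100 kB alive per call."""
    _leak.append(bytearray(100_000))


def retained_per_call(fn, calls=5):
    """Return the traced bytes still allocated after each call to fn."""
    tracemalloc.start()
    samples = []
    for _ in range(calls):
        fn()
        gc.collect()  # rule out garbage that is merely waiting for collection
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples


samples = retained_per_call(leaky_handler)
# Difference between consecutive samples = bytes retained by each extra call.
growth = [later - earlier for earlier, later in zip(samples, samples[1:])]
print(growth)
```

In a real project, fn could be a callable that requests the schema endpoint through Django's test client. Steady growth after gc.collect() points at objects that are still referenced rather than at uncollected garbage.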

I'll see if I can find the time to set up a minimal reproducing project without any extra dependencies.

Additionally, I did some debugging with the mem_top package (taken from https://github.com/GemeenteUtrecht/zaakafhandelcomponent/issues/490):

refs:
5742    <class 'list'> ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018  Paul T. McGuire\n', '#\n', '# Permiss
5404    <class 'dict'> {PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/rollup.py'):
3364    <class 'dict'> {140013453740288: <weakref at 0x7f576b8d4db0; to 'type' at 0x7f576c2c8900 (type)>, 140013453737696: 
2562    <class 'dict'> {<weakref at 0x7f576512bf90; to 'ActivityViewSet' at 0x7f576a7f40a0>: <drf_spectacular.utils.extend_
2514    <class 'dict'> {(('urn:oasis:names:tc:opendocument:xmlns:animation:1.0', 'audio-level'), None): <function cnv_doubl
2180    <class 'dict'> {'sys': <module 'sys' (built-in)>, 'builtins': <module 'builtins' (built-in)>, '_frozen_importlib': 
2180    <class 'tuple'> (<module 'MarkupPy' from '/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/MarkupPy/__init__.p
2041    <class 'frozenset'> frozenset({PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/ro
1526    <class 'list'> ['"""Thread module emulating a subset of Java\'s threading model."""\n', '\n', 'import os as _os\n',
1311    <class 'dict'> {'__name__': 'lib', '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, '_or

bytes:
147552   {'sys': <module 'sys' (built-in)>, 'builtins': <module 'builtins' (built-in)>, '_frozen_importlib': 
73816    {140013453740288: <weakref at 0x7f576b8d4db0; to 'type' at 0x7f576c2c8900 (type)>, 140013453737696: 
73816    {PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/rollup.py'):
65752    frozenset({PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/ro
47160    ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018  Paul T. McGuire\n', '#\n', '# Permiss
36960    {'__module__': 'lxml.etree', '__doc__': 'Libxml2 error types', '__dict__': <attribute '__dict__' of 
36960    {'application/javascript': ['.js', '.mjs'], 'application/json': ['.json'], 'application/manifest+jso
36960    {(('urn:oasis:names:tc:opendocument:xmlns:animation:1.0', 'audio-level'), None): <function cnv_doubl
36960    {'CRYPTOGRAPHY_PACKAGE_VERSION': <cdata 'char *' 0x7f57665bf000>, 'Cryptography_HAS_EC2M': 1, 'Crypt
36960    {'__name__': 'lib', '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, '_or

types:
89589    <class 'dict'>
45074    <class 'list'>
35322    <class 'tuple'>
33857    <class 'function'>
18048    <class 'collections.OrderedDict'>
10647    <class 'weakref'>
9547     <class 'cell'>
5134     <class 'type'>
4784     <class 'pathlib.PosixPath'>
4183     <class 'django.core.validators.ProhibitNullCharactersValidator'>

After a couple more curl requests, the refs/bytes of the drf-spectacular-related data structures have increased:

refs:
5742    <class 'list'> ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018  Paul T. McGuire\n', '#\n', '# Permiss
5404    <class 'dict'> {PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/rollup.py'):
4354    <class 'dict'> {<weakref at 0x7f576512bf90; to 'ActivityViewSet' at 0x7f576a7f40a0>: <drf_spectacular.utils.extend_
3364    <class 'dict'> {140013453740288: <weakref at 0x7f576b8d4db0; to 'type' at 0x7f576c2c8900 (type)>, 140013453737696: 
2514    <class 'dict'> {(('urn:oasis:names:tc:opendocument:xmlns:animation:1.0', 'audio-level'), None): <function cnv_doubl
2180    <class 'dict'> {'sys': <module 'sys' (built-in)>, 'builtins': <module 'builtins' (built-in)>, '_frozen_importlib': 
2180    <class 'tuple'> (<module 'MarkupPy' from '/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/MarkupPy/__init__.p
2041    <class 'frozenset'> frozenset({PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/ro
1526    <class 'list'> ['"""Thread module emulating a subset of Java\'s threading model."""\n', '\n', 'import os as _os\n',
1311    <class 'dict'> {'__name__': 'lib', '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, '_or

bytes:
147552   {'sys': <module 'sys' (built-in)>, 'builtins': <module 'builtins' (built-in)>, '_frozen_importlib': 
73816    {140013453740288: <weakref at 0x7f576b8d4db0; to 'type' at 0x7f576c2c8900 (type)>, 140013453737696: 
73816    {PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/rollup.py'):
73816    {<weakref at 0x7f576512bf90; to 'ActivityViewSet' at 0x7f576a7f40a0>: <drf_spectacular.utils.extend_
65752    frozenset({PosixPath('/home/bbt/.virtualenvs/zac/lib/python3.9/site-packages/elasticsearch/client/ro
47160    ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018  Paul T. McGuire\n', '#\n', '# Permiss
36960    {'__module__': 'lxml.etree', '__doc__': 'Libxml2 error types', '__dict__': <attribute '__dict__' of 
36960    {'application/javascript': ['.js', '.mjs'], 'application/json': ['.json'], 'application/manifest+jso
36960    {(('urn:oasis:names:tc:opendocument:xmlns:animation:1.0', 'audio-level'), None): <function cnv_doubl
36960    {'CRYPTOGRAPHY_PACKAGE_VERSION': <cdata 'char *' 0x7f57665bf000>, 'Cryptography_HAS_EC2M': 1, 'Crypt

types:
133115   <class 'dict'>
64550    <class 'list'>
42050    <class 'tuple'>
35333    <class 'function'>
30305    <class 'collections.OrderedDict'>
12517    <class 'weakref'>
10749    <class 'cell'>
6997     <class 'django.core.validators.ProhibitNullCharactersValidator'>
6967     <class 'rest_framework.validators.ProhibitSurrogateCharactersValidator'>
6494     <class 'drf_spectacular.plumbing.ResolvedComponent'>
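
A "types" comparison like the two tables above can be approximated with the standard library alone. This is a sketch assuming that the table is a count of live gc-tracked objects by type; ResolvedThing and _retained are hypothetical names that simulate a leaked component, not drf-spectacular classes.

```python
import gc
from collections import Counter


def type_counts():
    """Count live gc-tracked objects by type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


class ResolvedThing:
    """Hypothetical stand-in for a leaked schema component."""


_retained = []  # simulates a container that keeps objects alive

before = type_counts()
_retained.extend(ResolvedThing() for _ in range(1000))
after = type_counts()

# Counter subtraction keeps only the types whose count went up.
grown = (after - before).most_common(3)
print(grown)
```

Taking the snapshot before and after a batch of requests, instead of around the extend call, shows which types accumulate between requests.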

Expected behavior

There should not be memory leaks.

sergei-maertens commented 3 years ago

Memory usage graph from our deployment on k8s:

tfranzel commented 3 years ago

Hey @sergei-maertens,

So much for relying on Python's GC to clean things up. Since we do not use any C modules, there is not really a free() to call :smile:

How much does the memory footprint increase per API call?
Is this a large API?

I tried to replicate with your instructions and saw some increase, but nothing as big as 500 MB of dangling memory. We do make a lot of calls into DRF and Django internals, but nothing problematic sticks out yet. They do make a fair amount of use of weak references, though. I suppose we need to find the references that block the GC from cleaning up and convert those to weak references. That would be my initial guess. We definitely need to break this down further and get an understanding of where the leakage is coming from.
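
To illustrate the guess about converting blocking references into weak ones, here is a minimal sketch with the standard-library weakref module. Component, strong_registry and weak_registry are hypothetical names; the sketch does not claim to show where drf-spectacular actually holds such references, and the immediate cleanup relies on CPython's reference counting.

```python
import gc
import weakref


class Component:
    """Hypothetical object that a registry might keep alive by accident."""

    def __init__(self, name):
        self.name = name


strong_registry = {}                              # ordinary dict: holds strong references
weak_registry = weakref.WeakValueDictionary()     # entries vanish with their values


def register(registry, name):
    """Store a new Component and return only a weak reference to it."""
    obj = Component(name)
    registry[name] = obj
    return weakref.ref(obj)


strong_ref = register(strong_registry, "a")
weak_ref = register(weak_registry, "b")
gc.collect()

strong_alive = strong_ref() is not None  # the dict still references the object
weak_alive = weak_ref() is not None      # nothing strong is left, so it is freed
print(strong_alive, weak_alive, len(weak_registry))
```

The object in the ordinary dict stays alive until its key is deleted, which is the kind of reference that would have to be found and replaced.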

Sidenote:

I thought about caching the response before, but there has not been a demand for it yet. In the basic case, the schema is static per server instance, but if you use the 'SERVE_PUBLIC' : False (partial schema for which you have access) or i18n features, it gets more complicated.
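
A framework-free sketch of why those features complicate caching: the cache key has to include everything that can change the output. The function build_schema and the choice of key (language, public flag, user id) are assumptions for illustration, not drf-spectacular behaviour.

```python
from functools import lru_cache

build_calls = 0  # counts how often the expensive generation really runs


def build_schema(language, public):
    """Hypothetical stand-in for an expensive schema generation."""
    global build_calls
    build_calls += 1
    return {"language": language, "public": public}


@lru_cache(maxsize=32)
def cached_schema(language, public, user_id):
    return build_schema(language, public)


def get_schema(language, public, user_id=None):
    # A public schema is identical for everyone, so the user is dropped from
    # the key; a non-public schema is cached separately for each user.
    return cached_schema(language, public, None if public else user_id)


get_schema("en", True, user_id=1)
get_schema("en", True, user_id=2)    # same key as the call above
get_schema("nl", True, user_id=1)    # different language, new entry
get_schema("en", False, user_id=1)   # non-public, keyed per user
get_schema("en", False, user_id=2)
print(build_calls)
```

The cached dict is shared between callers, so a real implementation would cache the rendered response or return copies.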

sergei-maertens commented 3 years ago

How much does the memory footprint increase per API call?

I'm observing between 0.5 and 5MB. The weird thing is that it's not consistent between calls. We of course have a base memory usage: app startup sits at around 300MB, and over time it just keeps growing until the memory limit is hit.

One example of an API schema can be found here: https://github.com/open-formulieren/open-forms/blob/master/src/openapi.yaml. Can I send you a link somewhere privately so you can see the API schema yourself without me having to paste it here publicly?

We definitely need to break this down further and get an understanding where the leakage is coming from.

That I agree with 100%! I mostly wanted to get it reported in case other people also experience issues and as a self-reminder.

I thought about caching the response before, but there has not been a demand for it yet. In the basic case, the schema is static per server instance, but if you use the 'SERVE_PUBLIC' : False (partial schema for which you have access) or i18n features, it gets more complicated.

Eh, at the view level we can leverage django.core.cache around it, so it's not necessarily a library-feature that's needed at the moment.

tfranzel commented 3 years ago

I'm observing between 0.5 and 5MB

I saw something in that range, but it also did not behave linearly.

Can I send you a link somewhere privately so you can see the API schema yourself without me having to paste it here publicly?

Thanks, but not necessary yet. Just wanted to get a feel for how large it is. I would call open-forms mid-sized.

I mostly wanted to get it reported in case other people also experience issues and as a self-reminder.

:+1: Let me know if you find anything pointing in a specific direction! I will dig deeper when I can allot some time, but any help is appreciated.

ngnpope commented 2 years ago

@sergei-maertens Was this issue encountered when using Python 3.10? If so, 3.10.2 was recently released and addresses a memory leak that I think could significantly affect drf-spectacular.

It'd be interesting to see if you can generate some details again from before (3.10.1) and after (3.10.2) the fix.

sergei-maertens commented 2 years ago

Unfortunately it's not that simple. This is seen on Python 3.8 and Python 3.9 :(

StopMotionCuber commented 2 years ago

I'm having some issues with max recursion depth in postprocessing hooks (after calling the schema endpoint several times), which may be related to your issue. Could you try disabling the hooks, calling the schema endpoint a few times and observing whether it still leaks memory?

I'll open a separate issue for my problem within the next few days, but I still need to figure out a minimal reproducible example for it.
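
For the experiment suggested above, the hooks can be switched off in the Django settings. This fragment assumes the POSTPROCESSING_HOOKS key of SPECTACULAR_SETTINGS; check the defaults of the installed drf-spectacular version before relying on it.

```python
# settings.py (fragment): merge this key into any existing SPECTACULAR_SETTINGS.
SPECTACULAR_SETTINGS = {
    # The default list contains "drf_spectacular.hooks.postprocess_schema_enums";
    # an empty list disables all postprocessing hooks for the experiment.
    "POSTPROCESSING_HOOKS": [],
}
```

With the hooks disabled the generated schema will differ (for example in how enums are named), so this is a diagnostic setting rather than a fix.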

tfranzel commented 2 years ago

I'm having some issues with max recursion depth with postprocessing hooks

@StopMotionCuber any insight would be appreciated! Though I have trouble understanding where it could possibly happen in the postprocessing. The postprocessing framework is conceptually super simple and not much magic is going on there. Of course I cannot judge whether you do something funky in your custom hooks.

The default postprocess_schema_enums - although complicated - only performs basic iterative operations and is largely self-contained. There is a recursion in there but it only traverses the schema tree once. So unless you have 100 levels deep oneOf structures, which would be very unlikely, this should never happen.
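
The argument can be made concrete with a small sketch: a single recursive pass over a schema-like tree uses stack frames in proportion to the nesting depth, not to the number of times the endpoint is called. The function max_depth and the sample schema are illustrative assumptions, not the code of postprocess_schema_enums.

```python
import sys


def max_depth(node, depth=0):
    """Return the deepest nesting level reached while walking a schema-like tree."""
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return depth  # leaf value: nothing further to descend into
    # One extra recursion level (one stack frame) per dict or list level.
    return max((max_depth(child, depth + 1) for child in children), default=depth)


schema = {"oneOf": [{"type": "string"}, {"oneOf": [{"type": "integer"}]}]}
depth = max_depth(schema)
print(depth, sys.getrecursionlimit())
```

If a recursion error only appears after several calls, that suggests state growing between calls (for example a structure that gets nested more deeply each time) rather than the depth of the schema itself.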

tfranzel commented 2 years ago

@sergei-maertens are you by any chance using the rollup blueprint (https://github.com/tfranzel/drf-spectacular/blob/master/docs/blueprints/rollup.py)?

sergei-maertens commented 2 years ago

I don't think so, but we are doing something similar with other flavors of polymorphism, we're definitely resolving components!

tfranzel commented 2 years ago

There might be a leak there, but since you are not using it, never mind... "these are not the droids you are looking for" :smile: