Closed ac000 closed 1 week ago
Now I have dumb questions about whether we're now leaking memory. Sounds like it should be balanced (given the decref) but if decref was only blowing up in some specific cases, that implies that in other cases the refcount was reasonable?
Now I have dumb questions about whether we're now leaking memory. Sounds like it should be balanced (given the
The short answer is, no.
The object/variable is allocated once at module start. Any memory for it that may linger will be returned when the process exits, regardless of what the reference count may be, though we are exiting, we do try and clean up properly...
decref) but if decref was only blowing up in some specific cases, that implies that in other cases the refcount was reasonable?
Possibly for most Python versions the reference stuff could handle an imbalanced decref.
Let me poke at it with GDB and see if I can find what the reference count for this thing actually is at various points...
Yes, apologies, this patch is just masking over the issue. In fact I've no idea why this even gets rid of the crash.
Looking with gdb(1) with the added incref in both cases (failure and non-failure) the reference count on the object goes to two, then to one after the decref. For whatever reason, the fact that it still has a reference when calling Py_Finalize()
in the Python 3.11.9/10 case avoids the crash.
However, I believe the following is the real issue (though not 100% convinced it's all (or any of) our fault...)
Looking at the two important lines from the error output
2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
The first line shows we are tripping up in the following code in src/python/nxt_python.c::nxt_python_set_target()
463 if (is_factory) {
464 if (nxt_slow_path(PyCallable_Check(obj) == 0)) {
465 nxt_alert(task,
466 "factory \"%s\" in module \"%s\" "
467 "can not be called to fetch callable",
468 callable, module_name);
469 goto fail;
470 }
That is due to
19 wsgi_invalid_callable = None
from test/python/factory/wsgi.py
The important thing here is that obj
is a _PyNone_Type
object and doing a decref on it is what seems to lead to the crash.
The fail
label takes us to
500 fail:
501
502 Py_XDECREF(obj);
503 Py_XDECREF(module);
504
505 return NXT_ERROR;
506 }
If we comment out the Py_XDECREF(obj);
, no crash.
That marries up with the second error line
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
This is also demonstrable with the following test program
/*
* py-none-decref-test.c - Demonstrate decref on Py_None issue
*
* gcc $(python3-config --embed --cflags) -Wall -Wextra -O0 -g -o py-none-decref-test py-none-decref-test.c $(python3-config --embed --libs --ldflags)
*/
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
int opt;
bool decref = false;
PyObject *obj;
while ((opt = getopt(argc, argv, "d")) != -1) {
switch (opt) {
case 'd':
decref = true;
break;
default:
fprintf(stderr, "Usage: %s [-d]\n", argv[0]);
exit(EXIT_FAILURE);
}
}
Py_InitializeEx(0);
obj = Py_None;
if (decref)
Py_DECREF(obj);
Py_Finalize();
exit(EXIT_SUCCESS);
}
With Python 3.12.5
$ ./py-none-decref-test
$ ./py-none-decref-test -d
$
No issues.
With Python 3.11.10
$ ./py-none-decref-test
$ ./py-none-decref-test -d
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: finalizing (tstate=0x00007f70a7e8a718)
Current thread 0x00007f70a80ddb08 (most recent call first):
<no Python frame>
Aborted (core dumped)
$
And there's our error message.
(Just to be clear, the crash is happening from the call to Py_Finalize()
)
After
34 obj = Py_None;
obj
looks like
(gdb) p *obj
$1 = {
ob_refcnt = 3662,
ob_type = 0x7ffff7e59cc0 <_PyNone_Type>
}
Notice the somewhat high reference count.
However not as high as it is under 3.12
(gdb) p *obj
$1 = {
{
ob_refcnt = 4294967295,
ob_refcnt_split = {4294967295, 0}
},
ob_type = 0x7ffff7d56560 <_PyNone_Type>
}
3.12 introduced the idea of immortality for certain often used objects including None
. We also have some extra refcnt tracking...
With Python 3.11.10 after
36 Py_DECREF(obj);
(gdb) p *obj
$2 = {
ob_refcnt = 3661,
ob_type = 0x7ffff7e59cc0 <_PyNone_Type>
}
OK, it decremented the refcnt, but wasn't happy doing so...
With Python 3.12.5
(gdb) p *obj
$2 = {
{
ob_refcnt = 4294967295,
ob_refcnt_split = {4294967295, 0}
},
ob_type = 0x7ffff7d56560 <_PyNone_Type>
}
No effect...
With Python 3.7.3
$ ./py-none-decref-test
$ ./py-none-decref-test -d
$
Before the decref
(gdb) p *obj
$3 = {
ob_refcnt = 4212,
ob_type = 0x7ffff7f3a500 <_PyNone_Type>
}
After...
$4 = {
ob_refcnt = 4211,
ob_type = 0x7ffff7f3a500 <_PyNone_Type>
}
Decremented it, but didn't cause an issue...
I currently have two thoughts...
1) Leave it, it's a bug in Python, it only effects two of the many versions we test on. It only happens at module startup and only if you're doing something obviously wrong.
2) We perhaps shouldn't be trying to decref None
objects, I'd like to find a definitive answer on that, (though most of the versions we test on handle it) and add a check for None
objects.
This does then beg further questions
1) Do we just add a check in the know problem location 2) Do we add checks all over the place 3) Do we wrap Py_XDECREF() and just do the check in there
Oh, and the crash isn't due to a NULL pointer dereference or anything like that, it's simply due to Python calling abort(3)...
(gdb) bt
#0 __restore_sigs (set=set@entry=0x7ffef6ab8760)
at ./arch/x86_64/syscall_arch.h:40
#1 0x00007fe47cd71702 in raise (sig=sig@entry=6) at src/signal/raise.c:11
#2 0x00007fe47cd40be8 in abort () at src/exit/abort.c:11
#3 0x00007fe47c8eb1b5 in fatal_error_exit (status=-1)
at Python/pylifecycle.c:2626
#4 fatal_error (fd=fd@entry=2, header=header@entry=0,
prefix=prefix@entry=0x0, msg=msg@entry=0x0, status=status@entry=-1)
at Python/pylifecycle.c:2737
#5 0x00007fe47c8eb9ed in _Py_FatalErrorFormat (
func=0x7fe47cabcf08 <__func__.3.lto_priv.0> "none_dealloc",
format=0x7fe47ca628f0 "%s: bug likely caused by a refcount error in a C extension") at Python/pylifecycle.c:2857
...
After chatting on the #python IRC channel, it seems we should not be DECREFing without having done an INCREF, even though this None
object has a way above 0 refcnt to start with.
Now to see the best way to deal with this...
So in our actual case the problem is that we do
451 obj = PyDict_GetItemString(PyModule_GetDict(module), callable);
PyDict_GetItemString()
returns a borrowed reference. On which we then end up doing
502 Py_XDECREF(obj);
Which then blows up in Py_Finalize()
.
I think this is the simplest solution
diff --git a/src/python/nxt_python.c b/src/python/nxt_python.c
index 7bbf3d49..c7ba0a90 100644
--- a/src/python/nxt_python.c
+++ b/src/python/nxt_python.c
@@ -462,6 +462,7 @@ nxt_python_set_target(nxt_task_t *task, nxt_python_target_t *target,
"factory \"%s\" in module \"%s\" "
"can not be called to fetch callable",
callable, module_name);
+ Py_INCREF(obj); /* borrowed reference */
goto fail;
}
Re-work patch for proper fix
$ git range-diff --creation-factor=150 c9860dc2...50b1aca3
1: c9860dc2 ! 1: 50b1aca3 python: Add a missing Py_INCREF() in nxt_python_start()
@@ Metadata
Author: Andrew Clayton <a.clayton@nginx.com>
## Commit message ##
- python: Add a missing Py_INCREF() in nxt_python_start()
+ python: Don't decrement a reference to a borrowed object
On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
- triggered by Py_Finalize() in nxt_python_start() when running one of our
- pytests, namely
+ triggered by Py_Finalize() in nxt_python_atexit() when running one of
+ our pytests, namely
test/test_python_factory.py::test_python_factory_invalid_callable_value
2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable
1: c9860dc2 ! 1: 50b1aca3 python: Add a missing Py_INCREF() in nxt_python_star
t()
@@ Metadata
Author: Andrew Clayton <a.clayton@nginx.com>
## Commit message ##
- python: Add a missing Py_INCREF() in nxt_python_start()
+ python: Don't decrement a reference to a borrowed object
On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
- triggered by Py_Finalize() in nxt_python_start() when running one of ou
r
- pytests, namely
+ triggered by Py_Finalize() in nxt_python_atexit() when running one of
+ our pytests, namely
test/test_python_factory.py::test_python_factory_invalid_callable_value
2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable"
in module "wsgi" can not be called to fetch callable
@@ Commit message
<no Python frame>
2024/09/12 15:07:29 [alert] 5451#5451 app process 5452 exited on signal 6 (core dumped)
- It seems this was caused by
+ This was due to
- Py_XDECREF(nxt_py_stderr_flush);
+ obj = PyDict_GetItemString(PyModule_GetDict(module), callable);
- in nxt_python_atexit().
+ in nxt_python_set_target() which returns a *borrowed* reference, then
+ due to the test meaning this is a `None` object we `goto fail` and call
- We never explicitly INCREF() this variable (though there are plenty of
- other cases of doing DECREF()'s with no previous INCREF()'s). However in
- this case adding an INCREF() for this variable does fix the crash.
+ Py_DECREF(obj);
- I'm not sure we really need to reference count this variable, it's only
- used in nxt_python_print_exception(), but it's perhaps the safest course
- of action.
+ which then causes `Py_Finalize()` to blow up.
+
+ The simple fix is to just increment its reference count before the `goto
+ fail`.
+
+ Note: This problem only showed up under (the various versions of Python
+ we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9,
+ 3.10, 3.12
Cc: Konstantin Pavlov <thresh@nginx.com>
Closes: https://github.com/nginx/unit/issues/1413
+ Fixes: a9aa9e76d ("python: Support application factories")
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
## src/python/nxt_python.c ##
-@@ src/python/nxt_python.c: nxt_python_start(nxt_task_t *task, nxt_process_data_t *data)
- goto fail;
- }
+@@ src/python/nxt_python.c: nxt_python_set_target(nxt_task_t *task, nxt_python_target_t *target,
+ "factory \"%s\" in module \"%s\" "
+ "can not be called to fetch callable",
+ callable, module_name);
++ Py_INCREF(obj); /* borrowed reference */
+ goto fail;
+ }
-+ Py_INCREF(nxt_py_stderr_flush);
-+
- if (nxt_slow_path(nxt_python_set_path(task, c->path) != NXT_OK)) {
- goto fail;
- }
On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash triggered by Py_Finalize() in nxt_python_atexit() when running one of our pytests, namely test/test_python_factory.py::test_python_factory_invalid_callable_value
This was due to
in nxt_python_set_target() which returns a borrowed reference, then due to the test meaning this is a
None
object wegoto fail
and callwhich then causes
Py_Finalize()
to blow up.The simple fix is to just increment its reference count before the
goto fail
.Note: This problem only showed up under (the various versions of Python we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9, 3.10, 3.12