pydantic / pydantic

Memory segfaults after V2 upgrade #7211

Closed StasEvseev closed 3 months ago

StasEvseev commented 11 months ago

Description

Thanks for an amazing project! We have been using pydantic for a couple of years, and it has become a standard building block of our codebase.

Everything seemed to work until we upgraded to v2. Since then we have been seeing segfaults in our production environment.

We haven't found a way to reproduce it locally, so we can't provide more detail than the logs from our production environment.

Our setup:

These are the segfaults we are seeing in production:

segfault at 0 ip 000078b1d0ff6f1c sp 00007ffe971366c0 error 6 in libpython3.11.so.1.0[78b1d0eed000+1bb000]
segfault at 100 ip 00007ddfb64fe349 sp 00007ffc351395b0 error 4 in _pydantic_core.cpython-311-x86_64-linux-gnu.so
segfault at e4 ip 00000000000000e4 sp 00007ffc30d59a58 error 14
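
For readers decoding kernel lines like these: on x86 the `error` field is the page-fault error code, printed in hexadecimal. Below is a minimal sketch of decoding it. The function name is made up for illustration, and only the five lowest bits are handled.

```python
def decode_page_fault_error(code_hex: str) -> list[str]:
    """Decode the low bits of an x86 page-fault error code (hypothetical helper)."""
    code = int(code_hex, 16)  # the kernel prints the code in hexadecimal
    flags = [
        # bit 0: set for a protection violation, clear for a not-present page
        "protection violation" if code & 0x01 else "page not present",
        # bit 1: set for a write access, clear for a read
        "write" if code & 0x02 else "read",
        # bit 2: set when the fault happened in user mode
        "user mode" if code & 0x04 else "kernel mode",
    ]
    if code & 0x08:
        flags.append("reserved bit set")
    if code & 0x10:
        flags.append("instruction fetch")
    return flags


for code in ("6", "4", "14"):
    print(code, decode_page_fault_error(code))
```

Read this way, the three lines above are a write to address 0 inside libpython, a read at address 0x100 inside _pydantic_core, and an attempt to execute code at address 0xe4.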

Example Code

No response

Python, Pydantic & OS Version

/usr/src/app# python -c "import pydantic.version; print(pydantic.version.version_info())"
             pydantic version: 2.2.1
        pydantic-core version: 2.6.1
          pydantic-core build: profile=release pgo=true
                 install path: /usr/local/lib/python3.11/site-packages/pydantic
               python version: 3.11.3 (main, May 23 2023, 13:34:03) [GCC 10.2.1 20210110]
                     platform: Linux-5.15.49-linuxkit-x86_64-with-glibc2.31
     optional deps. installed: ['email-validator', 'typing-extensions']

Selected Assignee: @dmontagu

samuelcolvin commented 11 months ago

Thanks for reporting this.

The most likely explanation for this is that pydantic v2 is using more memory than v1, and when you exceed the memory available seg faults occur.

Maybe try checking memory usage right before the segfault, or running fewer workers to see if the segfaults stop.

I don't know of any other reason why pydantic V2 should segfault, if you can give us more detail, we'll investigate immediately.

StasEvseev commented 11 months ago

Hey @samuelcolvin ! Thanks for a quick reply.

We ran an experiment on prod to collect more info about the segmentation faults. We enabled faulthandler, which prints every thread and its traceback whenever a segfault occurs.
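
For anyone wanting to repeat this experiment, here is a minimal sketch of enabling faulthandler at process startup. The helper name and the log path are assumptions for illustration; the thread does not show the exact setup used in production.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Dump the Python traceback of every thread to log_path on a fatal signal.

    Hypothetical helper: the file object must stay open for the lifetime of
    the process, so it is returned to the caller instead of being closed.
    """
    log_file = open(log_path, "a")
    # all_threads=True dumps every thread, not only the one that crashed
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Example call early in worker startup (the path is an assumption):
# crash_log = enable_crash_tracebacks("/tmp/python-crash.log")
```

The same effect is available without code changes by setting the environment variable PYTHONFAULTHANDLER=1 or running with python -X faulthandler, both of which write to stderr. Note that faulthandler only reports Python-level frames, not native CPython or Rust frames.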

We captured three cases. In two of them the segfault occurred while the current thread was holding the GIL and running a garbage collection cycle; in the third the thread was just holding the GIL.

We also got core dumps from the machine, but they don't help much; it is hard to read what is going on in the runtime:

#0  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  0x00007ba01d5b5a2e in ?? () from /usr/local/bin/../lib/libpython3.11.so.1.0
#3  0x000056b57074d340 in ?? ()
#4  0x00007ba01d87ef28 in _PyRuntime () from /usr/local/bin/../lib/libpython3.11.so.1.0
#5  0x0000000000000001 in ?? ()
#6  0x00007ba01cf7cd60 in ?? ()
#7  0x601a6ea5c71a3b00 in ?? ()
#8  0x00007ba01807a730 in ?? ()
#9  0x00007ba0045e82c0 in ?? ()
#10 0x00007ba01807a6a8 in ?? ()
#11 0x00007ba01d8a7558 in _PyRuntime () from /usr/local/bin/../lib/libpython3.11.so.1.0
#12 0x00007ba01807a728 in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x00007ba01807a6a8 in ?? ()
#15 0x00007ba01d5c0e87 in _PyEval_EvalFrameDefault () from /usr/local/bin/../lib/libpython3.11.so.1.0
#16 0x00007ba01d5bcf52 in ?? () from /usr/local/bin/../lib/libpython3.11.so.1.0
#17 0x00007ba01d5db4da in ?? () from /usr/local/bin/../lib/libpython3.11.so.1.0

Do you think this is something you can work with? I can confirm that physical memory usage never reached the limit we had set for the container. Does that answer the question about memory pressure?

davidhewitt commented 11 months ago

Do you run very recursive models? It may not be heap memory but stack overflow.

Does it reproduce if you update to Python 3.11.4? I see no mention of segfault fixes in the 3.11.4 changelog though, so I would guess it won't help.

The fact that the crash is different each time feels a little bit like memory corruption to me. We use very limited unsafe Rust in pydantic-core; I'll audit this and also see if a valgrind run yields anything.

Alternatively, is it possible for you to run with debug-instrumented Python and pydantic-core versions so the core dumps are more useful? I can potentially help with configuring a custom pydantic-core build to contain debug info, for Python it depends how your production is deployed.

davidhewitt commented 11 months ago

In https://github.com/pydantic/pydantic-core/pull/922 I've gone through the unsafe code used in pydantic-core and either eliminated or justified each use.

StasEvseev commented 11 months ago

@davidhewitt Thanks for the reply!

By running debug-instrumented Python, do you mean running gunicorn with the python3d binary? Like so: python3d -m gunicorn .... If I could get some guidance on how to do that, that would be amazing! Unfortunately the issue is hard to reproduce. I can try to simulate certain things, though.

davidhewitt commented 10 months ago

python3d might be overkill because it adds a lot of assertions and I believe there may be some compatibility issues for pydantic-core anecdotally from other threads (I might try to verify this in CI sometime). Also no prebuilt wheels exist so you'll have to compile all your native dependencies.

It would be a great start if you can download or build your CPython with debug symbols included so the core dumps are much more readable. Potentially you could also build your own pydantic-core from source with debug symbols included there too.

StasEvseev commented 9 months ago

Hey @davidhewitt !

How can we instrument our Python with debug symbols? Like build it with extra CFLAGS:

-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer

Can you also help me with building pydantic-core with debug symbols included?

Thanks!

davidhewitt commented 9 months ago

@StasEvseev I just built my own 3.12 interpreter from source using just the "optimized" configure options here: https://devguide.python.org/getting-started/setup-building/#optimization

This contained debug information, so it looks like the debug info stripping is probably done by your distro packager.

~/dev/cpython$ ./python --version
Python 3.12.0
~/dev/cpython$ file ./python
./python: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7487db6f0e6d73eda7cb2dbddb39706d3658e7b3, for GNU/Linux 3.2.0, with debug_info, not stripped

Which linux distribution are you using? There may be an optional package to install python debug info alongside the main executable, as an alternative to building from source. (That said, I could not see one for ubuntu.)


As for pydantic-core, you just need to have the environment variables CARGO_PROFILE_RELEASE_STRIP=false and CARGO_PROFILE_RELEASE_DEBUG=limited (source) set during a build from source. So clone the repo, check out the tag which matches your pydantic version, and run one of the two make tasks below:

CARGO_PROFILE_RELEASE_STRIP=false CARGO_PROFILE_RELEASE_DEBUG=limited make build-prod

# or if you want fully-optimized
CARGO_PROFILE_RELEASE_STRIP=false CARGO_PROFILE_RELEASE_DEBUG=limited make build-pgo

You can see that the built library contains debug info:

$ python
Python 3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pydantic_core
>>> pydantic_core._pydantic_core
<module 'pydantic_core._pydantic_core' from '/home/david/dev/pydantic/pydantic-core/python/pydantic_core/_pydantic_core.cpython-311-x86_64-linux-gnu.so'>
>>> exit()
david@david-pc:~/dev/pydantic/pydantic-core$ file /home/david/dev/pydantic/pydantic-core/python/pydantic_core/_pydantic_core.cpython-311-x86_64-linux-gnu.so
/home/david/dev/pydantic/pydantic-core/python/pydantic_core/_pydantic_core.cpython-311-x86_64-linux-gnu.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=728db72c66fe6364fd9694a6ae4df7aac998d434, with debug_info, not stripped

EDIT: also added the suggestion to set CARGO_PROFILE_RELEASE_DEBUG, originally as line-tables-only and now as limited (maturin had an issue with line-tables-only)

StasEvseev commented 9 months ago

Hey @davidhewitt! Thanks for the comprehensive answer!

We are using the python Docker image. I don't see where the Python build gets stripped, but this is what I see in the container:

/usr/local/bin/python3.12: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=e1466a54058de9be791ef96f61c8e185388684eb, for GNU/Linux 3.2.0, stripped

Link to docker source https://github.com/docker-library/python/blob/b7b91ef359a740a91caeabce414ce4ee70fd2b23/3.11/bookworm/Dockerfile#L44.

I might try to build custom python with your suggested flags.

davidhewitt commented 9 months ago

If I had to guess, the stripping is done as a linker argument via

    LDFLAGS="$(dpkg-buildflags --get LDFLAGS)"; \

bogdandm commented 9 months ago

We also have the same or a similar problem.

18/Oct/2023 13:12:33.283 ERROR [common.components.base.base:303] ../Objects/dictobject.c:1899: bad argument to internal function
Traceback (most recent call last):
  File ".../common/components/base.py", line 299, in _get_data
    result[comp.name] = comp.get_data(context_storage=context_storage, data_storage=data_storage)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../common/components/base.py", line 136, in get_data
    data = self._get_data(context_storage, data_storage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../common/components/base.py", line 212, in _get_data
    data = component.get_data(context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../common/components/base.py", line 80, in get_data
    return self._get_data(context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../tv_site/components/data/request_context.py", line 102, in _get_data
    return self.get_instance_result_model().model_validate(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/pydantic/main.py", line 503, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: ../Objects/dictobject.c:1899: bad argument to internal function
Fatal Python error: Segmentation fault

/* CAUTION: PyDict_SetItem() must guarantee that it won't resize the
 * dictionary if it's merely replacing the value for an existing key.
 * This means that it's safe to loop over a dictionary with PyDict_Next()
 * and occasionally replace a value -- but you can't insert new keys or
 * remove them.
 */
int
PyDict_SetItem(PyObject *op, PyObject *key, PyObject *value)
{
    if (!PyDict_Check(op)) {
        PyErr_BadInternalCall(); // <------ This line
        return -1;
    }
    assert(key);
    assert(value);
    Py_INCREF(key);
    Py_INCREF(value);
    return _PyDict_SetItem_Take2((PyDictObject *)op, key, value);
}

(or just a segfault without any useful message or traceback)

Python 3.11.5
Ubuntu 22.04.3 LTS
pydantic==2.4.2
pydantic_core==2.10.1
gevent==23.9.1
gunicorn==21.2.0

I can reproduce it locally with a single-worker setup. Unfortunately I cannot come up with a minimal code example; it just happens from time to time.

Is there any info that could help you? We already started updating our project to v2, and now we are stuck with half of our models being v1 and the others v2.

davidhewitt commented 8 months ago

@bogdandm does the error ever include the native stack trace? That would be extremely helpful to review where the problem is coming from. Alternatively if you are able to get a core dump (e.g. try running with ulimit -c unlimited) and share relevant parts here that would also greatly help 🙏
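
Where changing the shell's ulimit is awkward (for example behind a process manager), the same limit can be raised from inside the Python process. This is a minimal sketch, assuming a Unix platform where the standard-library resource module is available; the function name is made up for illustration.

```python
import resource


def allow_core_dumps():
    """Raise the soft core-file size limit to the hard limit (hypothetical helper).

    Roughly what `ulimit -c unlimited` does for a shell, but the soft limit
    can only go as high as the hard limit the process already has.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


print(allow_core_dumps())
```

Whether and where a core file is actually written still depends on system settings, such as /proc/sys/kernel/core_pattern on Linux.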

bogdandm commented 8 months ago

@davidhewitt I haven't yet been able to figure out how to get more detailed logs or the usual "core dumped" error (until now I believed that was the default behavior, at least in our Docker environment). I already tried faulthandler.enable(), but it gives just the Python traceback, with no CPython or Rust frames.

But I'll probably try again a little later when I have more time to debug it.

davidhewitt commented 8 months ago

If you have a way to reproduce it locally perhaps we can also discuss a way for me to help debug your code in a confidential environment.

bogdandm commented 8 months ago

Okay, I can reproduce it within gdb, so here is the stack trace:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000529d7a in PyObject_GetIter ()
(gdb) bt
#0  0x0000000000529d7a in PyObject_GetIter ()
#1  0x000000000053b982 in _PyEval_EvalFrameDefault ()
#2  0x00000000005a8368 in ?? ()
#3  0x00007ffff1cd9908 in pyo3::types::any::{impl#1}::get_item::inner () at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/types/any.rs:777
#4  pyo3::types::any::PyAny::get_item<&pyo3::instance::Py<pyo3::types::string::PyString>> () at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/types/any.rs:781
#5  pyo3::types::mapping::PyMapping::get_item<&pyo3::instance::Py<pyo3::types::string::PyString>> () at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/types/mapping.rs:50
#6  _pydantic_core::lookup_key::LookupKey::py_get_mapping_item () at src/lookup_key.rs:163
#7  0x00007ffff1d47a57 in _pydantic_core::validators::model_fields::{impl#1}::validate::{closure#3}<pyo3::types::any::PyAny> () at src/validators/model_fields.rs:181
#8  _pydantic_core::validators::validation_state::ValidationState::with_new_extra<core::ops::control_flow::ControlFlow<_pydantic_core::errors::line_error::ValError, ()>, _pydantic_core::validators::model_fields::{impl#1}::validate::{closure_env#3}<pyo3::types::any::PyAny>> () at src/validators/validation_state.rs:37
#9  _pydantic_core::validators::model_fields::{impl#1}::validate<pyo3::types::any::PyAny> () at src/validators/model_fields.rs:298
#10 0x00007ffff1d4471c in _pydantic_core::validators::model::ModelValidator::validate_construct<pyo3::types::any::PyAny> () at src/validators/model.rs:277
#11 0x00007ffff1d47afe in _pydantic_core::validators::model_fields::{impl#1}::validate::{closure#3}<pyo3::types::any::PyAny> () at src/validators/model_fields.rs:197
#12 _pydantic_core::validators::validation_state::ValidationState::with_new_extra<core::ops::control_flow::ControlFlow<_pydantic_core::errors::line_error::ValError, ()>, _pydantic_core::validators::model_fields::{impl#1}::validate::{closure_env#3}<pyo3::types::any::PyAny>> () at src/validators/validation_state.rs:37
#13 _pydantic_core::validators::model_fields::{impl#1}::validate<pyo3::types::any::PyAny> () at src/validators/model_fields.rs:298
#14 0x00007ffff1d4471c in _pydantic_core::validators::model::ModelValidator::validate_construct<pyo3::types::any::PyAny> () at src/validators/model.rs:277
#15 0x00007ffff1e17825 in _pydantic_core::validators::SchemaValidator::_validate<pyo3::types::any::PyAny> () at src/validators/mod.rs:338
#16 _pydantic_core::validators::SchemaValidator::validate_python () at src/validators/mod.rs:160
#17 0x00007ffff1e18f9f in _pydantic_core::validators::SchemaValidator::__pymethod_validate_python__ () at src/validators/mod.rs:112
#18 0x00007ffff1c8414c in pyo3::impl_::trampoline::fastcall_with_keywords::{closure#0} () at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/impl_/trampoline.rs:41
#19 pyo3::impl_::trampoline::trampoline::{closure#0}<pyo3::impl_::trampoline::fastcall_with_keywords::{closure_env#0}, *mut pyo3_ffi::object::PyObject> ()
    at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/impl_/trampoline.rs:181
#20 std::panicking::try::do_call<pyo3::impl_::trampoline::trampoline::{closure_env#0}<pyo3::impl_::trampoline::fastcall_with_keywords::{closure_env#0}, *mut pyo3_ffi::object::PyObject>, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> () at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502
#21 std::panicking::try<core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>, pyo3::impl_::trampoline::trampoline::{closure_env#0}<pyo3::impl_::trampoline::fastcall_with_keywords::{closure_env#0}, *mut pyo3_ffi::object::PyObject>> ()
    at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466
#22 std::panic::catch_unwind<pyo3::impl_::trampoline::trampoline::{closure_env#0}<pyo3::impl_::trampoline::fastcall_with_keywords::{closure_env#0}, *mut pyo3_ffi::object::PyObject>, core::result::Result<*mut pyo3_ffi::object::PyObject, pyo3::err::PyErr>> () at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142
#23 pyo3::impl_::trampoline::trampoline<pyo3::impl_::trampoline::fastcall_with_keywords::{closure_env#0}, *mut pyo3_ffi::object::PyObject> () at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/impl_/trampoline.rs:181
#24 0x00007ffff1e17f30 in pyo3::impl_::trampoline::fastcall_with_keywords () at /home/bogdan-dm/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.2/src/impl_/trampoline.rs:52
#25 _pydantic_core::validators::_::{impl#0}::py_methods::ITEMS::trampoline () at src/validators/mod.rs:112
#26 0x0000000000579007 in ?? ()
#27 0x0000000000547575 in PyObject_Vectorcall ()
#28 0x0000000000539c51 in _PyEval_EvalFrameDefault ()
#29 0x0000000000581d37 in ?? ()
#30 0x0000000000581743 in ?? ()
#31 0x000000000056c931 in PyObject_Call ()
#32 0x000000000053dc14 in _PyEval_EvalFrameDefault ()
#33 0x00000000005624b3 in _PyFunction_Vectorcall ()
#34 0x000000000053dc14 in _PyEval_EvalFrameDefault ()
#35 0x00000000005624b3 in _PyFunction_Vectorcall ()
#36 0x00007ffff6ebd63d in __Pyx_PyObject_Call (kw=0x7fffb75d2900, arg=0x7fffb75483b0, func=0x7fffba0d07c0) at src/gevent/greenlet.c:27114
#37 __pyx_pf_6gevent_17_gevent_cgreenlet_8Greenlet_42run (__pyx_v_self=0x7fffb75b28e0) at src/gevent/greenlet.c:16087
#38 __pyx_pw_6gevent_17_gevent_cgreenlet_8Greenlet_43run (__pyx_v_self=0x7fffb75b28e0, unused=<optimized out>) at src/gevent/greenlet.c:15988
#39 0x00000000005820f7 in ?? ()
#40 0x0000000000581778 in ?? ()
#41 0x00007ffff72f2bf2 in greenlet::UserGreenlet::inner_bootstrap (this=0x7fffb75699b0, origin_greenlet=<optimized out>, run=0x7fffb75d2fc0) at src/greenlet/TUserGreenlet.cpp:460
#42 0x00007ffff72f4c62 in greenlet::UserGreenlet::g_initialstub (this=0x7fffb75699b0, mark=0x7fffffff9388) at src/greenlet/TUserGreenlet.cpp:311
#43 0x00007ffff72f38a5 in greenlet::UserGreenlet::g_switch (this=0x7fffb75699b0) at src/greenlet/TUserGreenlet.cpp:179
#44 0x00000000005820f7 in ?? ()
#45 0x0000000000581705 in ?? ()
#46 0x00007ffff7b9a795 in gevent_call (loop=0x7ffff45097e0, cb=0x7fffb7478b40) at src/gevent/libev/callbacks.c:182
#47 0x00007ffff7bc6860 in __pyx_f_6gevent_5libev_8corecext_4loop__run_callbacks (__pyx_v_self=0x7ffff45097e0) at src/gevent/libev/corecext.c:8593
#48 0x00007ffff7bca75a in gevent_loop_run_callbacks (__pyx_v_loop=__pyx_v_loop@entry=0x7ffff45097e0) at src/gevent/libev/corecext.c:21052
#49 0x00007ffff7b9ab42 in gevent_run_callbacks (_loop=<optimized out>, watcher=0x7ffff45097f8, revents=<optimized out>) at src/gevent/libev/callbacks.c:225
#50 0x00007ffff7b9ac9b in ev_invoke_pending (loop=0x7ffff7bddf00 <default_loop_struct>) at /tmp/build/gevent/deps/libev/ev.c:3770
#51 0x00007ffff7bc7c7b in ev_run (loop=0x7ffff7bddf00 <default_loop_struct>, flags=0) at /tmp/build/gevent/deps/libev/ev.c:4063
#52 0x00007ffff7bc842e in __pyx_pf_6gevent_5libev_8corecext_4loop_14run (__pyx_v_once=<optimized out>, __pyx_v_nowait=<optimized out>, __pyx_v_self=0x7ffff45097e0) at src/gevent/libev/corecext.c:10119
#53 __pyx_pw_6gevent_5libev_8corecext_4loop_15run (__pyx_v_self=0x7ffff45097e0, __pyx_args=<optimized out>, __pyx_nargs=<optimized out>, __pyx_kwds=<optimized out>) at src/gevent/libev/corecext.c:10069
#54 0x0000000000547575 in PyObject_Vectorcall ()
#55 0x0000000000539c51 in _PyEval_EvalFrameDefault ()
#56 0x0000000000581d37 in ?? ()
#57 0x0000000000581778 in ?? ()
#58 0x00007ffff72f2bf2 in greenlet::UserGreenlet::inner_bootstrap (this=0x7ffff49fedf0, origin_greenlet=<optimized out>, run=0x7fffefe84cc0) at src/greenlet/TUserGreenlet.cpp:460
#59 0x00007ffff72f4c62 in greenlet::UserGreenlet::g_initialstub (this=0x7ffff49fedf0, mark=0x7fffffff9b38) at src/greenlet/TUserGreenlet.cpp:311
#60 0x00007ffff72f38a5 in greenlet::UserGreenlet::g_switch (this=0x7ffff49fedf0) at src/greenlet/TUserGreenlet.cpp:179
#61 0x00007fffffff9d30 in ?? ()
#62 0x0000000000000000 in ?? ()

I can try to compile Python with more debug info too if you need it. ~But not sure about Rust, I'm not familiar with it at all and lines 3-10 seem to be pretty important.~ Never mind, the command from your message above works out of the box, so I updated the stack trace.

Lib versions:

pydantic-core - commit 1a966d55581e1a1379cfe6274da6323c9786aefb
pydantic==2.4.2
gevent==23.9.1 (installed with cython==3.0.2 and `--no-binary :all:` flag)
Python 3.11.5

stable-x86_64-unknown-linux-gnu (default)
rustc 1.73.0 (cc66ad468 2023-10-03)

Operating System: Ubuntu 22.04.3 LTS              
          Kernel: Linux 6.2.0-35-generic

P.S. This crash is not gunicorn-related: I used the local django runserver and enabled gevent on server startup (from gevent import monkey; monkey.patch_all())

davidhewitt commented 8 months ago

Hmm, so looks like the call to PyObject_GetItem is crashing, which is quite unexpected. Do you know anything about the model which is being validated when the crash occurs?

That might also imply there is memory corruption earlier in the process. Are you willing to run under valgrind? (I can help figure out an invocation for this.) We should probably also add valgrind to the pydantic-core CI.

bogdandm commented 8 months ago

Nothing specific. It is actually one very large model that describes a whole page of one site. I also suspect some memory corruption: at some point I had weird objects that produced totally random errors. When I started investigating them (obj.__dict__ and other usual stuff), they had random properties from other objects. For example, a simple lazy translation string (gettext_lazy from django) had a _proxy____kw attribute holding some random object from the User model. I have not seen these errors in quite a while, so maybe this was some sort of cache corruption.

I can try valgrind; in a local environment it is probably safe enough. You can contact me on LinkedIn (link in my GitHub profile).

davidhewitt commented 8 months ago

I was able to run valgrind on the pydantic-core test suite using a virtual environment on ubuntu with the following command:

valgrind --leak-check=full --track-origins=yes --log-file=valgrind-output.txt python -m pytest

The contents of valgrind-output.txt suggested a couple memory leaks, which look like globally cached strings, so not of relevant concern here. I'll follow up on those separately another time. Hopefully if you can repeat the same thing but replace python -m pytest with your command which produces the repro under gdb, we will identify a cause of your crash. You can share any results with me confidentially over linkedin.

If you're getting a lot of messages, you might want to check if you have /usr/lib/valgrind/python3.supp present, I understand this is needed due to Python's internal memory allocator.

StasEvseev commented 8 months ago

@bogdandm Thanks for jumping on the issue and helping with the investigation! For me it is a bit troublesome to reproduce in a local environment (due to a complex setup). Do you need any help to progress further?

bogdandm commented 8 months ago

I contacted @davidhewitt and gave him all the logs that I was able to collect from my project. So now all hope is that he will be able to figure it all out 🙏🏻

davidhewitt commented 8 months ago

Yep, I'm looking into this at present and hope to have some progress within a few weeks. Will keep posted here.

rafales commented 7 months ago

Just ran into similar issues. M1, Python 3.12.0 and 3.12.1. Pydantic 2.5.2. It only happens with gevent monkey-patched. I also see that we all are using flask.

rafales commented 7 months ago

So I am getting multiple errors, they seem to be pretty random, but it's mostly SIGSEGV/SIGBUS.

I am also running into SystemError: ../Objects/dictobject.c:1899: bad argument to internal function.

I compiled a debug version of python, and while those errors still happen - a new one started to appear:

Assertion failed: (Py_REFCNT((PyObject*)mp) > 0), function _PyDict_NotifyEvent, file pycore_dict.h, line 169.
Fatal Python error: Aborted

Current thread 0x00000001da509000 (most recent call first):
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/pydantic/main.py", line 503 in model_validate
  File "/Users/rafal/Code/redacted/app/orgs/rpc.py", line 176 in rpc_get_members
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 113 in __call__
  File "/Users/rafal/Code/redacted/app/core/openrpc/server.py", line 193 in _execute_method
  File "/Users/rafal/Code/redacted/app/core/openrpc/server.py", line 162 in execute_by_data
  File "/Users/rafal/Code/redacted/app/core/openrpc/flask.py", line 38 in post
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/flask/views.py", line 190 in dispatch_request
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/flask/views.py", line 115 in view
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/flask/app.py", line 852 in dispatch_request
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/flask/app.py", line 867 in full_dispatch_request
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/flask/app.py", line 1455 in wsgi_app
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/flask/app.py", line 1478 in __call__
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/werkzeug/debug/__init__.py", line 330 in debug_application
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/werkzeug/serving.py", line 325 in execute
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/werkzeug/serving.py", line 362 in run_wsgi
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/http/server.py", line 424 in handle_one_request
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/http/server.py", line 436 in handle
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/werkzeug/serving.py", line 390 in handle
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/socketserver.py", line 761 in __init__
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/socketserver.py", line 362 in finish_request
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/socketserver.py", line 692 in process_request_thread
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/threading.py", line 1010 in run
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/Users/rafal/.pyenv/versions/3.12.1-debug/lib/python3.12/threading.py", line 1030 in _bootstrap
  File "/Users/rafal/Code/redacted/.venv/lib/python3.12/site-packages/gevent/greenlet.py", line 908 in run

Extension modules: _cffi_backend, greenlet._greenlet, markupsafe._speedups (total: 3)

It's a bit weird that faulthandler does not list pydantic-core among the extension modules. Also, I'm running this with the following environment variables to reduce the number of C extensions used: export GEVENT_LOOP=libev-cffi PURE_PYTHON=1 DISABLE_SQLALCHEMY_CEXT_RUNTIME=1

samuelcolvin commented 7 months ago

I was able to reproduce this error with @rafales's example from #8392. Thanks so much @rafales, that's really helpful.

@davidhewitt and I will do some further digging, specifically:

samuelcolvin commented 7 months ago

I think it's very likely this is related to https://github.com/gevent/gevent/issues/1819.

My dumb theory: gevent is switching thread when pydantic-core/pyo3 effectively calls getattr on the object, meaning code that expects to be single threaded is being called in different threads.

davidhewitt commented 7 months ago

Ok, some progress here: I can isolate the crash to just PyO3 + gevent, which I've documented in https://github.com/PyO3/pyo3/issues/3668

I will work to figure out next steps from here. We have at least one pathway to a solution (in the new PyO3 API) but maybe there are mitigations we can get across the ecosystem faster.

davidhewitt commented 6 months ago

To follow up with the current state of things: in PyO3 we felt that mitigations are probably impractical from a performance standpoint so we are busy getting the new PyO3 API to a point where it can be used by projects to migrate. This might be a few weeks off still depending on review speed.

rackdon commented 5 months ago

Any update on the state of the problem?

samuelcolvin commented 5 months ago

We need to wait for the new pyo3 API/GIL pool. That's getting pretty close; check the progress in the pyo3 repo.

davidhewitt commented 3 months ago

With the release now done in PyO3 0.21, and pydantic-core updated, I can no longer reproduce the crash on pydantic main. I will close this issue, hopefully people experiencing problems here can also confirm it's fixed with pydantic main. We will also release this all soon as Pydantic 2.7!