jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.43k stars 75 forks source link

Segmentation fault when using msgspec.Meta #717

Open sosna opened 3 months ago

sosna commented 3 months ago

Description

Hey there,

First of all, thanks for this very nice library.

I wanted to report an issue, that seems to be related to msgspec, but you will be able to better assess whether that is indeed the case.

Our CI pipelines in GitHub started failing randomly (say, 20% failure rate), with a segmentation fault error. The failures disappeared after rerunning the failing jobs (one or more times depending on our luck). The failure was when running pytest (using python -m pytest). After setting python execution to a verbose mode, we noticed that, when it failed, then it always failed when destroying the following module: pysdmx.model.types. This module uses msgspec.Meta.

(many more lines above)
# destroy pysdmx.model.types

Fatal Python error: Segmentation fault
2024-08-06T14:19:01.4875904Z 
2024-08-06T14:19:01.4876092Z Current thread 0x00007f71d8383000 (most recent call first):
2024-08-06T14:19:01.4876697Z   Garbage-collecting
2024-08-06T14:19:01.4876943Z   <no Python frame>
2024-08-06T14:19:02.9130965Z /home/runner/work/_temp/7afba60b-b9e9-4f16-9741-aa8c0d7ce98d.sh: line 1:  1920 Segmentation fault      (core dumped) poetry run python -X faulthandler -v -m pytest -p no:faulthandler --full-trace tests/api/fmr

This module was very simple:

from typing import Annotated

from msgspec import Meta

NC_NAME_ID_TYPE = Annotated[str, Meta(pattern=r"^[A-Za-z][A-Za-z\d_-]*$")]

Once we removed this, our CI pipelines started working again, 100% of the time. So, our suspicion is that there might be cases when using msgspec.Meta prevents created objects from being destroyed, but this is just a guess based on the log above and the fact that the error did not occur again when we stopped using that class. However, it could also be that our suspicion is wrong. In which case, feel free to close this issue :-).

Thanks, Xavier

jcrist commented 3 months ago

Hi, thanks for opening this. Segfaults are indeed concerning.

I'm unable to reproduce this locally. Doing an audit of the dealloc/visit/clear methods for Meta turns up one code path where we were failing to decref properly, which would lead to a leak of the regex object on dealloc, not a segfault. Still worth fixing, but not the cause.

Can you share more about what you changed (how was NC_NAME_ID_TYPE being used before, what kinds of objects referenced it, ...)?

Another debugging strategy - do you see the same segfaults if you replace pattern=... with some other constraint like min_length=1 or something? That would help determine if the segfault was related to our regex handling.

Alternatively, if you were able to provide a reproducible example that demonstrates the segfault, that would help me debug locally :).

sosna commented 3 months ago

Hi @jcrist. Thanks for your swift reply :)

Re. how was NC_NAME_ID_TYPE being used before: It was used as type hint on a property of a Struct. I simply replaced the type to str for the time being. You can see an example of how it was used here: https://github.com/bis-med-it/pysdmx/blob/60692119bc5f6a4998cf7c3f85e4aa040d102414/src/pysdmx/api/qb/data.py#L337

Re. other types of Meta: Indeed, I searched for other instances and they appear to work fine. For example, we have snippets like the one below, and our CI pipeline does not complain. So, it could indeed be related to the regex handling, as you hinted.

first_n_obs: Optional[Annotated[int, msgspec.Meta(gt=0)]] = None

Re. the reproducible example: I have forked the project into my account to continue the debugging. You can access the list of commits in the appropriate branch here: https://github.com/sosna/pysdmx/commits/various_fixes/. You can see that the CI job for last commit (which removes the NC_NAME_ID_TYPE) succeeded (I ran it multiple times, to be sure), while the CI jobs for all the previous ones failed.

I hope this helps, but else, let me know :)