
Memory leak in pydantic model_validate from sqlalchemy model #9429

Open pmithrandir opened 1 month ago

pmithrandir commented 1 month ago


Description

When creating a Pydantic object from a SQLAlchemy object using model_validate, I only ever see memory usage increase. I dug into the issue and I think there might be a problem with the model_validate method.

The following test creates a Pydantic model and a SQLAlchemy model the way FastAPI does. When used that way, the memory is never released.

When using model_validate(orm_object.__dict__) the memory is released, but I don't think we should be forced to create objects from a dict, and it doesn't work well with nested objects.
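For illustration, a minimal sketch of the two call styles being compared, using the models from the example code below (note that __dict__ only exposes already-loaded, flat attribute values, which is presumably why nested objects are lost):

# Reported to leak: validating the SQLAlchemy instance directly
# (relies on from_attributes=True in the model config)
pydantic_object = PydanticModel.model_validate(orm_object)

# Reported workaround: memory is released, but __dict__ is flat, so
# nested relationship objects are not converted
pydantic_object = PydanticModel.model_validate(orm_object.__dict__)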

Do you have any idea?

Regards, Pierre

PS: in my application, this results in 4-5 GB of RAM being used during DB saving... never released.

Example Code

import gc

import psutil
import pytest
from pydantic import BaseModel, Field, ConfigDict
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session, sessionmaker

"""
SQL ALCHEMY Definition
"""
TestBase = declarative_base()

class OrmClass(TestBase):
    __tablename__ = "orm_class"

    id = Column(Integer, primary_key=True, index=True)
    value = Column(String(100), nullable=False)

# Pydantic schemas, following the FastAPI create/read pattern
class PydanticModelBaseABC(BaseModel):
    model_config = ConfigDict(from_attributes=True, validate_assignment=False)
    value: str

class PydanticModelCreate(PydanticModelBaseABC):
    ...

class PydanticModel(PydanticModelBaseABC):
    id: int = Field(
        description='Auto-assigned object id - retrieved from orm-assigned `id` attribute')

@pytest.fixture(scope='function')
def db_test_session():
    # In-memory SQLite engine and session, created the way FastAPI apps typically do
    engine = create_engine(url="sqlite://")
    TestBase.metadata.create_all(bind=engine)
    db: Session = sessionmaker(autocommit=False, autoflush=False, bind=engine)()
    yield db
    db.close()
    engine.dispose()

class TestPydanticSqlAlchemyMemoryLeak:
    def test_memory_leak_pure_pydantic_sqlalchemy(self, db_test_session):
        process = psutil.Process()
        gc.collect()
        # Build the ORM row from a "create" schema, as a FastAPI endpoint would
        pydantic_create_object = PydanticModelCreate(value="Memory House 2")
        orm_object = OrmClass(**pydantic_create_object.model_dump())
        db_test_session.add(orm_object)
        db_test_session.flush()
        db_test_session.refresh(orm_object)
        memory_start = process.memory_info().rss
        # Validate the ORM instance directly, then drop the result
        pydantic_object = PydanticModel.model_validate(orm_object)
        del pydantic_object
        # Collect all three GC generations before re-measuring
        gc.collect(0)
        gc.collect(1)
        gc.collect(2)
        memory_after_del = process.memory_info().rss
        # Expectation: rss returns exactly to its pre-validation value
        assert memory_after_del == memory_start
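One caveat worth noting about the assertion above: rss measures pages held by the process, and both CPython's allocator and the platform's malloc may retain freed memory for reuse rather than returning it to the OS, so exact rss equality can fail even without a true leak. A sketch of a tracemalloc-based variant of the same check, which counts Python-level allocations instead (an assumed approach, reusing the fixture and models above, not part of the original report):

import tracemalloc

def check_leak_with_tracemalloc(db_test_session):
    # Same setup as the test above
    orm_object = OrmClass(value="Memory House 2")
    db_test_session.add(orm_object)
    db_test_session.flush()
    db_test_session.refresh(orm_object)

    gc.collect()
    tracemalloc.start()
    snapshot_start = tracemalloc.take_snapshot()

    pydantic_object = PydanticModel.model_validate(orm_object)
    del pydantic_object
    gc.collect()

    snapshot_end = tracemalloc.take_snapshot()
    tracemalloc.stop()

    # Sum of allocations still live after the del; a genuine leak shows up
    # as a positive size_diff attributed to model_validate internals.
    # (tracemalloc itself allocates a little, so a small tolerance is more
    # realistic than demanding an exact zero.)
    stats = snapshot_end.compare_to(snapshot_start, "lineno")
    return sum(stat.size_diff for stat in stats)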

Python, Pydantic & OS Version

sqlalchemy 1.4
pydantic 2.7.1
python 3.11.5
sydney-runkle commented 1 month ago

@pmithrandir,

Thanks for the report. I haven't taken an in-depth look yet, but maybe @davidhewitt has some insight.

pmithrandir commented 1 month ago

Hi,

I'll wait for his feedback. On a side note, when calling get_referrers on the pydantic_object, I get a link to the ORM object. That doesn't look right.
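For context, a minimal sketch of that check, assuming the objects from the test above (gc.get_referrers lists the objects still holding a reference to its argument):

import gc

pydantic_object = PydanticModel.model_validate(orm_object)
for referrer in gc.get_referrers(pydantic_object):
    print(type(referrer), referrer)
# Per the report above, the output includes the ORM object, which would
# mean the two objects keep each other alive until a full GC pass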

pmithrandir commented 1 month ago

Hello @davidhewitt, do you have any hints to share?

davidhewitt commented 1 month ago

I have taken a brief look, but I cannot reproduce this on any combination of SQLAlchemy 1.4 or 2, Python 3.11 or 3.12, and pydantic 2.7 or main.

If you have another example of what you think is the problematic interaction I can take a further look.

pmithrandir commented 1 month ago

Hello,

Thank you for testing it.

You mean that the test passes on your configuration?

We have SQLAlchemy 1.4, pydantic 2.7.1, and Python 3.11.5 on Windows (servers on Red Hat).

My expected result would be that all memory is freed up after the Pydantic object is removed.

If you could share your Python version, that would be nice. It would help me test whether something changes depending on it.

Regards, Pierre

davidhewitt commented 1 month ago

To check, you're testing on Windows? I tried 3.11 and 3.12, both on Ubuntu.

davidhewitt commented 1 month ago

(Yes, the test passes for me)

pmithrandir commented 1 month ago

Which version of 3.11 are you using, please? I'll try with the same one to see if something pops up.

davidhewitt commented 1 month ago

3.11.6

pmithrandir commented 1 month ago

Hi,

We tested on Windows with 3.11.5, 3.11.6, and 3.11.9, and it fails each time.

On our Linux server, it passes on 3.11.5. It looks like Python is not behaving the same way on Windows and Linux.
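If that is the case, one possible explanation (an assumption, not a confirmed diagnosis) is that the objects are freed correctly on both platforms, but the Windows C runtime does not return the freed pages to the OS, so rss stays flat on Windows while dropping on Linux. Counting tracked Python objects sidesteps the allocator entirely:

import gc

gc.collect()
objects_before = len(gc.get_objects())

pydantic_object = PydanticModel.model_validate(orm_object)
del pydantic_object
gc.collect()

objects_after = len(gc.get_objects())
# If the Python objects really are freed, these counts should match on
# every platform, regardless of whether the process's rss shrinks
print(objects_before, objects_after)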

davidhewitt commented 1 month ago

I will try to reproduce on Windows some time soon, most likely next week.

donnellythomas commented 3 weeks ago

Any luck here for anyone? I've also tracked a memory leak down to model_validate, and I'm only using Pydantic with JSON objects. Tracked using memory_profiler:

> Line #    Mem usage    Increment  Occurrences   Line Contents
>    243    816.3 MiB      0.0 MiB           1           response = await github_trees_request(owner, repo, default_branch, session)
>    244    816.3 MiB      0.0 MiB           1           body = response.body
>    245                                                 # body = json.loads(body)
>    246                                                 # tree = []
>    247                                                 # for item in body["tree"]:
>    248                                                 #     tree.append(BaseGithubTreesItem(path=item["path"], mode=item["mode"], type=item["type"], sha=item["sha"], size=item["size"], url=item["url"]))
>    249    816.8 MiB      0.5 MiB           1           model = BaseGithubTreesResponse.model_validate_json(response.body)
>    250    816.8 MiB      0.0 MiB           1           return model

Weirdly enough, if I uncomment the commented lines, the leak goes away, even though tree is not used anywhere. If I remove size=item["size"] from the BaseGithubTreesItem initialization, the leak comes back on line 249.
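The models themselves are not shown in the comment; a plausible minimal reconstruction (hypothetical, inferred from the field names above and the shape of the GitHub git/trees API) would be:

from typing import Optional
from pydantic import BaseModel

class BaseGithubTreesItem(BaseModel):
    path: str
    mode: str
    type: str
    sha: str
    size: Optional[int] = None  # blob entries have a size; tree entries do not
    url: str

class BaseGithubTreesResponse(BaseModel):
    sha: str
    url: str
    tree: list[BaseGithubTreesItem]
    truncated: bool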

pmithrandir commented 3 weeks ago

Hi,

We still have it, and so far we have been unable to find a way to resolve it. We ended up spawning subprocesses so that the memory is cleaned up when the subprocess is killed.
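For anyone hitting the same issue, a minimal sketch of that workaround (hypothetical helper names; the actual application code is not shown):

import multiprocessing

def _do_db_save(payload):
    # ... build ORM objects, call model_validate, commit, etc. ...
    pass

def save_with_memory_cleanup(payload):
    # All memory allocated during the save is returned to the OS when the
    # child process exits. On Windows and macOS the child is spawned, so
    # the target function must be importable at module level.
    process = multiprocessing.Process(target=_do_db_save, args=(payload,))
    process.start()
    process.join()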