Instrumentation of pydantic validation - Slow on big response size

jules-ch commented 3 months ago

Question

Hello thanks for this package experimenting on instrumenting some api I'm developping.

I just noticed that instrumenting endpoint with validation that validate big response in size. Here 1.2MB. Validation takes 10x more time to process.

I have some Server-Timing headers to help me + Opentelmetry traces.

With pydantic instrumentation:

Without:

samuelcolvin commented 3 months ago

Hi, we'll need to see the model you're using, and an example from of the input data to be able to help.

Also, how are you instrumenting the endpoint and validation? How many spans are being generated?

alexmojaki commented 3 months ago

Example repro:

from __future__ import annotations

from datetime import date, datetime
from enum import Enum
from typing import Dict, List, Optional
from uuid import UUID

from polyfactory.factories.pydantic_factory import ModelFactory
from pydantic import BaseModel, EmailStr, HttpUrl, confloat, conint

import logfire

logfire.configure(
    pydantic_plugin=logfire.PydanticPlugin(
        record='all',
        # exclude={'Platform'},
    )
)

class Status(str, Enum):
    ACTIVE = 'active'
    INACTIVE = 'inactive'
    PENDING = 'pending'

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str
    country: str

class Contact(BaseModel):
    email: EmailStr
    phone: str
    address: Address

print(
    Contact(
        email='foo@bar.com',
        phone='123-456-7890',
        address=Address(street='123 Main St', city='Springfield', state='IL', zip_code='62701', country='USA'),
    )
)

class SocialMedia(BaseModel):
    platform: str
    username: str
    url: HttpUrl

class Education(BaseModel):
    institution: str
    degree: str
    field_of_study: str
    start_date: date
    end_date: Optional[date]
    gpa: confloat(ge=0, le=4)

class WorkExperience(BaseModel):
    company: str
    position: str
    start_date: date
    end_date: Optional[date]
    responsibilities: List[str]
    achievements: List[str]

class Skill(BaseModel):
    name: str
    level: conint(ge=1, le=10)
    years_of_experience: float

class Project(BaseModel):
    name: str
    description: str
    start_date: date
    end_date: Optional[date]
    technologies: List[str]
    team_size: int
    role: str
    url: Optional[HttpUrl]

class Certificate(BaseModel):
    name: str
    issuer: str
    date_issued: date
    expiration_date: Optional[date]
    credential_id: str

class Profile(BaseModel):
    id: UUID
    username: str
    password: str
    first_name: str
    last_name: str
    date_of_birth: date
    gender: str
    nationality: str
    bio: str
    profile_picture: HttpUrl
    cover_photo: Optional[HttpUrl]
    status: Status
    created_at: datetime
    updated_at: datetime
    is_verified: bool
    contact: Contact
    social_media: List[SocialMedia]
    education: List[Education]
    work_experience: List[WorkExperience]
    skills: List[Skill]
    projects: List[Project]
    certificates: List[Certificate]
    languages: Dict[str, conint(ge=1, le=5)]
    interests: List[str]
    achievements: List[str]
    references: List[str]

class Company(BaseModel):
    id: UUID
    name: str
    industry: str
    founded_year: int
    description: str
    website: HttpUrl
    logo: HttpUrl
    headquarters: Address
    employees: conint(ge=1)
    revenue: Optional[float]
    status: Status
    contact: Contact
    social_media: List[SocialMedia]

class Job(BaseModel):
    id: UUID
    title: str
    company: Company
    location: Address
    job_type: str
    salary_range: Dict[str, float]
    description: str
    requirements: List[str]
    benefits: List[str]
    posted_date: date
    deadline: date
    is_remote: bool
    experience_level: str
    education_level: str
    skills_required: List[str]
    application_url: HttpUrl

class Application(BaseModel):
    id: UUID
    job: Job
    applicant: Profile
    resume: HttpUrl
    cover_letter: str
    applied_date: datetime
    status: Status
    interview_date: Optional[datetime]
    notes: Optional[str]
    score: Optional[float]

class Review(BaseModel):
    id: UUID
    reviewer: Profile
    reviewee: Profile
    job: Job
    rating: confloat(ge=0, le=5)
    comment: str
    date: datetime

class Message(BaseModel):
    id: UUID
    sender: Profile
    recipient: Profile
    content: str
    timestamp: datetime
    is_read: bool
    attachments: Optional[List[HttpUrl]]

class Notification(BaseModel):
    id: UUID
    user: Profile
    type: str
    content: str
    timestamp: datetime
    is_read: bool
    related_object: Optional[Dict[str, str]]

class Platform(BaseModel):
    name: str
    version: str
    users: List[Profile]
    companies: List[Company]
    jobs: List[Job]
    applications: List[Application]
    reviews: List[Review]
    messages: List[Message]
    notifications: List[Notification]
    active_users: conint(ge=0)
    total_job_postings: conint(ge=0)
    total_applications: conint(ge=0)
    average_rating: confloat(ge=0, le=5)
    last_updated: datetime

class PlatformFactory(ModelFactory[Platform]): ...

platform = PlatformFactory.build()
platform.users = platform.users * 50
platform.companies = platform.companies * 50
platform.jobs = platform.jobs * 50
platform.applications = platform.applications * 50
platform.reviews = platform.reviews * 50
platform.messages = platform.messages * 50
platform.notifications = platform.notifications * 50

dump = platform.model_dump(mode='json')
print(len(str(dump)))  # about 1 MB
with logfire.span('timer'):  # on the order of 1 second
    platform2 = Platform(**dump)

Uncommenting exclude={'Platform'} at the top makes this fast again. I recommend doing something similar for your response model.

There might be things we can do to make this faster, but ultimately there's a lot of processing when logging any large object, and the pydantic plugin logs both the input and the output. In addition, huge attributes like this ultimately get forcibly truncated, so until we build more functionality there isn't much benefit in logging them.

jules-ch commented 3 months ago

Here is an example:


from typing import Annotated

from pydantic import BaseModel, Field
import orjson

import numpy as np
import logfire

logfire.configure(
    token="[redacted]",
    service_name="test_pydantic",
)

class TrackStreamResponse(BaseModel):
    latlon: Annotated[
        # list[tuple[float, float]],
        list[list[float]],
        Field(
            examples=[[[43.5, 5.2]]],
            default_factory=list,
            description="Position stream as list of latitude & longitude",
        ),
    ]

n = 40000
# Generate random latitude and longitude values
latitudes = np.random.uniform(-90, 90, n)
longitudes = np.random.uniform(-180, 180, n)

# Combine them into an n x 2 array
lat_lon_array = np.column_stack((latitudes, longitudes))
print("Array size: ",lat_lon_array.nbytes / 2**20, "MiB")

content = {"latlon": lat_lon_array}

json_ = orjson.dumps(
            content,
            option=orjson.OPT_NON_STR_KEYS | orjson.OPT_SERIALIZE_NUMPY,
        )
print("JSON size: ",len(json_) / 2**20, "MiB")
with logfire.span('Validation span') as span:
    span.set_attribute("model.size", f"{len(json_) / 2**20:.6f} MiB")
    TrackStreamResponse.model_validate_json(json_)

jules-ch commented 3 months ago

Yes I get that logging and exporting those attributes in traces is the bottleneck. I think this should be options in the Pydantic plugin. Since it is dependent of user data and is potential real life scenario. Might be a good idea to truncate, limit size or just exclude attributes based on some options.

alexmojaki commented 3 months ago

Like I said, there is an option which I recommend: exclude TrackStreamResponse from the pydantic plugin.

jules-ch commented 3 months ago

If I do that I don't have any traces about pydantic validation for this model 😅, thanks for your answer anyway. I will just use manual spans, that will be easier this way.

Thank you :)

samuelcolvin commented 3 months ago

We should make it easier to skip parts or a model, or only send input or validated data, or just errors.

What would your preference be @jules-ch?

jules-ch commented 3 months ago

I might be biased here. I see several options:

Keep things minimal and opt in for input data in errors, opt in for input and result in validation success.
Opt out
truncate errors like it can be done in the console with pydantic (trickier)

Anyway it can be easily done with pydantic machinery. I think we need to be wary of any userdata that can clog the Opentelemetry exporter.

https://github.com/pydantic/logfire/blob/7aec158afec9d1e2d1cb5c4b6520c5e09ba3a891/logfire/integrations/pydantic.py#L186-L215

errors function has a kwargs include_input

jules-ch commented 3 months ago

For truncation Opentelemetry SDKs comes with this built in using OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT env variable. Though in this case, it seems that the generations of the strings might be the bottleneck.

It is Infinite by default

alexmojaki commented 1 month ago

Another user wanting to exclude a specific field: https://pydanticlogfire.slack.com/archives/C06EDRBSAH3/p1729156611944299

0xRaduan commented 1 month ago

Thanks, @alexmojaki - yep, that's me. Would be amazing, if I can log my objects, and define on some level the fields I want to exclude.

For now I am just going to log all needed fields of an object on my end.

pydantic / logfire