jules-ch opened this issue 3 months ago
Hi, we'll need to see the model you're using, and an example of the input data, to be able to help.
Also, how are you instrumenting the endpoint and validation? How many spans are being generated?
Example repro:
from __future__ import annotations

from datetime import date, datetime
from enum import Enum
from typing import Dict, List, Optional
from uuid import UUID

from polyfactory.factories.pydantic_factory import ModelFactory
from pydantic import BaseModel, EmailStr, HttpUrl, confloat, conint

import logfire

logfire.configure(
    pydantic_plugin=logfire.PydanticPlugin(
        record='all',
        # exclude={'Platform'},
    )
)


class Status(str, Enum):
    ACTIVE = 'active'
    INACTIVE = 'inactive'
    PENDING = 'pending'


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str
    country: str


class Contact(BaseModel):
    email: EmailStr
    phone: str
    address: Address


print(
    Contact(
        email='foo@bar.com',
        phone='123-456-7890',
        address=Address(street='123 Main St', city='Springfield', state='IL', zip_code='62701', country='USA'),
    )
)


class SocialMedia(BaseModel):
    platform: str
    username: str
    url: HttpUrl


class Education(BaseModel):
    institution: str
    degree: str
    field_of_study: str
    start_date: date
    end_date: Optional[date]
    gpa: confloat(ge=0, le=4)


class WorkExperience(BaseModel):
    company: str
    position: str
    start_date: date
    end_date: Optional[date]
    responsibilities: List[str]
    achievements: List[str]


class Skill(BaseModel):
    name: str
    level: conint(ge=1, le=10)
    years_of_experience: float


class Project(BaseModel):
    name: str
    description: str
    start_date: date
    end_date: Optional[date]
    technologies: List[str]
    team_size: int
    role: str
    url: Optional[HttpUrl]


class Certificate(BaseModel):
    name: str
    issuer: str
    date_issued: date
    expiration_date: Optional[date]
    credential_id: str


class Profile(BaseModel):
    id: UUID
    username: str
    password: str
    first_name: str
    last_name: str
    date_of_birth: date
    gender: str
    nationality: str
    bio: str
    profile_picture: HttpUrl
    cover_photo: Optional[HttpUrl]
    status: Status
    created_at: datetime
    updated_at: datetime
    is_verified: bool
    contact: Contact
    social_media: List[SocialMedia]
    education: List[Education]
    work_experience: List[WorkExperience]
    skills: List[Skill]
    projects: List[Project]
    certificates: List[Certificate]
    languages: Dict[str, conint(ge=1, le=5)]
    interests: List[str]
    achievements: List[str]
    references: List[str]


class Company(BaseModel):
    id: UUID
    name: str
    industry: str
    founded_year: int
    description: str
    website: HttpUrl
    logo: HttpUrl
    headquarters: Address
    employees: conint(ge=1)
    revenue: Optional[float]
    status: Status
    contact: Contact
    social_media: List[SocialMedia]


class Job(BaseModel):
    id: UUID
    title: str
    company: Company
    location: Address
    job_type: str
    salary_range: Dict[str, float]
    description: str
    requirements: List[str]
    benefits: List[str]
    posted_date: date
    deadline: date
    is_remote: bool
    experience_level: str
    education_level: str
    skills_required: List[str]
    application_url: HttpUrl


class Application(BaseModel):
    id: UUID
    job: Job
    applicant: Profile
    resume: HttpUrl
    cover_letter: str
    applied_date: datetime
    status: Status
    interview_date: Optional[datetime]
    notes: Optional[str]
    score: Optional[float]


class Review(BaseModel):
    id: UUID
    reviewer: Profile
    reviewee: Profile
    job: Job
    rating: confloat(ge=0, le=5)
    comment: str
    date: datetime


class Message(BaseModel):
    id: UUID
    sender: Profile
    recipient: Profile
    content: str
    timestamp: datetime
    is_read: bool
    attachments: Optional[List[HttpUrl]]


class Notification(BaseModel):
    id: UUID
    user: Profile
    type: str
    content: str
    timestamp: datetime
    is_read: bool
    related_object: Optional[Dict[str, str]]


class Platform(BaseModel):
    name: str
    version: str
    users: List[Profile]
    companies: List[Company]
    jobs: List[Job]
    applications: List[Application]
    reviews: List[Review]
    messages: List[Message]
    notifications: List[Notification]
    active_users: conint(ge=0)
    total_job_postings: conint(ge=0)
    total_applications: conint(ge=0)
    average_rating: confloat(ge=0, le=5)
    last_updated: datetime


class PlatformFactory(ModelFactory[Platform]): ...


platform = PlatformFactory.build()
platform.users = platform.users * 50
platform.companies = platform.companies * 50
platform.jobs = platform.jobs * 50
platform.applications = platform.applications * 50
platform.reviews = platform.reviews * 50
platform.messages = platform.messages * 50
platform.notifications = platform.notifications * 50

dump = platform.model_dump(mode='json')
print(len(str(dump)))  # about 1 MB

with logfire.span('timer'):  # on the order of 1 second
    platform2 = Platform(**dump)
Uncommenting exclude={'Platform'} at the top makes this fast again. I recommend doing something similar for your response model.
There might be things we can do to make this faster, but ultimately there's a lot of processing involved in logging any large object, and the pydantic plugin logs both the input and the output. In addition, huge attributes like this get forcibly truncated anyway, so until we build more functionality there isn't much benefit in logging them.
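A related per-model knob that may help: pydantic v2 accepts a plugin_settings class argument, which logfire's plugin can read. A sketch, assuming a 'failure' record mode that logs only failed validations (the model name here is made up; check the logfire docs for the exact keys):

import logfire
from pydantic import BaseModel

logfire.configure(pydantic_plugin=logfire.PydanticPlugin(record='all'))

# Per-model override (assumed keys): record only failed validations of this
# model, so successful megabyte-sized payloads are never serialized into spans.
class BigResponse(BaseModel, plugin_settings={'logfire': {'record': 'failure'}}):
    payload: list[list[float]]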
Here is an example:
from typing import Annotated

from pydantic import BaseModel, Field
import orjson
import numpy as np

import logfire

logfire.configure(
    token="[redacted]",
    service_name="test_pydantic",
)


class TrackStreamResponse(BaseModel):
    latlon: Annotated[
        # list[tuple[float, float]],
        list[list[float]],
        Field(
            examples=[[[43.5, 5.2]]],
            default_factory=list,
            description="Position stream as list of latitude & longitude",
        ),
    ]


n = 40000

# Generate random latitude and longitude values
latitudes = np.random.uniform(-90, 90, n)
longitudes = np.random.uniform(-180, 180, n)

# Combine them into an n x 2 array
lat_lon_array = np.column_stack((latitudes, longitudes))
print("Array size: ", lat_lon_array.nbytes / 2**20, "MiB")

content = {"latlon": lat_lon_array}
json_ = orjson.dumps(
    content,
    option=orjson.OPT_NON_STR_KEYS | orjson.OPT_SERIALIZE_NUMPY,
)
print("JSON size: ", len(json_) / 2**20, "MiB")

with logfire.span('Validation span') as span:
    span.set_attribute("model.size", f"{len(json_) / 2**20:.6f} MiB")
    TrackStreamResponse.model_validate_json(json_)
Yes, I get that logging and exporting those attributes in traces is the bottleneck. I think this should be an option in the Pydantic plugin, since it depends on user data and this is a realistic real-world scenario. It might be a good idea to truncate, limit the size, or just exclude attributes based on some options.
Like I said, there is an option which I recommend: exclude TrackStreamResponse from the pydantic plugin.
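Concretely, that is the same exclude setting shown in the repro above, pointed at this model (a sketch; merge it into your existing configure call):

import logfire

logfire.configure(
    pydantic_plugin=logfire.PydanticPlugin(
        record='all',
        # Validation of TrackStreamResponse is no longer instrumented;
        # all other models keep their validation spans.
        exclude={'TrackStreamResponse'},
    )
)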
If I do that, I don't get any traces about pydantic validation for this model 😅. Thanks for your answer anyway; I will just use manual spans, that will be easier this way.
Thank you :)
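For reference, a manual-span version could look like this (a minimal sketch built from the logfire.span and set_attribute calls used above; the span name, attribute name, and helper function are illustrative):

from pydantic import BaseModel
import logfire

logfire.configure()

class TrackStreamResponse(BaseModel):
    latlon: list[list[float]] = []

def validate_response(json_: bytes) -> TrackStreamResponse:
    # Record cheap metadata about the payload instead of the payload itself.
    with logfire.span('validate TrackStreamResponse') as span:
        span.set_attribute('payload.size_mib', len(json_) / 2**20)
        return TrackStreamResponse.model_validate_json(json_)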
We should make it easier to skip parts of a model, or only send input or validated data, or just errors.
What would your preference be @jules-ch?
I might be biased here. I see several options:
- record only errors
- a kwarg include_input on the function

Anyway, it can easily be done with pydantic machinery. I think we need to be wary of any user data that can clog the OpenTelemetry exporter.

For truncation, OpenTelemetry SDKs come with this built in via the OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT env variable (it is infinite by default). Though in this case, it seems that the generation of the strings might be the bottleneck.
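To make the truncation option concrete, a sketch (the limit value is illustrative; the env variable must be set before the OpenTelemetry SDK is initialised):

import os

# OTel SDK built-in attribute truncation; infinite by default.
os.environ['OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT'] = '1024'

import logfire

logfire.configure()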
Another user wanting to exclude a specific field: https://pydanticlogfire.slack.com/archives/C06EDRBSAH3/p1729156611944299
Thanks, @alexmojaki - yep, that's me. It would be amazing if I could log my objects and define, at some level, the fields I want to exclude.
For now I am just going to log only the needed fields of an object on my end.
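That workaround might look like this (a minimal sketch; the Report model and its fields are made up for illustration):

from pydantic import BaseModel
import logfire

logfire.configure()

class Report(BaseModel):
    title: str
    owner: str
    body: str  # imagine this field is huge

report = Report(title='Q3 summary', owner='jules', body='...' * 100_000)
# Log only the fields you need instead of the whole object.
logfire.info('report created', title=report.title, owner=report.owner)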
Question
Hello, thanks for this package; I'm experimenting with instrumenting an API I'm developing.
I just noticed that when instrumenting an endpoint whose validation handles a large response (here 1.2 MB), validation takes 10x longer to process.
I have some Server-Timing headers to help me, plus OpenTelemetry traces.
With pydantic instrumentation: (screenshot)
Without: (screenshot)