pydantic / jiter

Fast iterable JSON parser.
https://crates.io/crates/jiter
MIT License
182 stars 11 forks source link

v2.7.0b1 `pydantic_core.from_json(..., allow_partial=True)` truncates partial strings completely #82

Closed willbakst closed 3 months ago

willbakst commented 5 months ago

Initial Checks

Description

In the below example, I would expect for the author field to still populate into the model as it's being streamed. For example, consider a description field where it might be on the longer side. This will only give me the description once it's fully formed and closed, which somewhat defeats the purpose of it being partial here for streaming.

Example Code

from pydantic_core import from_json

obj = from_json(
    '{"title": "Pride and Prejudice", "author": "Jane A',
    allow_partial=True,
)
print(obj)
# Actual: {'title': 'Pride and Prejudice'}
# Expected: {'title': 'Pride and Prejudice', 'author': 'Jane A'}

Python, Pydantic & OS Version

pydantic version: 2.7.0b1
pydantic-core version: 2.18.0
pydantic-core build: profile=release pgo=false
python version: 3.9.16 (main, Apr  2 2023, 22:08:02)  [Clang 14.0.0 (clang-1400.0.29.202)]
platform: macOS-14.2.1-arm64-arm-64bit
related packages: typing_extensions-4.10.0 fastapi-0.109.2 pydantic-extra-types-2.6.0 mypy-1.1.1 pydantic-settings-2.2.1
samuelcolvin commented 4 months ago

String parsing in JSON is extremely complex. We might be able to recover from some end of string situations, but it will require a lot of work.

PR welcome to try, but I think this is an optimisation that isn't required for 2.7.

I'm not even sure you'd always want partial strings to be included, for example:

Folyd commented 4 months ago

Same for me:

>>> from_json('{"message": "The **Chihuly Garden and Glass** is open at different times throughout the week. Here are', allow_partial=True)
{}
willbakst commented 4 months ago

Hmm, I see what you're saying, but ultimately I don't expect the parsed partial string to be "correct" until it's done streaming, and I want the partial string during streaming so that I can display the chunks as they stream.

In particular this leads to poor real-time behavior when working the LLM streaming, namely for really long values. For example, let's say I have the following model:

from pydantic import BaseModel, Field

class MyModel(BaseModel):
    description: str = Field(
        ...,
        description="A really long, verbose description about pydantic"
)

If I try to stream this output, I will have to wait for the entire output to be completed before I actually get anything parsed, ultimately defeating the purpose of partial parsing / streaming in this case at all.

If we detect an open string, can we not just parse it as a string as-is by closing it? Does this make it less complex? Or is this too simplified a way of handling things?

I can't really think of a case where I would want to allow for partial parsing outside of streaming where the desired behavior would be as originally described; however, we could always expose an additional boolean to let the user decide if they want the more or less strict partial parsing?

Would be interested in working on a PR for this once aligned on desired behavior.

samuelcolvin commented 3 months ago

This should be fixed in #101, @willbakst can you take try it out and let me know if it solves your problem.

willbakst commented 3 months ago

@samuelcolvin just took a look, the partial_mode='trailing-string' works as expected for my use case!

I'm assuming this will be a similar API change in pydantic-core / pydantic with a switch from allow_partial to partial_mode?

samuelcolvin commented 3 months ago

Yes, we'll probably avoid changing the name of the kwarg in pydantic-core to avoid breaking changes.

In retrospect maybe it was a mistake therefore to change them in jiter.

willbakst commented 3 months ago

Got it, makes sense.

Mostly unaffected by this since I'll just update to use jiter directly

samuelcolvin commented 3 months ago

Yup, I think pydantic_core.from_json is pretty redundant now, it's also about 30% slower than jiter due to the PGO test cases used.