Open gislerro opened 1 month ago
Humm, I think the behaviour we have now is fine.
I think it’s perfectly reasonable for 1.
to be interpreted as an invalid value and dropped (sp [1.1, 2.2, 3.
is interpreted as [1.1, 2,2]
), same as fal
is invalid (so [true, null, fal
is interpretted as [true, null]
) even though fal
could be part of a valid value when it completes.
or presumably 123e
will be dropped from partial parsing even though it could be the beginning of a value number.
I think the correct thing to do is drop any incomplete number.
The thing is that 10
is not similar to or a good proxy for 1024
in the way that "Samuel is a"
is a useful stand-in for "Samuel is aging"
.
The thing is that
10
is not similar to or a good proxy for1024
in the way that"Samuel is a"
is a useful stand-in for"Samuel is aging"
.
I believe 1.
is a good enough proxy for 1
and it's not the same as fal
- false
.
Let me explain:
1.1
character by character you get the following numbers: [1, undefined, 1.1]
false
-> [undefined, undefined, undefined, undefined, False]
so for floats there's a jump from defined to undefined and back, whereas for the boolean it stays undefined until the parsing succeedsFor some context, im using jiter
to parse a json stream generated by an LLM, the parsed dictionary is then dumped into a new pydantic model when the output from jiter
changes.
An assumption I'd like to make is that when jiter
parses a key for the first time- the key/field is then present in all subsequent models.
Of course I could enforce this myself by detecting when a parsed dictionary drops a key and then just readd the previous parse, but I thought it would be reasonable to add this to jiter
itself since it supports a partial-string
mode
so for floats there's a jump from defined to undefined and back, whereas for the boolean it stays undefined until the parsing succeeds
The point above is that we think it would be better to remove the jump by having the parse be undefined until we're sure the number is terminated. Agreed that the jump
is bad, and
An assumption I'd like to make is that when
jiter
parses a key for the first time- the key/field is then present in all subsequent models.
Yes, I think we agree with this too 👍
In our proposal, the parse would change to:
on: { -> {}
on: {" -> {}
on: {"f -> {}
on: {"fl -> {}
on: {"flo -> {}
on: {"floa -> {}
on: {"float -> {}
on: {"float" -> {}
on: {"float": -> {}
on: {"float": -> {}
on: {"float": 1 -> {}
on: {"float": 1. -> {}
on: {"float": 1.0 -> {}
on: {"float": 1.0} -> {'float': 1.0}
I've noticed that when iteratively parsing a json stream character by character 'incomplete' floats are 'stepped' over:
prints the following output:
I'd suggest to coerce
1.
into the float1.0
(or expose an option to enable that)