Open Autoparallel opened 2 months ago
Pinned issue as this is a rather invasive endeavor we should decide upon before integrating to many other changes.
@lonerapier i'm pinging you here because this has upstream effects into the fetcher/interpreter.
I can tackle these changes if need be, but maybe we discuss
Idea
The current way we handle the parsing state does not adequately prevent re-enabling of the state flags
parsing_*
(currently justparsing_string
andparsing_number
). This means that invalid JSON can be provided and intended values can be obfuscated.Example
Consider the (invalid) JSON:
At the moment, the parser will read up to the key
"k"
and upon reading:
will writeto stack position 0.
We will then move through bytes and toggle:
parsing_string
on and off, then toggleparsing_number
on then off, and finally again withparsing_string
we see it toggle.Solution
This can likely be solved by having these parsing states be stack variables in a new index and adding constraints in the following way.
Consider now the sequence:
In this case, we write the additional
1
into position 2 of the stack at height 0 indicating we enter a string value. Upon reading the second"
we increment position 2 of the stack at height 0 to[1,1,2,0]
which indicates we have cleared the string value. At this point, we constrain to not allow any further value parsing.The interaction with a comma would remain mostly unchanged in that we would require, for instance, seeing
[1,1,2,0]
before reading a comma and then we'd see:Also, we can repeat this same process for when we are parsing a key instead of a value, and enforce the key is a string. E.g.,
For instance, in the above explanation if we had instead:
we do the same, using stack position 3 to track
parsing_number
and a 2 written to this position to indicate clearing that value and allowing for no more value-types to be parsed.Future Work
Given #32 and #33 this type of implementation described seems more pleasing and conducive to reducing passing invalid JSON through the parser. In those cases, we can add two more stack indices representing
parsing_null
andparsing_bool
which, for example could go like so:whereby reading any other non-whitespace ASCII after achieving
[1,1,0,0,4,0]
results inFAIL
.Easy differentiation in
true
andfalse
could be handled upon filtering for[x,y,0,0,0,5]
as onlyfalse
can attain 5 in the final stack position here.Edit: 8/30/24
I think we can compress the stack quite a bit, actually. We need only a stack 3 wide if we do the following encoding:
parsing_string
-->[..,..,1]
and[..,..,2]
parsing_number
-->[..,..,10]
and[..,..,20]
parsing_null
-->[..,..,100]
, ...[..,..,400]
parsing_true
" -->[..,..,1000]
,...[..,..,4000]
,parsing_false
" -->[..,..,10000]
,...[..,..,50000]
,These numbers cannot ever overlap with each other in nominal conditions, so we can save
3 x STACK_HEIGHT
numbers of field elements. Likely this same sort of compression could be used elsewhere.