pluto / parser-attestor

Circuits for parsing, locking, and extracting from various widely-used formats including JSON and HTTP.

bug(json): re-enabling of `parsing_*` is possible #43

Open Autoparallel opened 2 months ago

Autoparallel commented 2 months ago

Idea

The current way we handle the parsing state does not adequately prevent re-enabling of the `parsing_*` state flags (currently just `parsing_string` and `parsing_number`). This means that invalid JSON can be provided and intended values can be obfuscated.

Example

Consider the (invalid) JSON:

{
    "k": "v" 123 "v",
}

At the moment, the parser will read up to the key `"k"` and, upon reading `:`, will write

[1,1]

to stack position 0.

We will then move through the bytes and toggle `parsing_string` on and off, then toggle `parsing_number` on and off, and finally see `parsing_string` toggle on and off once more.
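
To make the failure mode concrete, here is a rough Python sketch (not the actual circuit; the function and flag names are purely illustrative) of why plain boolean toggles are not enough: nothing stops the flags from being switched back on after the first value completes, so the invalid fragment above walks through without any constraint failing.

```python
def walk_with_toggles(fragment: str) -> bool:
    """Toy model: parsing_string / parsing_number are plain toggles."""
    parsing_string = False
    parsing_number = False
    for ch in fragment:
        if ch == '"':
            parsing_string = not parsing_string   # nothing forbids re-enabling
        elif ch.isdigit() and not parsing_string:
            parsing_number = True                 # nothing forbids re-enabling
        elif not parsing_string:
            parsing_number = False
    return True                                   # no constraint ever fails

assert walk_with_toggles('"v" 123 "v"')  # accepted, even though the JSON is invalid
```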

Solution

This can likely be solved by making these parsing states stack variables at a new index and adding constraints in the following way.

Consider now the sequence:

read `:`  --> [1,1,0,0]
read `"` --> [1,1,1,0]
read `"` --> [1,1,2,0]
read `1` --> FAIL
read `"` --> FAIL

In this case, we write a 1 into position 2 of the stack at height 0, indicating we have entered a string value. Upon reading the second `"`, we increment position 2 of the stack at height 0, giving [1,1,2,0], which indicates we have cleared the string value. At this point, we add constraints that disallow any further value parsing.
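
A minimal Python reference model of this transition (not the circuit; the 4-wide row layout [depth, in_value, str_ctr, num_ctr] and names are assumptions used only for illustration): slot 2 counts quote delimiters, so 0 means no string value yet, 1 means inside the string, and 2 means the string value is complete. Once it reaches 2, any further value byte fails, which is the "no re-enabling" constraint.

```python
def step_string_value(stack: list[int], ch: str) -> list[int]:
    depth, in_value, str_ctr, num_ctr = stack
    if ch == '"':
        if str_ctr >= 2:
            raise ValueError("FAIL: value already parsed, cannot re-enter a string")
        return [depth, in_value, str_ctr + 1, num_ctr]
    if ch.isdigit() and str_ctr >= 2:
        raise ValueError("FAIL: value already parsed, cannot start a number")
    return stack  # bytes inside the string, whitespace, etc. leave the row alone

stack = [1, 1, 0, 0]                        # state right after reading `:`
for ch in '""':
    stack = step_string_value(stack, ch)    # -> [1,1,1,0], then [1,1,2,0]
# step_string_value(stack, '1')             # would raise: FAIL
# step_string_value(stack, '"')             # would raise: FAIL
```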

The interaction with a comma would remain mostly unchanged in that we would require, for instance, seeing [1,1,2,0] before reading a comma, and then we'd see:

read `:`  --> [1,1,0,0]
read `"` --> [1,1,1,0]
read `"` --> [1,1,2,0]
read `,` --> [1,0,0,0]
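
In the same assumed reference model, the comma transition could be sketched like this: a comma is only legal once a value counter shows completion, and it resets the value-tracking slots so the next pair starts fresh (whether a comma should also be allowed to terminate an in-progress number directly is left out of this sketch).

```python
def step_comma(stack: list[int]) -> list[int]:
    depth, in_value, str_ctr, num_ctr = stack
    if in_value != 1 or (str_ctr != 2 and num_ctr != 2):
        raise ValueError("FAIL: comma before a completed value")
    return [depth, 0, 0, 0]   # next key starts fresh

assert step_comma([1, 1, 2, 0]) == [1, 0, 0, 0]
```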

Also, we can repeat this same process when parsing a key instead of a value, and enforce that the key is a string. E.g.,

read `{` --> [1,0,0,0]
read `"` --> [1,0,1,0]
read `"` --> [1,0,2,0]
read `:` --> [1,1,0,0]

For instance, if in the above explanation we had instead:

read `:`  --> [1,1,0,0]
read `1` --> [1,1,0,1]
read ` ` --> [1,1,0,2]
read `1` --> FAIL
read `"` --> FAIL

we do the same, using stack position 3 to track `parsing_number`, with a 2 written to this position indicating that the value is complete and that no further value types may be parsed.
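
The analogous number tracking, in the same hypothetical reference model: slot 3 goes to 1 on the first digit, to 2 once whitespace terminates the number, and any value byte after 2 fails, mirroring the string case above.

```python
def step_number_value(stack: list[int], ch: str) -> list[int]:
    depth, in_value, str_ctr, num_ctr = stack
    if num_ctr >= 2 and (ch.isdigit() or ch == '"'):
        raise ValueError("FAIL: value already parsed")
    if ch.isdigit():
        return [depth, in_value, str_ctr, max(num_ctr, 1)]   # start or continue the number
    if ch == ' ' and num_ctr == 1:
        return [depth, in_value, str_ctr, 2]                 # whitespace closes the number
    return stack

stack = [1, 1, 0, 0]                         # state right after reading `:`
for ch in "1 ":
    stack = step_number_value(stack, ch)     # -> [1,1,0,1], then [1,1,0,2]
# step_number_value(stack, '1')              # would raise: FAIL
# step_number_value(stack, '"')              # would raise: FAIL
```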

Future Work

Given #32 and #33, the type of implementation described here seems more pleasing and more conducive to preventing invalid JSON from passing through the parser. In those cases, we can add two more stack indices representing `parsing_null` and `parsing_bool`, which, for example, could go like so:

read `:` --> [1,1,0,0,0,0]
read `n` --> [1,1,0,0,1,0]
read `u` --> [1,1,0,0,2,0]
read `l` --> [1,1,0,0,3,0]
read `l` --> [1,1,0,0,4,0]

whereby reading any other non-whitespace ASCII after reaching [1,1,0,0,4,0] results in FAIL.

Differentiating true and false is easy: filter for [x,y,0,0,0,5], as only false can attain 5 in the final stack position here.
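
A sketch of how those extra slots could count literal characters (same hypothetical model; the 6-wide row and slot choice just follow the traces above): "null" and "true" top out at 4, "false" at 5, and any further non-whitespace byte after the literal completes fails, so filtering for a 5 in the bool slot singles out false.

```python
def step_literal(stack: list[int], ch: str, literal: str = "null", slot: int = 4) -> list[int]:
    counter = stack[slot]
    if counter == len(literal):
        if not ch.isspace():
            raise ValueError("FAIL: literal already parsed")
        return stack
    if ch != literal[counter]:
        raise ValueError(f"FAIL: unexpected byte while parsing {literal!r}")
    out = list(stack)
    out[slot] = counter + 1
    return out

stack = [1, 1, 0, 0, 0, 0]                   # state right after reading `:`
for ch in "null":
    stack = step_literal(stack, ch)          # null counter climbs 1, 2, 3, 4
# step_literal(stack, 'x')                   # would raise: FAIL
```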


Edit: 8/30/24

I think we can compress the stack quite a bit, actually. We need only a stack 3 wide if we do the following encoding:

These numbers cannot ever overlap with each other in nominal conditions, so we can save 3 × STACK_HEIGHT field elements. Likely this same sort of compression could be used elsewhere.

Autoparallel commented 2 months ago

Pinned this issue, as it is a rather invasive endeavor that we should decide upon before integrating too many other changes.

Autoparallel commented 2 months ago

@lonerapier I'm pinging you here because this has upstream effects on the fetcher/interpreter.

I can tackle these changes if need be, but maybe we should discuss it first.