Closed Leonidas-from-XIV closed 3 months ago
Hello, I would like to help. I guess we want to extend the current parser to support JSON5 right? I will take a look at the code
yeah, I think the idea would be to have a separate AST (and parser?) for JSON5 :). The AST is richer because there's things like datetimes, if I recall correctly.
Oh okay, I will try it and open a PR when get something
Yes, you'd need a separate AST. And last time I looked at it we used a separate parser because for lexing it made a lot of sense to use sedlex
as you need to match on all the character classes that the spec wants you to support: https://github.com/ocaml-community/yojson/compare/master...issuu:yojson:json5-parser-implementation
I'm tagging my self so I got a easy link to this @dhilst
I see that @issuu started to work on this, should I get from where issuu stopped?
@dhilst It is probably a decent point to start but I am not sure how far they got.
I don't think we necessarily need a new AST type for JSON5. The short example showing the JSON5 extensions from https://json5.org/ is:
{
// comments
unquoted: 'and you can quote me on that',
singleQuotes: 'I can use "double quotes" here',
lineBreaks: "Look, Mom! \
No \\n's!",
hexadecimal: 0xdecaf,
leadingDecimalPoint: .8675309, andTrailing: 8675309.,
positiveSign: +1,
trailingComma: 'in objects', andIn: ['arrays',],
"backwardsCompatible": "with JSON",
}
As with legacy JSON, it's not clear what precision must be supported on numbers. It would be good to take a look at what the JavaScript implementations do with hexadecimal literals, since the default number representation uses 64-bit floats, with only 52 bits usable for integers. The OCaml world isn't much better since the int
type accommodates either 31 or 63 bytes depending on the platform. I suspect this makes hexadecimal constants nearly unusable in practice, so we might want to introduce a new Hex
variant (Hex of int64
?), much like we have already Int
and Float
rather than just Number
.
The hexadecimal floats are well supported in OCaml:
# 0xdecaf. ;;
- : float = 912559.
There's even a notation to go with it for hex exponents!
# 0xdecafp42;;
- : float = 4.01347692612655514e+18
But imho, the potential problem with json5 is that we may want to preserve comments (to parse a config file + reprint it). Perhaps that's too ambitious. If that's out of scope then the current AST is fine.
Can we extend the current AST adding comments? It's not an API breaking change it is it?
Modifying the main internal AST, perhaps not. If you modify one of the main types (like Yojson.Safe.t
or Yojson.Basic.t
it is most definitely a big breaking change. Lots of libraries (and code) explicitly pattern match on it :)
I see, yeah so it would be better to have a new AST, it should be simple to write a function from new AST to the old, it will not be total though
Hello I have a question, I was trying to preserve the comments when I run into this case {/*foo*/}
, what would be the AST for this case? (I created a new type for the AST adding `Comment of string
into it to preserve the comments).
A comment may appear in these positions inside an object and an array, (the array case is easy to handle I don't have to modify the list constructor, but for the object (and the `Assoc
constructor) I don't have a good idea, generalizing it to keep the comments (in their original position) would make the user need to handle a comment in every position it seems annoying to me
{ /* comment */ foo /* comment */ : /* comment */ "bar" /* comment */} // comment
[/* this is the easy case */]
I think we should drop the comments from the AST by now and use the old AST as this would simplify most of the changes
I agree that there is no good way to have it in the current AST without making it an extremely convoluted structure.
If the location of the JSON nodes was kept around, we could collect the comments in a separate list together with their location. Then, a hypothetical pretty-printer could choose to reinsert comments at the correct location. The AST type would look like this:
type t = [
| `Null of loc
| `Bool of (loc * bool)
| `Int of (loc * int)
| `Intlit of (loc * string)
| `Float of (loc * float)
| `String of (loc * string)
| `Assoc of (loc * (string * t) list)
| `List of (loc * t list)
]
type full = {
json : t;
comments : (loc * string) list;
}
The type of a parsing function would be something like this:
val from_string : string -> full
Of course, this comes with compatibility and performance issues due to the introduction of the locations.
I would like to keep the old AST to simplify the PR diff, so we can focus on having parsing JSON5, then we can add other things on top of the new parser. I have some tests passing, I'm doing minor tweaks (like not preserving the quotes in the strings), also I was thinking about supporting the Safe
interface first and then adding the Raw
and Basic
(I'm using the Basic
for now)
A passing test example
let expected = `Assoc [
("unquoted", `String "'and you can quote me on that'");
("singleQuotes", `String "'I can use \"double quotes\" here'");
("lineBreaks", `String ({|"Look, Mom! \
No \\n's!"|}));
("hexadecimal", `Int 0xdecaf);
("leadingDecimalPoint", `Float 0.8675309);
("andTrailing", `Float 8675309.0);
("positiveSign", `Int 1);
("trailingComma", `String "'in objects'");
("andIn", `List [`String "'arrays'"]);
("\"backwardsCompatible\"", `String "\"with JSON\"");
] in
Alcotest.(check yojson_json5) "More elaborated" expected (Yojson_json5.Basic.from_string {|{
// comments
unquoted: 'and you can quote me on that',
singleQuotes: 'I can use "double quotes" here',
lineBreaks: "Look, Mom! \
No \\n's!",
hexadecimal: 0xdecaf,
leadingDecimalPoint: .8675309, andTrailing: 8675309.,
positiveSign: +1,
trailingComma: 'in objects', andIn: ['arrays',],
"backwardsCompatible": "with JSON",
}|})
Of course, this comes with compatibility and performance issues due to the introduction of the locations.
I have programs that ingest a lot of yaml, and have my own model for that that retains a bunch of yaml-native metadata (scalar and block styles, source file locations, etc). However, I also need to interop with lots of existing code that expects yojson, and to report source locations when things go wrong.
The way I've moved forward is to have a to_yojson
function that produces yojson from my yaml model, plus a path-based lookup structure so that, if something goes wrong in e.g. a ppx-generated of_yojson
function, I can recover the original yaml source location for that snippet.
A similar approach might be a reasonable tradeoff for yojson to maintain its own location metadata. The core model could then remain unchanged, with a parallel set of parsing entry points available that returns the usual yojson value, plus a metadata structure. Granted, metadata-related usage would be more clumsy than if it were rolled into the core type, but this approach would preserve compatibility and leave only those interested in that metadata paying for the additional complexity.
Fixed by #152.
Current state & Rationale
Yojson currently supports JSON parsing as defined in RFC 8259 with a couple of extensions: variants, tuples and comments (the latter not documented).
JSON is a format that is derived from JavaScript and since JSON was created a lot of complaints have been raised that the format is suboptimal in some ways and could be improved, amongst these are
@c-cube mentioned that a lot of these assorted sets of improvements were bundled together into a superset of JSON called JSON5.
Proposal
This RFC suggests implementing a parser for JSON5. The advantage of implementing JSON5 is that this format is basically updating JSON to JSON as it is specified in the ES5, thus a commonly accepted standard and not just a loose set of extensions to JSON.
The parser could work completely transparently and just return one of the existing JSON variants, like
Safe.t
since JSON5 can be represented in standard-JSON as well. As such they would use the same pretty-printer. It would also avoid creating yet another variant.This is all in accordance to Postel's law as Yojson would be able to accept JSON5 while being conservative and only output widely-accepted JSON.
Impact
Depending on how it is decided Yojson could become a pure JSON5 parser, thus all the variants would opt in to use the JSON5 parser, so some non-standard JSON that was not parseable before could become parseable. Alternatively, an opt-in approach is also possible where strict RFC 8259 compliance is necessary.
Roadmap
Could happen any time after 2.0.0, though a decision on #104 would be useful to know whether these extensions would be part of a JSON5 parser or not.