ocaml-community / yojson

Low-level JSON parsing and pretty-printing library for OCaml
https://ocaml-community.github.io/yojson/
BSD 3-Clause "New" or "Revised" License
326 stars 58 forks source link

[RFC] Support for JSON5 parsing #106

Closed Leonidas-from-XIV closed 3 months ago

Leonidas-from-XIV commented 4 years ago

Current state & Rationale

Yojson currently supports JSON parsing as defined in RFC 8259 with a couple of extensions: variants, tuples and comments (the latter not documented).

JSON is a format that is derived from JavaScript and since JSON was created a lot of complaints have been raised that the format is suboptimal in some ways and could be improved, amongst these are

@c-cube mentioned that a lot of these assorted sets of improvements were bundled together into a superset of JSON called JSON5.

Proposal

This RFC suggests implementing a parser for JSON5. The advantage of implementing JSON5 is that this format is basically updating JSON to JSON as it is specified in the ES5, thus a commonly accepted standard and not just a loose set of extensions to JSON.

The parser could work completely transparently and just return one of the existing JSON variants, like Safe.t since JSON5 can be represented in standard-JSON as well. As such they would use the same pretty-printer. It would also avoid creating yet another variant.

This is all in accordance to Postel's law as Yojson would be able to accept JSON5 while being conservative and only output widely-accepted JSON.

Impact

Depending on how it is decided Yojson could become a pure JSON5 parser, thus all the variants would opt in to use the JSON5 parser, so some non-standard JSON that was not parseable before could become parseable. Alternatively, an opt-in approach is also possible where strict RFC 8259 compliance is necessary.

Roadmap

Could happen any time after 2.0.0, though a decision on #104 would be useful to know whether these extensions would be part of a JSON5 parser or not.

dhilst commented 2 years ago

Hello, I would like to help. I guess we want to extend the current parser to support JSON5 right? I will take a look at the code

c-cube commented 2 years ago

yeah, I think the idea would be to have a separate AST (and parser?) for JSON5 :). The AST is richer because there's things like datetimes, if I recall correctly.

dhilst commented 2 years ago

Oh okay, I will try it and open a PR when get something

Leonidas-from-XIV commented 2 years ago

Yes, you'd need a separate AST. And last time I looked at it we used a separate parser because for lexing it made a lot of sense to use sedlex as you need to match on all the character classes that the spec wants you to support: https://github.com/ocaml-community/yojson/compare/master...issuu:yojson:json5-parser-implementation

dhilst commented 2 years ago

I'm tagging my self so I got a easy link to this @dhilst

dhilst commented 2 years ago

I see that @issuu started to work on this, should I get from where issuu stopped?

Leonidas-from-XIV commented 2 years ago

@dhilst It is probably a decent point to start but I am not sure how far they got.

mjambon commented 2 years ago

I don't think we necessarily need a new AST type for JSON5. The short example showing the JSON5 extensions from https://json5.org/ is:

{
  // comments
  unquoted: 'and you can quote me on that',
  singleQuotes: 'I can use "double quotes" here',
  lineBreaks: "Look, Mom! \
No \\n's!",
  hexadecimal: 0xdecaf,
  leadingDecimalPoint: .8675309, andTrailing: 8675309.,
  positiveSign: +1,
  trailingComma: 'in objects', andIn: ['arrays',],
  "backwardsCompatible": "with JSON",
}

As with legacy JSON, it's not clear what precision must be supported on numbers. It would be good to take a look at what the JavaScript implementations do with hexadecimal literals, since the default number representation uses 64-bit floats, with only 52 bits usable for integers. The OCaml world isn't much better since the int type accommodates either 31 or 63 bytes depending on the platform. I suspect this makes hexadecimal constants nearly unusable in practice, so we might want to introduce a new Hex variant (Hex of int64?), much like we have already Int and Float rather than just Number.

c-cube commented 2 years ago

The hexadecimal floats are well supported in OCaml:

# 0xdecaf. ;;
- : float = 912559.

There's even a notation to go with it for hex exponents!

# 0xdecafp42;;
- : float = 4.01347692612655514e+18

But imho, the potential problem with json5 is that we may want to preserve comments (to parse a config file + reprint it). Perhaps that's too ambitious. If that's out of scope then the current AST is fine.

dhilst commented 2 years ago

Can we extend the current AST adding comments? It's not an API breaking change it is it?

c-cube commented 2 years ago

Modifying the main internal AST, perhaps not. If you modify one of the main types (like Yojson.Safe.t or Yojson.Basic.t it is most definitely a big breaking change. Lots of libraries (and code) explicitly pattern match on it :)

dhilst commented 2 years ago

I see, yeah so it would be better to have a new AST, it should be simple to write a function from new AST to the old, it will not be total though

dhilst commented 2 years ago

Hello I have a question, I was trying to preserve the comments when I run into this case {/*foo*/}, what would be the AST for this case? (I created a new type for the AST adding `Comment of string into it to preserve the comments).

A comment may appear in these positions inside an object and an array, (the array case is easy to handle I don't have to modify the list constructor, but for the object (and the `Assoc constructor) I don't have a good idea, generalizing it to keep the comments (in their original position) would make the user need to handle a comment in every position it seems annoying to me


{ /* comment */ foo /* comment */ : /* comment */ "bar"  /* comment */} // comment
[/* this is the easy case */]
dhilst commented 2 years ago

I think we should drop the comments from the AST by now and use the old AST as this would simplify most of the changes

Leonidas-from-XIV commented 2 years ago

I agree that there is no good way to have it in the current AST without making it an extremely convoluted structure.

mjambon commented 2 years ago

If the location of the JSON nodes was kept around, we could collect the comments in a separate list together with their location. Then, a hypothetical pretty-printer could choose to reinsert comments at the correct location. The AST type would look like this:

type t = [ 
| `Null of loc
| `Bool of (loc * bool)
| `Int of (loc * int)
| `Intlit of (loc * string)
| `Float of (loc * float)
| `String of (loc * string)
| `Assoc of (loc * (string * t) list)
| `List of (loc * t list)
]

type full = {
  json : t;
  comments : (loc * string) list;
}

The type of a parsing function would be something like this:

val from_string : string -> full

Of course, this comes with compatibility and performance issues due to the introduction of the locations.

dhilst commented 2 years ago

I would like to keep the old AST to simplify the PR diff, so we can focus on having parsing JSON5, then we can add other things on top of the new parser. I have some tests passing, I'm doing minor tweaks (like not preserving the quotes in the strings), also I was thinking about supporting the Safe interface first and then adding the Raw and Basic (I'm using the Basic for now)

A passing test example

    let expected = `Assoc [
      ("unquoted", `String "'and you can quote me on that'");
      ("singleQuotes", `String "'I can use \"double quotes\" here'");
      ("lineBreaks", `String ({|"Look, Mom! \
No \\n's!"|}));
      ("hexadecimal", `Int 0xdecaf);
      ("leadingDecimalPoint", `Float 0.8675309);
      ("andTrailing", `Float 8675309.0);
      ("positiveSign", `Int 1);
      ("trailingComma", `String "'in objects'");
      ("andIn", `List [`String "'arrays'"]);
      ("\"backwardsCompatible\"", `String "\"with JSON\"");
    ] in
    Alcotest.(check yojson_json5) "More elaborated" expected (Yojson_json5.Basic.from_string {|{
  // comments
  unquoted: 'and you can quote me on that',
  singleQuotes: 'I can use "double quotes" here',
 lineBreaks: "Look, Mom! \
No \\n's!",
  hexadecimal: 0xdecaf,
  leadingDecimalPoint: .8675309, andTrailing: 8675309.,
  positiveSign: +1,
  trailingComma: 'in objects', andIn: ['arrays',],
  "backwardsCompatible": "with JSON",
}|})
cemerick commented 1 year ago

Of course, this comes with compatibility and performance issues due to the introduction of the locations.

I have programs that ingest a lot of yaml, and have my own model for that that retains a bunch of yaml-native metadata (scalar and block styles, source file locations, etc). However, I also need to interop with lots of existing code that expects yojson, and to report source locations when things go wrong.

The way I've moved forward is to have a to_yojson function that produces yojson from my yaml model, plus a path-based lookup structure so that, if something goes wrong in e.g. a ppx-generated of_yojson function, I can recover the original yaml source location for that snippet.

A similar approach might be a reasonable tradeoff for yojson to maintain its own location metadata. The core model could then remain unchanged, with a parallel set of parsing entry points available that returns the usual yojson value, plus a metadata structure. Granted, metadata-related usage would be more clumsy than if it were rolled into the core type, but this approach would preserve compatibility and leave only those interested in that metadata paying for the additional complexity.

Leonidas-from-XIV commented 3 months ago

Fixed by #152.