ocaml-community / yojson

Low-level JSON parsing and pretty-printing library for OCaml
https://ocaml-community.github.io/yojson/
BSD 3-Clause "New" or "Revised" License
329 stars 58 forks source link

Parsing locations #159

Open ejgallego opened 1 year ago

ejgallego commented 1 year ago

Dear yojson devs,

I'm reading some JSON data that I then analize, however, as to emit correct source locations for the analysis I'd need that the parsed JSON would contain locations on the original input file / buffer.

Is that something possible to do with yojson?

TIA!

mjambon commented 1 year ago

Historically, yojson was focused on speed and that's why the tree contains the data and nothing else. Getting a proper AST with locations and more is useful for some applications. Having this in yojson is up to the maintainers but it would be an addition that doesn't have much in common with the existing implementation.

With some extra preprocessor hacking, we may be able to reuse the existing lexer which currently uses a hack to disable location tracking to improve performance. I'm not sure what others think.

There's also the YAML path, which would consist in finding a YAML parser that returns a precise AST and has an option to restrict itself to JSON (modern YAML being a strict superset of JSON IIRC). I don't know much about the possibilities of ocaml-yaml but it may be worth looking into it.

ejgallego commented 1 year ago

Thanks for the feedback @mjambon , indeed location tracking can decrease efficiently greatly, good point.

I guess it makes sense to keep this issue open to see if a solution could be found, for now, for my own hacking purposes I derived a JSON parser from RWC that does indeed parse with locations, sharing it here in case could be useful to others:

https://github.com/ejgallego/coq-lsp/pull/306

abishekvashok commented 1 year ago

Hey, yes, having location can greatly help my cause as well. Would appreciate adding that capability like a switch if possible. I guess all it needs is to uncomment a couple of lines in read.mll ?

I will be happy to make a patch if the maintainers are receptive of it :)

Leonidas-from-XIV commented 1 year ago

The problem is how to represent this in the type. Can't really change the AST without breaking every single user of Yojson, so a new AST would need to be created.

In such case, it would be useful to revisit the structure of the AST as it is now anyway (see #158), to not have to do this in the future… this turns into a whole rabbit hole of issues so it's not particularly easy to tackle I would say without creating a lot of issues and questions as @mjambon mentions.

abishekvashok commented 1 year ago

Can't we reuse the existing lexer, https://github.com/ocaml-community/yojson/blob/cf2f13097265b2644686bc83183e227219398ac7/lib/read.mll#L2-L23

I don't know what all needs to be done after uncommenting those lines, and don't no whether conditional include or overriding is permitted in Ocaml so as to make this a switch.

At least some sort of support, like to have line numbers while matched, would be cool

Leonidas-from-XIV commented 1 year ago

You probably can, as the locations are in there so that's not the problem. The main issue is how to attach line numbers to the AST.

You could use a similar approach as opam-file-format, wrapping the current AST into with_pos, but this adds a lot of intermediate records and allocations:

https://github.com/ocaml/opam-file-format/blob/303ab85afb67c7c22ea548b87241b08616b9f6f2/src/opamParserTypes.ml#L59-L62

gcluzel commented 7 months ago

Having locations embedded directly within the AST would really be beneficial in scenarios where the JSON file originates directly from a user. In such cases, prioritizing parsing efficiency alone seems inefficient when compared to the enhanced utility that could be gained from parsing locations, allowing for more informative messages to be displayed to the user.

What about creating new ASTs that store the locations directly in the nodes, more or less like that:

type pos = Lexing.position * Lexing.position
type t = [
  | `Null of pos
  | `Int of int * pos
  | `Assoc of (string * t * pos) list * pos
  ... ]

Would it make sense?

Leonidas-from-XIV commented 7 months ago

Such an AST would work (though I'd probably define the type as (t * pos) to not have to write pos on every constructor), the issue is that it is a completely new AST so it would be yet another type that very few libraries use while requiring implementation and maintenance work for a comparatively small improvement.

Given the amount of active maintainers in this project is very low and the demand is sporadic I have a hard time seeing this move forward anytime soon.

ejgallego commented 7 months ago

I guess more than seeing the improvement as "small", it is more in the category of "critical for some users" vs "non important for others", for example if you need to parse user-provided JSON , locations are certainly a must if you need to provide feedback.

Maybe indeed such parsing library would make sense as an independent project? I'm happy to fork the lib I wrote for coq-lsp and make it into a (Yo)json parser that is better suited for interactive use (error recovery, locations, etc...)

Leonidas-from-XIV commented 7 months ago

Agreed, you do make a good point. It's probably better to think of it an inessential improvement for current users of Yojson that e.g. value API stability, interop and speed vs. an essential improvement for people who need the error reporting.

The problem I see here is that extending the current parser to save locations would make parsing slower due to a lot more allocations and making that conditional would make the parser a good deal harder to maintain. Creating another parser solves this, but creates another issue of compliance - we probably don't want to have a parser with locations that parses data one way and a parser without locations that parses the same data another way.

However, if someone would want to step up as a co-maintainer to make sure a parser with locations and an extended AST that is identical to, say Yojson.Safe.t with the locations stripped, I think that could be a good compromise and even allows people to use the parsed JSON data structure in libraries that requires Yojson.Safe.t. With e.g. QCheck we have the tooling for it, so it's more a question of developer-time.

ejgallego commented 7 months ago

Indeed!

While I'm not expert on parsing, it seems to me that a parser geared towards user input is very different than a parser geared towards de-serialization. In particular locations are not enough, you also want to recover properly which makes the parser harder (tho easy to do with modern tools like the amazing menhir)

So that would be a different project, maybe in a different repository.

I also agree that testing for compliance would be important, but I think it should be easy to write a test driver that parses the test-cases twice and indeed check they are identical to the yojson reference parser.