nim-lang / RFCs

A repository for your Nim proposals.
137 stars 23 forks source link

Next Gen direct JSON deserializer for the stdlib #384

Closed planetis-m closed 3 years ago

planetis-m commented 3 years ago

Moving away from black or white solutions, where a user must choose either speed or correctness, we can set the ambitious goal to adhere to the JSON standard, support Nim's language constructs and maintain good performance. It is actually entirely within our reach.

Duplicate keys

First things first, is handling duplicate keys in JSON.

Consider the JSON object {"shared":1,"kind":"Apple","apple":"world","kind":"Banana"} as an example.

With the current stdlib design, a Table[string, Option[JsonNode]] is needed to track which key is already encountered (it is not implemented). This is expensive in regards to memory and performance.

A direct deserializer has the knowledge of all the valid object fields and can generate optimal code like this:

type
  Bar = object
    shared: int
    kind: Fruit
    bad: float
    banana: int
    apple: string

var
  dst: Bar
  onceBanana = false
  ... # more bools for every key
eat(p, tkCurlyLe)
while p.tok != tkCurlyRi:
  if p.tok != tkString:
    raiseParseErr(p, "string literal as key")
  case p.a
  of "banana":
    discard getTok(p)
    eat(p, tkColon)
    if not onceBanana:
      onceBanana = true
      initFromJson(dst.banana, p)
    else:
      when defined(emiDuplicateKey):
        skipJson(p)
      else: raiseParseErr(p, "duplicate object field")
  ... # More code to handle all keys
  else:
    raiseParseErr(p, "valid object field")
  if p.tok != tkComma:
    break
  discard getTok(p)
eat(p, tkCurlyRi)

The user can choose either to skip or error on duplicate keys, both spec compliant behaviors. A working benchmark shows this technique achieves about 10% better performance on JSON objects without duplicate keys. On objects with duplicate keys (not a common case) it would vary a lot but it would be even faster.

Object variants

Another thorn of high performance JSON deserializers is object variants. My current implementation for example, expects the discriminator field first else it errors. Another possible solution is backtracking, implemented by jsony and nim-json-serialization. It results in unpredictable performance, so it should be avoided IMO.

To efficiently support them, we can build on top of the previous solution. The algorithm is as follows. Every time we encounter a field belonging to another variant we check if we should error or make a transition. If any field, that does not exist in the next variant, is already encountered, throw an error. Else continue, coping the common fields over to the new object. This way we dont have to maintain a temporary for each and every field as that would blow our space requirements.

The linked example is as complex as it needs to be to prove its feasibility. It still maintains about the same performance edge over the simplest solution of expecting the discriminator field first.

Runtime types

The problem. Consider this object hierarchy:

type
  Component = ref object of RootObj
  Move = ref object of Component
    speed: float32
  Transform = ref object of Component
    rotation: float32
var
  output: seq[Component]

JSON would look like [{"speed":20},{"rotation":0.0}]. A deserializer processing this input has no idea about the runtime type of each array element thus it either errors eminim or results on data loss jsony.

Following the same logic as before this example can also be supported. I only have a rough outline of my idea right now, but I can prepare a POC.

Writing custom deserializers

This is also problematic because the user is entirely on their own and it's more complex than say https://github.com/nim-lang/RFCs/issues/247 A lot can go wrong especially in regards to the order of keys encountered. This proposal does not cover this use case.

How do others make it?

Rust has invented its own format that can work with sum types. Basically an object is prefixed by its type, like Banana{}. I am not aware of others.

All the logic is figured out, all is needed an implementation. Please consider funding my work.

Lastly a note about proposal https://github.com/nim-lang/RFCs/issues/368. It will complicate significantly the construction of this macro and any other macro that works on object variants.

Araq commented 3 years ago

Lastly a note about proposal #368. It will complicate significantly the construction of this macro and any other macro that works on object variants.

Very interesting!

planetis-m commented 3 years ago

first idea is good, rest are overcomplicated.