plokhotnyuk / jsoniter-scala

Scala macros for compile-time generation of safe and ultra-fast JSON codecs + circe booster
MIT License
747 stars 98 forks source link

Integrating Jsoniter with Tapir-Pickler #1098

Open kciesielski opened 1 year ago

kciesielski commented 1 year ago

I'm exploring the integration of Jsoniter with tapir-pickler, aiming to streamline the derivation of JSON codecs and schemas for Tapir users through a unified method. Currently, Tapir requires separate derivation and customization of schemas, with manual alignment needed between these schemas and JSON codecs. This process often involves customizing field names, encoding for coproducts or enums, etc.

Our current setup uses Scala 3 and uPickle as the primary codec library. However, uPickle's design makes the code quite messy and hacky, it also hinders our ability to implement some desired features at all. We're considering Jsoniter, its CodecMakerConfig would make some parts of tapir-pickler much much simpler and more maintainable.

We have a few specific issues we'd like to address with Jsoniter:

Inline Configuration CodecMakerConfig requires inline parameters, restricting dynamic configuration. We aim to enable users to create custom field name mappings outside of case classes. Tapir supports any SName -> String function, but Jsoniter's inline configuration demands compile-time known values, creating a mismatch with our dynamic code generation approach.

Nested name mappings It would be also great to allow field name mapping for nested structures, like person.name, but not root name.

Default values Similar to field name mapping, setting default values through configuration would be needed. Currently we need to support Tapir @default annotation on case class fields, but we'd also like to allow users to define such mappings outside of model classes, which sometimes need to stay untouched.

Discriminator value Tapir schema configuration allows customizing discriminator values also using class field values. With Pickler this would look like:

val pPerson = Pickler.derived[Person]
val pOrganization = Pickler.derived[Organization]
val pickler: Pickler[Entity] =
  Pickler.oneOfUsingField[Entity, String](_.kind, _.toString)
    ("person" -> pPerson, "org" -> pOrganization)

(this can probably be simplified to val pickler: Pickler[Entity] = Pickler.oneOfUsingField[Entity, String](_.kind, _.toString) with some extra restrictions, but, but I haven't explored that yet). Jsoniter's adtLeafClassNameMapper is a String=>String function. Due to inline nature of the configuration mentioned earlier it can't be more dynamic, so it's another feature we can't implement now.

If I understand correctly, the inline configuration may be crucial to the idea of Jsoniter being a macro that uses as much compile-time data as possible to generate "hardcoded" values where possible and achieve great performance. Maybe there's a way to reconcile that we the ambitions of tapir-pickler, so I'd really appreciate all the feedback we can get on mentioned points. If needed, I can split it into more fine-grained tickets, but let's start with a general discussion first.

plokhotnyuk commented 1 year ago

As I understand tapir-pickler in early development and using stage, and is going to evolve a lot in the near future while using Scala 3 only.

On the other side jsoniter-scala-macros is a matured it terms of user features and performance of implementations, and it won't change anything radically keeping binary and source compatibility for Scala 2 and Scala 3.

I think that a good option would be just borrowing the implementation of codec generation from jsoniter-scala-macros to tapir-pickler and make own derivation without dependency on jsoniter-scala-macros.

As an example, on initial stage of development jsoniter-scala was considered as a replacement for jackson-scala-module and it was picked to encode type discriminators as separated fields by default. But JSON's ability to place the type field in the end of object leads to redundant overhead in runtime for double scanning of the parsed object labeled by type field. The runtime overhead is getting worse in case of nested objects with such discriminators, and for recursive data structures (which are themself could be sources of stack overflow exceptions) it is considered as a security vulnerability over DoS attacks in case of untrusted input allowed. That is why in the default derivation configuration of jsoniter-scala-macros it is disabled to use recursive data structures and to parse type discriminator field in the non-first position.

The tapir-pickler can solve similar kind of issues differently (as an example by providing an input length limit), so it's features would evolve easily without stuck on jsoniter-scala-macros limitations.

adamw commented 1 year ago

@plokhotnyuk thanks for the answer - really interesting to learn about some of the design choices that you've made in jsoniter. Despite being users of various JSON libraries in Scala, I think now that we're approaching the "other side", we're often learning of various choices and rationale their authors made to make things safe & efficient (the discriminator & first position being a great example of something I didn't really consider before, I suppose it might be similar for @kciesielski).

So on one hand essentially forking jsoniter-scala-macros would give us great flexibility (and that's definitely an option, especially since you suggested it), on the other we'd like to leverage as much of the ongoing expertise that is going into a library such as jsoniter - so we're trying to reuse as much as possible.

That said, if we wanted to develop a more dynamically-configured version of jsoniter, can you see any negative implications in terms of safety or efficiency?

In particular, as @kciesielski mentioned, we'd like to dynamically configure the field name mappings, default values etc.

Also, as pickler currently uses the recursive-derivation approach (that is, a pickler for a case class is assembled from implicitly looked up picklers for all child fields - which in turn, might trigger derivation, in "generic-auto" mode), do you think this would be feasible when using jsoniter? I'm not sure what kind of benefits jsoniter gets from deriving the whole tree in one macro invocation.

(We might also switch to the jsoniter model, as you noted, pickler is in an experimental stage, but we'd like to consider our options before doing that.)

plokhotnyuk commented 1 year ago

The smithy4s-json project uses a kind of derivation of codecs in runtime (see implicit def deriveJsonCodec[A: Schema]: JsonCodec[A] of smithy4s.json.Json).

They have requirements to have JSONPath-like reporting of errors and mapping names for fields. It has big impact on performance due to redundant allocations for field name strings and auxiliary lists that track current position of mapping. Also, it costs an additional latency on startup of services to interpret Schema mappings and derive codecs.

Please, see results of benchmarks that compare jsoniter-scala-macros vs. smithy4s-json for real-world message samples like TwitterAPI, GeoJSON, OpenRTB, GoogleMapsAPI, and GitHubActionsAPI using 1 thread and 16 threads for different JVMs and browsers.

Also, please note that benchmarks do parsing from byte array representation and back. Performance can drop greatly when String is used for input or output.

The latest RFC for JSON requires to use UTF-8 for open systems, so it worth to add ability of working with byte arrays/buffers immediately to buy more efficiency for tapir in runtime.

kciesielski commented 1 year ago

Thanks for the insights, @plokhotnyuk. As Adam suspected, I wasn't aware of the nuances you mentioned. Looks like building all of our desired features on top of jsoniter-scala requires a fork with some non-trivial rework, resulting in performance cost. I'll look into smithy4s-json and we'll decide what path forward is reasonable.