plokhotnyuk / jsoniter-scala

Scala macros for compile-time generation of safe and ultra-fast JSON codecs + circe booster

Add a more foolproof way to implement custom decoders #1219

Closed: lexspoon closed this 2 weeks ago

lexspoon commented 2 weeks ago

This PR implements issue https://github.com/plokhotnyuk/jsoniter-scala/issues/1215 .

I put the code in "shared", but I was only able to run the tests for the JVM target. For the native target, I get an error about gc.h. For the JS target, it spends over 10 minutes downloading NPM stuff and then aborts. Perhaps someone else can run the tests on those targets, and/or perhaps there's a builder somewhere that can do so?

What this code does is allow you to write a custom decoder by reading a whole value at a time rather than one token at a time. I have found token-at-a-time decoding to be very error-prone and not something I can recommend to my coworkers; with that approach, I'm frequently not even sure that I am correctly detecting malformed JSON. The one-value-at-a-time approach, however, seems best-of-breed. The decoding API in this PR is inspired by Jsoniter for Java as well as Circe's HCursor decoders, but it has some tricks that I think make it competitive with, and in some ways better than, those.

The general approach is to decode an entire value at once. So, there are entry points in the API for all of the types of values that JSON supports. For arrays and objects, there are sub-elements that need to be parsed, and for those, the caller supplies a callback. With Scala's syntax, these callbacks are usually very concise to write. Take a look at the MetricData example in this PR's test case.
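To make that shape concrete, here is a sketch of such a facade built over jsoniter-scala's JsonReader (all names below are invented for illustration; the PR's real entry points may differ, so see the PR itself and the MetricData test for the actual API):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core.JsonReader

// Hypothetical value-at-a-time facade: one entry point per JSON value type,
// with objects handing each key/value pair to a caller-supplied callback.
final class ValueReader(in: JsonReader) {
  def string(): String = in.readString(null)
  def int(): Int = in.readInt()
  def skipValue(): Unit = in.skip()

  // Invokes `onField` once per key, with the reader positioned at the value;
  // the facade, not the caller, checks the '{', ',' and '}' framing tokens.
  def obj(onField: (String, ValueReader) => Unit): Unit =
    if (in.isNextToken('{'.toByte)) {
      if (!in.isNextToken('}'.toByte)) {
        in.rollbackToken()
        while ({
          onField(in.readKeyAsString(), this)
          in.isNextToken(','.toByte)
        }) ()
        if (!in.isCurrentToken('}'.toByte)) in.objectEndOrCommaError()
      }
    } else in.decodeError("expected JSON object")
}

// Usage shape for a hypothetical Point(x: Int, y: Int):
//   var x = 0; var y = 0
//   reader.obj((key, v) => key match {
//     case "x" => x = v.int()
//     case "y" => y = v.int()
//     case _   => v.skipValue()
//   })
```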

Compared to Jsoniter for Java, the utility in this PR requires no registry of decoders. Instead, recursive decoders are invoked using implicit values, just like in the rest of Jsoniter for Scala.

Compared to HCursor, this API is optimized for a full decode of a JSON string into some Scala object. It assumes you will want to look at every field of every object and every element of every array, so you don't have to call things like downField as much. HCursor is better, however, if you don't want to fully decode the JSON string and instead just pick out a small portion of the values within it.
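For contrast, the "pick out a small portion" style with HCursor looks like this (standard circe API; the document is a made-up example):

```scala
import io.circe.parser.parse

// Drill into one nested field with cursors and ignore the rest of the document.
val doc = """{"user":{"name":"Ada","roles":["admin","dev"]},"meta":{}}"""
val name: Either[io.circe.Error, String] =
  parse(doc).flatMap(_.hcursor.downField("user").downField("name").as[String])
// name == Right("Ada")
```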

Regardless of comparisons to the field, though, the main purpose here is to shore up a part of Jsoniter-scala that is currently problematic. The name "jsoniter" comes from the convenient decoding API ("JSON iterator"), but Jsoniter-scala currently doesn't include such a thing!

plokhotnyuk commented 2 weeks ago

Parsing JSON in Scala is a minefield.

Please use the approach recommended in my first comment on your issue, to avoid introducing a lot of new ones.

lexspoon commented 2 weeks ago

I'm not sure I understand the workaround being recommended. Are you suggesting a two-step decode: first converting the JSON to something like Circe's Json type, and then decoding that? I considered that approach, but it seems to undermine the idea of Jsoniter if I build an intermediate data structure and throw it away. Maybe a different workaround is being suggested that I don't yet understand.
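For concreteness, the two-step shape being described would look like this in circe (a sketch; whether this matches the recommendation in the linked comment is an assumption, and Point is a made-up type):

```scala
import io.circe.Decoder
import io.circe.parser.parse

final case class Point(x: Int, y: Int)
implicit val pointDecoder: Decoder[Point] = Decoder.forProduct2("x", "y")(Point.apply)

// Step 1: parse the text into circe's intermediate Json AST.
// Step 2: decode that AST into the target type; the AST is then discarded.
val decoded: Either[io.circe.Error, Point] =
  parse("""{"x":1,"y":2}""").flatMap(_.as[Point])
```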

The version in this PR seems both small and safe to me. The decoder, as you can see in the test case, is just 15 lines long. Do you see an issue with this approach if we go forward with it at my work? We have had a good experience so far, but your comments suggest we're likely to run into a lot of problems. If you can describe the problems we're likely to hit, that may help us course-correct in a way that avoids them. It surprises me, though, given our positive experience. I was hoping to share it back and shore up an important gap in jsoniter-scala.

If more context would help: the data size can indeed be large for this particular type, so encoding speed is potentially significant. Additionally, we sometimes read the JSON messages while debugging (in fact, the JSON is often easier to read than the Scala case class printouts), so the JSON encoding needs to be reasonable for that to work.

So, the context is passing feature data around through AI infrastructure. Our company has a one-row version of this data type, with a custom codec in Spray. I'm looking at the multi-row version of the problem, and I'm evaluating alternatives to Spray. As I evaluate, the ability to write a custom codec is one of the things we care about. In general, this data could have > 1000 rows and > 10 columns, maybe > 50 columns in some cases.

An automatic codec cannot be derived for this type, because derivation can't know what the Any columns might hold. If I define a sum-of-products type to replace the Any, it will blow up both the in-memory representation and the auto-derived JSON encoding. I could potentially switch the case class to be column-oriented instead of row-oriented; that would make the JSON reasonable, but at the expense of the in-memory layout being wrong for the way our code normally accesses the data. So, all in all, it seemed like a good situation for a custom codec.

I found it really terrible to write that custom codec using the JsonReader API, however. You have to do things like read an initial "n" from "null" and then call a utility to read the "ull". It's just not friendly in its current form, but I think it can be made friendly while retaining the other design goals.
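To make that pain point concrete, here is a sketch of a token-level codec for a cell that may hold a String, Boolean, Double, or null (an assumption about the cell types; the JsonValueCodec and JsonReader calls are the library's real API):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._

// Token-at-a-time decoding: peek at a raw token byte, roll back, and
// dispatch by hand, with null needing its own readNullOrError call.
implicit val cellCodec: JsonValueCodec[Any] = new JsonValueCodec[Any] {
  def decodeValue(in: JsonReader, default: Any): Any = {
    val t = in.nextToken()
    if (t == 'n'.toByte) in.readNullOrError(default, "expected JSON value") // reads the "ull"
    else {
      in.rollbackToken()
      t.toChar match {
        case '"'                                     => in.readString(null)
        case 't' | 'f'                               => in.readBoolean()
        case c if c == '-' || (c >= '0' && c <= '9') => in.readDouble()
        case _ => in.decodeError("expected string, boolean, number, or null")
      }
    }
  }

  def encodeValue(x: Any, out: JsonWriter): Unit = x match {
    case null       => out.writeNull()
    case s: String  => out.writeVal(s)
    case b: Boolean => out.writeVal(b)
    case d: Double  => out.writeVal(d)
    case _          => out.encodeError("unsupported cell type")
  }

  def nullValue: Any = null
}
```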

As an important side benefit of the custom codec, I really like jsoniter-scala's error messages and how they include the position where the error occurred. The framework can do this because it processes the JSON directly rather than an AST. I wouldn't be surprised if my company has spent > 100 hours, all told, debugging Spray JSON decoding errors that don't say where in a huge JSON blob the error occurred. I believe this benefit would go away if I did a two-step decode to Json and then to the final type.
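For example, a failed decode surfaces the position directly (Row is a made-up type; readFromString and JsonReaderException are the library's real API):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

case class Row(id: Int)
implicit val rowCodec: JsonValueCodec[Row] = JsonCodecMaker.make

// The exception message reports the byte offset of the failure and, with the
// default ReaderConfig, a hex dump of the input around that offset.
try readFromString[Row]("""{"id":true}""")
catch { case e: JsonReaderException => println(e.getMessage) }
```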

Stepping back, I'm surprised in general by the strength and tenor of these comments. A custom codec seems like an ordinary thing to want with this kind of framework, and we have several of them in our code base right now. I've spent 10-20 hours now on experimenting with Jsoniter-scala as a possible technology recommendation.

plokhotnyuk commented 1 week ago

I need to write more docs with tutorials and how-tos, to reduce the time users spend learning and searching for custom solutions.

For now, please skim the tests in the jsoniter-scala-macros module to get an understanding of the compile-time configuration options and of how to inject custom codecs using implicit val/def or given definitions in the scope of JsonCodecMaker.make derivation calls.
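For reference, that injection pattern looks like this (Tag and Item are made-up types for illustration; the codec shapes follow the library's JsonValueCodec API and the README's derivation style):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

// Tag's JSON form is a bare string rather than an object.
case class Tag(value: String)
case class Item(name: String, tag: Tag)

// A hand-written codec in implicit scope: JsonCodecMaker.make will use it
// for Tag fields instead of deriving its own representation.
implicit val tagCodec: JsonValueCodec[Tag] = new JsonValueCodec[Tag] {
  def decodeValue(in: JsonReader, default: Tag): Tag = Tag(in.readString(null))
  def encodeValue(x: Tag, out: JsonWriter): Unit = out.writeVal(x.value)
  def nullValue: Tag = null
}

implicit val itemCodec: JsonValueCodec[Item] = JsonCodecMaker.make

// readFromString[Item]("""{"name":"a","tag":"t"}""") == Item("a", Tag("t"))
```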