plokhotnyuk / jsoniter-scala

Scala macros for compile-time generation of safe and ultra-fast JSON codecs + circe booster
MIT License

Support custom decoders with a more convenient API #1215

Closed lexspoon closed 2 weeks ago

lexspoon commented 1 month ago

I'm experimenting with Jsoniter-scala, and it has gone well in general.

One thing that can be difficult, though, is custom deserializers. With the original Jsoniter, there is a very nice Iterator API. The web site goes through some of the advantages, and I am finding them to be true in practice. Writing a custom deserializer with the JsonReader interface is tedious and error prone. It's especially worrisome that using the JsonReader interface, I'm not completely sure that I'm validating the input to be correct JSON.

UPDATE: I thought about it some more and have a prototype to propose. What do the maintainers think?

The general idea in this prototype is to have one way to decode each kind of thing that JSON supports. For primitive types, a method is provided to read and return the value. For arrays and objects, the caller provides callbacks for reading the nested elements.

import com.github.plokhotnyuk.jsoniter_scala.core.{JsonReader, JsonValueCodec}

import scala.collection.immutable.ArraySeq
import scala.collection.mutable
import scala.reflect.ClassTag

/** Decode a JSON stream one element at a time, rather than one token at a time.
  */
class JsonStructuredReader(jsonReader: JsonReader) {

  /** Signal a decoder error. This simply forwards to [[JsonReader.decodeError()]]. */
  def decodeError(str: String): Nothing = jsonReader.decodeError(str)

  /** Read another kind of object with a different decoder. The default must be a
    * value, not a thrown error: arguments are evaluated eagerly, so passing a
    * decodeError(...) call here would throw before decoding even starts.
    */
  def read[T](implicit codec: JsonValueCodec[T]): T =
    codec.decodeValue(jsonReader, codec.nullValue)

  /* JSON primitive types. Consume them or throw a format error  */
  def readString: String = {
    jsonReader.readString(null)
  }

  def readNumber: Double = {
    jsonReader.readDouble()
  }

  def readBoolean: Boolean = {
    jsonReader.readBoolean()
  }

  def readNull: Null = {
    if (jsonReader.nextToken() != 'n') {
      decodeError("Expected null")
    }
    jsonReader.readNullOrError("ignored", "expected null")
    null
  }

  /** Read an array. The elements of the array will be decoded with readElement.
    */
  def readArray[T: ClassTag](readElement: => T): ArraySeq[T] = {
    val result = ArraySeq.newBuilder[T]

    if (jsonReader.nextToken() != '[') {
      jsonReader.decodeError("Expected [")
    }

    var first = true
    while (true) {
      val tok = jsonReader.nextToken()
      if (tok == ']') {
        return result.result()
      }

      if (first) {
        first = false
        jsonReader.rollbackToken()
      } else {
        if (tok != ',') {
          decodeError(s"Expected , or ]")
        }
      }

      result += readElement
    }

    throw new RuntimeException("unreachable")
  }

  /** Read an object. For each field seen in the object, the [[readField]] parameter will be invoked and will be
    * expected to consume the value that goes with that field. The caller should have a var for each field that it wants
    * to recognize. Once all fields are read, the [[computeObject]] parameter will be invoked to assemble the vars into
    * a final result object.
    */
  def readObject[T](readField: (String) => Unit, computeObject: => T): T = {
    if (jsonReader.nextToken() != '{') {
      decodeError("Expected {")
    }
    val fieldNames = mutable.HashSet.empty[String]

    var first = true
    while (true) {
      val tok = jsonReader.nextToken()
      if (tok == '}') {
        return computeObject
      }

      if (first) {
        first = false
        jsonReader.rollbackToken()
      } else {
        if (tok != ',') {
          decodeError(s"Expected , or }")
        }
      }

      val fieldName = readString
      if (fieldNames.contains(fieldName)) {
        decodeError(s"Duplicate field $fieldName")
      }
      fieldNames += fieldName

      if (jsonReader.nextToken() != ':') {
        decodeError("Expected :")
      }

      readField(fieldName)
    }

    throw new RuntimeException("Unreachable")
  }

  /** Peek at the next value that is coming up and return what type it is */
  def peek: JsonValueType.JsonValueType = {
    val token = jsonReader.nextToken()
    jsonReader.rollbackToken()
    token match {
      case '[' => JsonValueType.Array
      case 't' | 'f' => JsonValueType.Boolean
      case 'n' => JsonValueType.Null
      case c if c >= '0' && c <= '9' => JsonValueType.Number
      case '{' => JsonValueType.Object
      case '"' => JsonValueType.String
    }
  }
}

/** A type of JSON value */
object JsonValueType extends Enumeration {
  type JsonValueType = Value
  val Array, Boolean, Null, Number, Object, String = Value
}

With this API, I think it is possible to ensure that all decoded objects came from a syntactically valid JSON input stream. Also, this API just looks really convenient to use compared to JsonReader. JsonReader can still exist as an internal API, and it can be directly used by codecs that the macro expands to, but this API looks a lot better for custom codecs written by hand.
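To make the intent concrete, here is a self-contained sketch of how a hand-written decoder could look against this style of API. Everything here is invented for illustration: MiniReader is a toy stand-in for jsoniter-scala's JsonReader (no escape handling, no error positions), and StructuredReader and Metric are hypothetical names, not part of the library or the prototype above.

```scala
// Toy token cursor over a string (NOT jsoniter-scala's JsonReader):
// just enough tokenizing to exercise the callback-based API.
final class MiniReader(input: String) {
  private var pos = 0
  private def skipWs(): Unit =
    while (pos < input.length && input(pos).isWhitespace) pos += 1
  def nextToken(): Char = { skipWs(); val c = input(pos); pos += 1; c }
  def rollbackToken(): Unit = pos -= 1
  def readString(): String = {
    if (nextToken() != '"') sys.error("expected string")
    val start = pos
    while (input(pos) != '"') pos += 1 // no escape handling in this sketch
    val s = input.substring(start, pos)
    pos += 1
    s
  }
  def readNumber(): Double = {
    skipWs()
    val start = pos
    while (pos < input.length && (input(pos).isDigit || "+-.eE".contains(input(pos)))) pos += 1
    input.substring(start, pos).toDouble
  }
}

// The structure-level API, reduced to strings, numbers, arrays and objects.
final class StructuredReader(r: MiniReader) {
  def readString: String = r.readString()
  def readNumber: Double = r.readNumber()

  def readArray[T](readElement: => T): Vector[T] = {
    val b = Vector.newBuilder[T]
    if (r.nextToken() != '[') sys.error("expected [")
    var first = true
    var done = false
    while (!done) {
      val tok = r.nextToken()
      if (tok == ']') done = true
      else {
        if (first) { first = false; r.rollbackToken() }
        else if (tok != ',') sys.error("expected , or ]")
        b += readElement
      }
    }
    b.result()
  }

  def readObject[T](readField: String => Unit, computeObject: => T): T = {
    if (r.nextToken() != '{') sys.error("expected {")
    var first = true
    while (true) {
      val tok = r.nextToken()
      if (tok == '}') return computeObject
      if (first) { first = false; r.rollbackToken() }
      else if (tok != ',') sys.error("expected , or }")
      val name = r.readString()
      if (r.nextToken() != ':') sys.error("expected :")
      readField(name)
    }
    sys.error("unreachable")
  }
}

final case class Metric(name: String, values: Vector[Double])

object StructuredReaderDemo {
  // A decoder in the proposed style: one var per field,
  // assembled once the closing brace is seen.
  def decodeMetric(in: StructuredReader): Metric = {
    var name: String = null
    var values: Vector[Double] = Vector.empty
    in.readObject(
      {
        case "name"   => name = in.readString
        case "values" => values = in.readArray(in.readNumber)
        case other    => sys.error(s"unknown field $other")
      },
      Metric(name, values)
    )
  }

  def main(args: Array[String]): Unit = {
    val in = new StructuredReader(new MiniReader("""{"name":"latency","values":[1.5,2.0,3.25]}"""))
    println(decodeMetric(in)) // Metric(latency,Vector(1.5, 2.0, 3.25))
  }
}
```

Note that the comma/bracket loop and the builder live once, inside readArray and readObject; the per-type decoder only states which fields it expects and how to read each one.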

plokhotnyuk commented 2 weeks ago

Do not use custom serializers for sum and product types unless you need exceptional performance.

A much safer and more efficient way is to have a data model that is as close to the JSON representation as possible, for easier automated derivation of codecs and Chimney/Ducktape-based transformation to your target data model.

lexspoon commented 2 weeks ago

Ah, okay. This is a confusing thing to read given that the phrase "custom codec" appears 8 times on the home page, with statements over and over that custom codecs are a feature that should attract you to Jsoniter-scala. Perhaps it is worth updating the home page to say that you shouldn't implement custom codecs, and that it's just an implementation detail that it's even possible?

I find in practice that you can automatically convert 95% of your case classes to JSON, but that occasionally you want to do something custom. The MetricData and TypeRef examples in this PR are real examples that I ran into while attempting to use Jsoniter-scala at work. I cannot adjust the case classes to match the automatic codecs for those two types as far as I know. For TypeRef, I would have to use a String and not have the marker class. For MetricData, I don't think it's possible at all. These are exceptional cases, but in a large code base, the exceptions do happen once in a while.

A practical toolkit for encoding case classes to JSON will generally need some kind of escape hatch for custom codecs. I'm surprised that's not interesting for Jsoniter-scala, given that the original Jsoniter has it.

plokhotnyuk commented 2 weeks ago

I can help with finding the safest and most efficient solutions for your challenges.

Please open an issue for each of them with expected JSON samples and existing data structures.

lexspoon commented 2 weeks ago

The PR has two examples in the test cases, so please take a look. I'm thinking of MetricData and TypeRef.

In general, I think it will be hard to avoid wanting to ever write a custom decoder. Moreover, I'm not sure why it would be unwelcome to make this process easier. The framework would still have all its other advantages, plus now one more.

plokhotnyuk commented 1 week ago

The custom codec for TypeRef can be written manually without an extra wrapping decoder:

implicit val codecOfTypeRef: JsonValueCodec[TypeRef] = new JsonValueCodec[TypeRef] {
  override def decodeValue(in: JsonReader, default: TypeRef): TypeRef = new TypeRef(in.readString(null))

  override def encodeValue(x: TypeRef, out: JsonWriter): Unit = out.writeVal(x.name)

  override def nullValue: TypeRef = null
}

or derived automatically:

implicit val codecOfTypeRef: JsonValueCodec[TypeRef] = 
  JsonCodecMaker.make(CodecMakerConfig.withInlineOneValueClasses(true))

The codec for MetricData can be auto-derived too; you just need to add a custom codec for Any values:

implicit val codecOfAny: JsonValueCodec[Any] = new JsonValueCodec[Any] {
  override def decodeValue(in: JsonReader, default: Any): Any = {
    val t = in.nextToken()
    if (t == 't' || t == 'f') {
      in.rollbackToken()
      in.readBoolean()
    } else if (t >= '0' && t <= '9' || t == '-') {
      in.rollbackToken()
      in.readDouble()
    } else if (t == '"') {
      in.rollbackToken()
      in.readString(null)
    } else {
      in.readNullOrError(default, "expected boolean, numeric, string, or null values")
    }
  }

  override def encodeValue(x: Any, out: JsonWriter): Unit =
    x match {
      case b: Boolean => out.writeVal(b)
      case d: Double => out.writeVal(d)
      case s: String => out.writeVal(s)
      case _ => out.writeNull()
    }

  override def nullValue: Any = null
}

implicit val codecOfMetricData: JsonValueCodec[MetricData] = JsonCodecMaker.make[MetricData]

A better option would be to use a sum type instead of Any. It could be modeled with a sealed trait and case classes that extend it, or in Scala 3 with the new enums or union types.
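As a hedged sketch of that modeling, the metric value could be wrapped in a small sealed hierarchy. MetricValue and its case names are illustrative, not part of jsoniter-scala or the codebase under discussion:

```scala
// Hypothetical sum type replacing Any for metric values; these names are
// invented for illustration.
sealed trait MetricValue
object MetricValue {
  final case class Bool(value: Boolean) extends MetricValue
  final case class Num(value: Double)   extends MetricValue
  final case class Str(value: String)   extends MetricValue
  case object Null                      extends MetricValue

  // Exhaustive matching replaces the unchecked type tests needed with Any:
  def render(v: MetricValue): String = v match {
    case Bool(b) => b.toString
    case Num(d)  => d.toString
    case Str(s)  => "\"" + s + "\""
    case Null    => "null"
  }
}
```

With a sealed hierarchy like this, the compiler checks match exhaustiveness, and JsonCodecMaker.make can derive a codec for it (the exact JSON encoding of the leaves depends on the CodecMakerConfig options chosen).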

lexspoon commented 1 week ago

I agree that it can be done, and that looks like a clean solution using the JsonReader API. I wrote something similar to start with. It took a lot of time and some false starts, but I eventually got it to work. I don't think I found readNullOrError, so your version has one less "else if", which is a tidy improvement.

The version on this page solves a slightly simpler problem, though. I believe this version will decode input like {"data": [[1,2], [3,4]]}, won't it? To make it a full apples-to-apples comparison, consider trying a decoder for [[1,2], [3,4]].

Either way, I'd encourage you to write the same decoder using JsonStructuredReader rather than JsonReader. I'm wondering if you would agree that, with the help of the wrapper, writing a decoder becomes easier.

Here are some problems the wrapper solves compared to the above code:

In the fuller version that also decodes the arrays, there are issues with commas and brackets, and with allocating builders to accumulate the data. These all go away with JsonStructuredReader.
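To illustrate the bookkeeping being referred to, here is a self-contained pure-Scala sketch (no jsoniter-scala; parseNested is a hypothetical name) of what a token-level decoder for input like [[1,2],[3,4]] has to track by hand:

```scala
// The bracket, comma, and builder bookkeeping that a hand-written
// token-level decoder needs for nested arrays of numbers.
object NestedArrayDemo {
  def parseNested(s: String): Vector[Vector[Double]] = {
    var pos = 0
    def peek: Char = { while (s(pos).isWhitespace) pos += 1; s(pos) }
    def expect(c: Char): Unit = { if (peek != c) sys.error(s"expected $c"); pos += 1 }
    def number(): Double = {
      val start = pos
      while (pos < s.length && (s(pos).isDigit || "+-.eE".contains(s(pos)))) pos += 1
      s.substring(start, pos).toDouble
    }
    def inner(): Vector[Double] = {
      val b = Vector.newBuilder[Double] // one builder per nesting level
      expect('[')
      if (peek != ']') {
        b += number()
        while (peek == ',') { pos += 1; b += number() }
      }
      expect(']')
      b.result()
    }
    val b = Vector.newBuilder[Vector[Double]]
    expect('[')
    if (peek != ']') {
      b += inner()
      while (peek == ',') { pos += 1; b += inner() }
    }
    expect(']')
    b.result()
  }
}
```

The inner and outer loops are near-duplicates: each level repeats the open-bracket check, the first-element-vs-comma distinction, and the builder allocation, which is exactly the boilerplate a structured readArray would centralize.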

None of these issues prevent a decoder from being implemented, and I know that custom decoders aren't the framework's main focus. It just seems better if the framework can make this case easier, saving the developer's mental bandwidth for other things. And this is just one example, by the way. I just checked out of curiosity, and the main codebase I work in right now has about 70 custom Spray decoders.

Re a sum type, I agree that it's cleaner that way. Also, there's a similar trade-off for the JSON encoding. In both cases, if the data can be large, you might want to have a more compact encoding even though it's not as clean.