sksamuel / avro4s

Avro schema generation and serialization / deserialization for Scala
Apache License 2.0

Schema derivation yields "Unknown datum class" w/ nested classes / scalapb enums as defaults #826

Open chollinger93 opened 4 months ago

chollinger93 commented 4 months ago

Error

I'm getting an Unknown datum class: ExampleEnumEvent$Action$Undefined$ error when trying to derive a schema for a scalapb-generated enum.

More generally, this can be reproduced with any nested structure (see below).

Similar to #677?

Minimal Protobuf Example

// pb/types.proto
syntax = "proto3";

package pb;

message ID {
  string id = 1;
}

// example.proto
syntax = "proto3";

import "pb/types.proto";

message ExampleEnumEvent {
  pb.ID id = 1;
  Action action = 2;
  enum Action {
    Undefined = 0;
    Allow = 1;
    Deny = 2;
  }
}

The scalapb package settings are preserve_unknown_fields: false, lenses: false.

With these, scalapb generates:

@SerialVersionUID(0L)
final case class ExampleEnumEvent(
    id: _root_.scala.Option[pb.types.ID] = _root_.scala.None,
    action: ExampleEnumEvent.Action = ExampleEnumEvent.Action.Undefined
    ) extends scalapb.GeneratedMessage {

Action is a

sealed abstract class Action(val value: _root_.scala.Int) extends _root_.scalapb.GeneratedEnum 

Test

import com.sksamuel.avro4s.{AvroSchema, SchemaFor, ToRecord, Encoder as AvroEncoder}
import pb.types.ID // generated from pb/types.proto

val e = ExampleEnumEvent(
  id = Some(ID("1")),
)
type T = ExampleEnumEvent
val schema = AvroSchema[T]
println(schema.toString(true))
val enc = AvroEncoder[T]
val toRecord: ToRecord[T] = ToRecord[T](schema)(using enc)
val gen = toRecord.to(e)
println(gen)

This throws:

Unknown datum class: class ExampleEnumEvent$Action$Undefined$
org.apache.avro.AvroRuntimeException: Unknown datum class: class ExampleEnumEvent$Action$Undefined$
    at org.apache.avro.util.internal.JacksonUtils.toJson(JacksonUtils.java:96)
    at org.apache.avro.util.internal.JacksonUtils.toJsonNode(JacksonUtils.java:53)
    at org.apache.avro.Schema$Field.<init>(Schema.java:598)
    at com.sksamuel.avro4s.schemas.Records$.buildSchemaField(records.scala:89)
    at com.sksamuel.avro4s.schemas.Records$.$anonfun$1(records.scala:31)
    at scala.collection.immutable.List.flatMap(List.scala:293)
    at com.sksamuel.avro4s.schemas.Records$.schema(records.scala:32)
    at com.sksamuel.avro4s.schemas.MagnoliaDerivedSchemas.join(magnolia.scala:14)
    at com.sksamuel.avro4s.schemas.MagnoliaDerivedSchemas.join$(magnolia.scala:10)
    at com.sksamuel.avro4s.SchemaFor$.join(SchemaFor.scala:55)

Without Protobuf

A plain Scala case class with a nested default value fails in the same way:

final case class Nested(s: String = "foo", n: Nested.Nest = Nested.Undefined())
object Nested {
  sealed abstract class Nest(i: Int)
  final case class Undefined() extends Nest(-1)
  final case class N(i: Int) extends Nest(i)
}
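Judging from the stack trace, the exception is raised while the field's default value is converted to JSON, and that conversion only seems to understand a small set of primitive and collection types. The following is a minimal plain-Scala sketch of that behaviour. `defaultToJson` is a hypothetical stand-in written for illustration, not Avro's or avro4s's actual code, and it handles fewer types than the real conversion.

```scala
// Same shape as the reproduction above.
final case class Nested(s: String = "foo", n: Nested.Nest = Nested.Undefined())
object Nested {
  sealed abstract class Nest(i: Int)
  final case class Undefined() extends Nest(-1)
  final case class N(i: Int) extends Nest(i)
}

// Hypothetical, simplified model of a "default value to JSON" step.
// Assumption: only null, strings, booleans and numbers are supported here;
// anything else is rejected with the same kind of message as in the trace.
def defaultToJson(datum: Any): String = datum match {
  case null                => "null"
  case s: CharSequence     => "\"" + s + "\""
  case b: Boolean          => b.toString
  case n: java.lang.Number => n.toString
  case other =>
    throw new RuntimeException(s"Unknown datum class: ${other.getClass}")
}
```

In this sketch, `defaultToJson("foo")` succeeds while `defaultToJson(Nested.Undefined())` throws, which is consistent with the observation that the error disappears once the default value is removed.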

Validation / Workaround

If we set no_default_values_in_constructor (or remove the default), it works and yields:

{
  "type" : "record",
  "name" : "ExampleEnumEvent",
  "namespace" : "test",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "string" ]
  }, {
    "name" : "action",
    "type" : [ {
      "type" : "record",
      "name" : "Allow",
      "namespace" : "ExampleEnumEvent.Action",
      "fields" : [ ]
    }, {
      "type" : "record",
      "name" : "Deny",
      "namespace" : "ExampleEnumEvent.Action",
      "fields" : [ ]
    }, {
      "type" : "enum",
      "name" : "Recognized",
      "namespace" : "ExampleEnumEvent.Action",
      "symbols" : [ "Undefined", "Allow", "Deny" ]
    }, {
      "type" : "record",
      "name" : "Undefined",
      "namespace" : "ExampleEnumEvent.Action",
      "fields" : [ ]
    }, {
      "type" : "record",
      "name" : "Unrecognized",
      "namespace" : "ExampleEnumEvent.Action",
      "fields" : [ {
        "name" : "unrecognizedValue",
        "type" : "int"
      } ]
    } ]
  } ]
}

Alternatively, explicitly setting the field

val e = ExampleEnumEvent(
  id = Some(ID("1")),
  action = ExampleEnumEvent.Action.Undefined
)

has the same effect (which of course isn't viable for events that come in from another service).

I can't define a given SchemaFor[ExampleEnumEvent.Action] = SchemaFor[ExampleEnumEvent.Action], because that throws a StackOverflowError. I suppose the given resolves to itself and recurses infinitely at runtime.

I've also tried tricking avro4s into treating the scalapb.GeneratedEnum as an Enumeration type by defining a trait that extends Enumeration, but to no avail.

Other

On a side note, compilation time for scalapb-generated objects that include a val of AvroSchema[A] is very long (up to ~10s/class), presumably because the generated scalapb classes are rather large. Scala 3 doesn't have any good compiler profilers, as far as I'm aware, so I'm not 100% sure where exactly the time is spent.

But I figured I'd mention that here, since I'm not sure if that's expected.

Environment

22.04.1-Ubuntu, avro4s 5.0.9, Scala 3.4.0