sirthias / borer

Efficient CBOR and JSON (de)serialization in Scala
https://sirthias.github.io/borer/
Mozilla Public License 2.0

Key collision error in ADTs with type bounds #685

Closed raquo closed 8 months ago

raquo commented 8 months ago

Hello,

I'm getting a compiler error which seems unintentional. The following code:

import io.bullet.borer.*
import io.bullet.borer.derivation.MapBasedCodecs.*
import io.bullet.borer.derivation.key

sealed trait Bar {
  def toEither: Either[Int, String] = {
    this match {
      case IntBar(int) => Left(int)
      case StrBar(str) => Right(str)
    }
  }
}

case class IntBar(int: Int) extends Bar

case class StrBar(str: String) extends Bar

given barEncoder: Encoder[Bar] = Encoder.ForEither.default[Int, String].contramap[Bar](_.toEither)

// --

sealed trait Foo

// @key("foo1")
case class Foo1[B <: Bar](b1: B) extends Foo

// @key("foo2")
case class Foo2[B <: Bar](b2: B) extends Foo

given foo1encoder: Encoder[Foo1[Bar]] = deriveEncoder

given foo2encoder: Encoder[Foo2[Bar]] = deriveEncoder

given fooEncoder: Encoder[Foo] = deriveEncoder

results in this compiler error:

@key collision: sub types `Bar]` and `Bar]` of ADT `Foo` share the same type id `Bar]`
    given fooEncoder: Encoder[Foo] = deriveEncoder

Kind of seems like the parsing of the type id chokes on type bounds. I would expect the type ids for the Foo* case classes to be unique, specifically Foo1 and Foo2, but both of them end up as Bar].

As a workaround, this error can be resolved by adding the manual @key annotations.
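
For reference, that just means uncommenting the @key annotations from the snippet above:

@key("foo1")
case class Foo1[B <: Bar](b1: B) extends Foo

@key("foo2")
case class Foo2[B <: Bar](b2: B) extends Foo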


I also have a somewhat related question...

The above code (when it compiles with manual @key-s) gives me Encoder[Foo], which requires Encoder[Foo2[Bar]]. Both Foo2 and Encoder are invariant in their type params, so, strictly speaking, it seems to me that Encoder[Foo2[Bar]] is not the right type to encode Foo2[? <: Bar], which is what values of Foo are allowed to be.

In practice, this is fine when using fooEncoder, which accepts Foo, but if I wanted to use foo2encoder directly (forget the rest of the Foo ADT for a moment), it would require an instance of Foo2[Bar], which I don't have; I only have instances of Foo2[? <: Bar]. The same goes for decoding: a similarly derived foo2decoder would give me a value typed as Foo2[Bar], whereas its output is actually an instance of Foo2[? <: Bar]. So I need to liberally use asInstanceOf to adjust the types around such codecs.
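
To illustrate, the kind of cast I mean is just a type-level adjustment like this (illustrative only):

// Widens the derived encoder's type to match the values I actually have:
val foo2encoderBase: Encoder[Foo2[? <: Bar]] =
  foo2encoder.asInstanceOf[Encoder[Foo2[? <: Bar]]]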

Ultimately, the codecs are encoding and decoding the way I want, so I can keep doing what I'm doing, but I just wanted to raise the issue that this seems a bit unsound, in case it's an accidental omission, rather than a known limitation for the sake of simplicity.

sirthias commented 8 months ago

Thank you, @raquo, for the good ticket description! The code autogenerating the type ids wasn't very good so far. I've now improved it, which takes care of this issue. It's possible, though, that some previously "good" type ids are now generated differently than before, which would be an unfortunate breaking change. So we'll have to see. I'll publish a fresh release in a few minutes.

sirthias commented 8 months ago

Regarding your related question: The Encoder and Decoder types are intentionally invariant, because I've had hard-to-understand issues with non-invariant type classes before, especially with auto-derivation. The "magic" is already somewhat hard to understand as it is; bringing in type variance wouldn't exactly make problems easier to understand and debug.

Nevertheless I agree with you that the current derivation logic appears a bit "loose" for this case of ADT subtypes carrying their own invariant type parameters. But, as you say, sprinkling a few casts over the code at the right places should fix all issues and the generated code itself should always be fine.

Maybe it'd be possible to autogenerate the casts sometimes.

Can you show the casts that are most tedious and which you think borer could maybe generate itself?

raquo commented 8 months ago

Thanks, yes, that makes sense. I definitely agree with Encoder and Decoder being invariant; if that weren't the case, I would have run into those issues myself while trying to wire things up, so I know what you mean.

My canonical use case for bounded types is like this:

trait Row {
   ...
}

trait RowType[Rec <: Row] {
  given rowCodec: Codec[Rec]
}

object RowType {
  given codec: Codec[RowType[Row]] = ???
  given baseCodec: Codec[RowType[? <: Row]] = codec.asInstanceOf[Codec[RowType[? <: Row]]]
}

With this, codec is used as required by Borer (e.g. in derivation of types that have RowType[Rec] members for some [Rec <: Row], like the UpdateCommand below), and I use baseCodec manually when I need to e.g. parse the JSON string of a RowType into an instance that I can use in my code, typed RowType[? <: Row].
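
For example, such a manual decoding call looks something like this (a minimal sketch; json is just a placeholder input):

import io.bullet.borer.Json

val json: Array[Byte] = ???

val rowType: RowType[? <: Row] =
  Json.decode(json).to[RowType[? <: Row]](using RowType.baseCodec.decoder).value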

That is the simpler case. Basically, all RowType-s are encoded / decoded with the same base codec.

I think most of the asInstanceOf-s I have fit that pattern - converting from types like {Codec,Encoder,Decoder} of RowType[Row] to the same of RowType[? <: Row]. But I'm not sure if this is a good, generally applicable conversion that Borer should have built in. It kinda seems specific to my knowledge about my own codecs.

For a more complicated case, consider these additional types:

case class User(...) extends Row

object User extends RowType[User] {
  override given rowCodec: Codec[User] = deriveCodec
}

case class UpdateCommand[Rec <: Row](
  rowType: RowType[Rec],
  changes: (Rec, Rec),    // prev, next values
  ...
)

object UpdateCommand {
  given baseCodec: Codec[UpdateCommand[? <: Row]] = ???
}

Here, we want a codec for UpdateCommand[? <: Row], but we don't have a codec for Row or ? <: Row; it just does not exist on its own. However, we do have a codec for RowType[? <: Row], and with that we can get a codec for the particular Rec that's hiding in the ? <: Row type signature. So, for UpdateCommand and similar types, I made a manual codec that writes UpdateCommand into a single-element Map like {rowType: updateCommand}. Encoding is obvious; for decoding, we read the rowType key, convert it to RowType[? <: Row] using RowType.baseCodec, then use the rowCodec on that instance to derive a decoder for UpdateCommand[Rec] - for this particular Rec. It's quite a handful of manual code, but it seems to work, and is fairly reusable across different types. Here is the implementation, for anyone curious:

// Encodes a value as a single-entry Map: { getKey(value): <value written with the key-specific encoder> }
def mapEncoder[A](
  getKey: A => String,
  valueEncoderForKey: String => Encoder[A]
): Encoder[A] = {
  new Encoder[A]:
    override def write(writer: Writer, value: A): Writer = {
      writer.writeMapOpen(size = 1)
      val key = getKey(value)
      val valueEncoder = valueEncoderForKey(key)
      writer.write(key)
      valueEncoder.write(writer, value)
      writer.writeMapClose()
    }
}

// Decodes a single-entry Map, selecting the value decoder based on the key that was read
def mapDecoder[A](
  valueDecoderForKey: String => Decoder[A]
): Decoder[A] = {
  new Decoder[A]:

    private def overflowFailure(reader: Reader, size: Option[Int]): Nothing = {
      reader.overflow(s"Too many keys (${size.getOrElse(">1")}) for single-item Map decoder")
    }

    private def emptyMapFailure(reader: Reader): Nothing = {
      reader.validationFailure("Single-item Map decoder found empty map")
    }

    private def _readValue(reader: Reader): A = {
      val key = reader.readString()
      val valueDecoder = valueDecoderForKey(key)
      val value = valueDecoder.read(reader)
      value
    }

    override def read(reader: Reader): A = {
      if (reader.hasMapHeader) {
        val size = reader.readMapHeader().toInt // #TODO[Integrity] It's a Long
        if (size > 1) {
          overflowFailure(reader, Some(size))
        } else if (size == 0) {
          emptyMapFailure(reader)
        } else {
          _readValue(reader)
        }
      } else if (reader.hasMapStart) {
        reader.readMapStart()
        if (reader.tryReadBreak()) {
          emptyMapFailure(reader)
        }
        val value = _readValue(reader)
        if (!reader.tryReadBreak()) {
          overflowFailure(reader, size = None)
        }
        value
      } else {
        reader.unexpectedDataItem(expected = "Map")
      }
    }
}

(The codec might be a bit over-complicated because I based it on Borer's own Map codec without good understanding thereof)
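
To show how these helpers get wired up for UpdateCommand, here is a rough sketch (the per-key encoder/decoder lookup is elided with ???; in my code it goes through the RowType's rowCodec):

given Encoder[UpdateCommand[? <: Row]] = mapEncoder[UpdateCommand[? <: Row]](
  getKey = cmd => cmd.rowType.key, // rowType.key is the Rec discriminator string
  valueEncoderForKey = key => ???  // build the encoder for that particular Rec
)

given Decoder[UpdateCommand[? <: Row]] = mapDecoder[UpdateCommand[? <: Row]](
  valueDecoderForKey = key => ???  // the key determines which Rec-specific decoder to use
)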

I'm not sure if this UpdateCommand example is directly relevant to the questions of asInstanceOf-s, but I wanted to show an alternative use case with type bounds.

I'm not sure if any of this should be accounted for in Borer itself. I'm still figuring out the best ways to use Borer, and I don't have a good idea if my hacks are the right way to do it or not, or what kinds of consequences they might have if they lived in the library, etc. Unless you already have better ideas, I think I'd rather just live with asInstanceOf-s at least until I get more experience with all this.

raquo commented 8 months ago

And just to confirm, the reported key collision bug is indeed fixed, thanks 👍

sirthias commented 8 months ago

Hmm... ok.

Looking at your

object UpdateCommand {
  given baseCodec: Codec[UpdateCommand[? <: Row]] = ???
}

Shouldn't that just be something like the following?

object UpdateCommand {
  given [A <: Row : Encoder : Decoder]: Codec[UpdateCommand[A]] = ...
}

The codec for an UpdateCommand[A] should provide everything specific to this UpdateCommand and delegate everything specific to A to the Encoder[A] and Decoder[A] respectively. Wouldn't that work in your case?

raquo commented 8 months ago

Thanks for looking into it!

I do use this pattern you suggested for most things, and it works well.

However, for this use case, the problem is that given Codec[UpdateCommand[A]] = ... is not a single codec, it's a bunch of unrelated codecs, one for every A. Meanwhile, I have a single endpoint on the backend that accepts JSON-encoded UpdateCommand[? <: Row] as input, so if I used Encoder[UpdateCommand[A]] to output that JSON on the client, the backend wouldn't know which A this JSON is for, and thus would not know which Decoder[UpdateCommand[A]] it needs to use to decode the JSON.

In this simple case I could pass the knowledge of A in a side channel (outside of JSON): for example, instead of one /api/command backend endpoint, I could have a dynamic route for /api/command/${rowType}, where rowType is some information the backend needs to determine which Decoder[UpdateCommand[A]] it should use.

However, this side-channel approach is very limited, e.g. I couldn't use it if the endpoint accepted a more complex data structure, such as a List of UpdateCommand[? <: Row] (each with potentially different A-s) instead of just one command.

So, I made the mapEncoder / mapDecoder helpers shown above to pass the information necessary to create a decoder in JSON itself.

sirthias commented 8 months ago

However, for this use case, the problem is that given Codec[UpdateCommand[A]] = ... is not a single codec, it's a bunch of unrelated codecs, one for every A.

Ok. If the logic is specific to the A, then the given instance should be provided there, i.e. in the context of the Row subtype rather than at the UpdateCommand, no? The given in the UpdateCommand context provides everything that's not dependent on the actual A. Everything else should go to the Row or the respective subtype. This way you have a clean separation of concerns.

And it's fine if the Codec[UpdateCommand[User]] is supplied in the companion object of User. If its logic is specific to User then it should be provided there, rather than at the level of the UpdateCommand.
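
For example, something along these lines (just a sketch; it assumes the derivation can resolve encoders/decoders for all of UpdateCommand's fields, and updateCommandCodec is only an illustrative name):

object User extends RowType[User] {
  override given rowCodec: Codec[User] = deriveCodec

  // Sketch: the User-specific UpdateCommand codec lives next to the other User-specific logic
  given updateCommandCodec: Codec[UpdateCommand[User]] = deriveCodec
}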

Meanwhile, I have a single endpoint on the backend that accepts JSON-encoded UpdateCommand[? <: Row] as input, so if I used Encoder[UpdateCommand[A]] to output that JSON on the client, the backend wouldn't know which A this JSON is for, and thus would not know which Decoder[UpdateCommand[A]] it needs to use to decode the JSON.

Yes, that's the classic "ADT problem" of having to somehow encode the type information. borer does it automatically if you have it derive a Codec[Row]. But since you need a Decoder[UpdateCommand[? <: Row]] and the encoding logic for UpdateCommand isn't independent of the Row subtype, you are on your own.

Can you somehow restructure things to make the Codec[UpdateCommand[A]] not care about the actual encoding logic for A? It seems like this would make things a lot cleaner...

raquo commented 8 months ago

Sorry for the delay!

Yes, in theory I could do it the way you suggest and make a Codec[Row]. But by design, Row is not and cannot be a sealed trait, so I would still need to make that codec using my mapEncoder / mapDecoder helpers; there's no escaping that. So this just moves the need for those helpers from UpdateCommand to Row.

As you said, in principle having Codec[Row] is better for separation of concerns. However, it would also be inefficient in other contexts. For example, my UpdateCommand is for writing data, but I also have a similar QueryResponse type for reading data. Simplified:

case class QueryResponse[Rec <: Row](
  rowType: RowType[Rec],
  records: List[Rec]
)

Currently I create Codec[QueryResponse[? <: Row]] using my mapEncoder / mapDecoder helpers similarly to what I showed with UpdateCommand, and I get this kind of JSON:

{"SomeRec": {"rowType": "SomeRec", "records": [<rec1_json1>, <rec2_json>, ...]}}

Where "SomeRec" is the discriminator of Rec (rowType.key), and rec{1,2}_json are json values for the specific Rec type made with Codec[Rec].

In contrast, if I had a single universal Codec[Row], I could drop val rowType from QueryResponse, and it would serialize like this:

{"records": [{"SomeRec": <rec1_json1>}, {"SomeRec": <rec2_json>}, ...]}

As you see, instead of wrapping the JSON into {"SomeRec": _} once, we now have to do it N times, where N is the number of records in records, which could be up to hundreds or even thousands. To be fair, I haven't tried measuring any performance differences, but I just don't feel good about such redundancy, especially since it does not rid me of mapEncoder / mapDecoder.

So far I don't really think I can make it any better. It seems to work fine, just needed some helpers that I'm now reusing across several types.

sirthias commented 8 months ago

Ok, I see. In your case being able to manually pull the rowType out of the array and move it one level higher really does appear advantageous. But -- as you describe -- this is not something that borer could really help you with. For these cases borer tries to offer an API that makes it possible (and somewhat convenient) to define your encoding and decoding logic in a way that is readable and maintainable. Hopefully this works out in your case...

raquo commented 8 months ago

Borer's facilities for making custom codecs have been great so far :+1: There are a couple of things I don't yet understand, like the concept of map headers (I'm guessing it's some CBOR thing), but that doesn't stop me from successfully cargo-culting custom codecs from the built-in ones :)

sirthias commented 8 months ago

Yes, CBOR has the feature that a "map start marker" can contain the number of map items, which then doesn't require a "map stop marker" like JSON's } character. Sometimes it's better to know ahead of time, when starting to read an Array or Map, how many items are coming, e.g. because you can then reserve a buffer of the right size right away.

borer gives you format-agnostic helpers like this here:

def writeMapOpen(size: Int): this.type =
    if (target eq Json) writeMapStart() else writeMapHeader(size)

which take care of automatically selecting the best way to start an Array or Map, depending on the target format.

If all you need is JSON then you can skip that level and just directly write the JSON primitives like unsized "Map Starts".
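
In a custom encoder that could look roughly like this (a minimal sketch with a placeholder key and value):

writer.writeMapStart()    // unsized map start (renders as { for JSON)
writer.write("someKey")   // placeholder key
writer.write("someValue") // placeholder value
writer.writeBreak()       // closes the unsized map (renders as } for JSON)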

raquo commented 8 months ago

Nice, thanks, I'll make use of that 🙏