Request: Separate `Coder` infrastructure into contained artifact

jgogstad commented 4 years ago

The auto derivation of Coder instances has a large impact on client's compile times (discussed in https://github.com/spotify/scio/issues/2811, https://github.com/spotify/scio/issues/1898). One way to work around this is to explicitly cache derived coders in the default implicit scope for the types in question, so for example

case class MyCaseClass(a: Artist, b: Boiler, c: Car)

object MyCaseClass {
  implicit val coder: Coder[MyCaseClass] = Coder.gen[MyCaseClass]
}

And then do the same for Artist, Boiler and Car.

We've reduced compilation times with minutes by using this technique. You'll recognize that this pattern is very similar to the "semiauto" pattern that we find in libraries such as circe and pureconfig.

We have data models that are shared between modules in a separate sbt module, this module is typically quite contained and it only brings in circe, pureconfig or whatever library it ships encoder/decoders for.

Adding scio to such API modules is problematic because you then put that enormous dependency graph on all other upstream modules. Would it be possible to refactor the Coder mechanism to a separate module that brings with it a minimal set of dependencies?

Completely independent of this, having a pattern of "opt-in" for auto or semi auto derivation would be good I think.

regadas commented 4 years ago

Hi @jgogstad Yes we are aware of this pattern and we actually leverage it internally. We could however, improve our docs in advocating for this a little bit more. I think @jto has something worked out already.

Decoupling Coder from scio-core is an interesting idea, but there are a few places where we rely on Beam, so not sure if it will be possible. (I mean it would be possible to decouple but we would still bring some Beam deps)

I'll have a look and I'll get back to you.

anish749 commented 4 years ago

Since the primary goal here is to improve compilation time, and caching Coder seems to be providing a good benefit in a lot of cases, I am wondering what if we have this as a opt-in feature, where the user can explicitly choose to cache the Coders for certain types. This will be an advanced feature though.

So we can write

@WithCachedCoder
case class A(a:Int, b: Int)

The coder derivation macro works like it does today, however it returns a previously cached tree only if the WithCachedCoder annotation is present for that type.

or (not sure if this is possible) this annotation can be a macro which adds the implicit coder instance in the companion object during compilation. ( I understand this can be very tricky especially which fall back cases)

jgogstad commented 4 years ago

Since the primary goal here is to improve compilation time, and caching Coder seems to be providing a good benefit in a lot of cases, I am wondering what if we have this as a opt-in feature, where the user can explicitly choose to cache the Coders for certain types. This will be an advanced feature though.

This primary concern in this issue is actually making those cached coders available and ergonomic. Given an API module with case classes, it's much more ergonomic to have coder instances on the default implicit search path (companion object, package object, …). Macro or not, it's currently impossible without adding 100+ jars the APIs dependencies (transitively via scio-core).

A workaround with the current limitations is to add another API module, have scio-core in that module's dependencies, and then use newtypes or similar to get the coders on companion objects.

Decoupling Coder from scio-core is an interesting idea, but there are a few places where we rely on Beam, so not sure if it will be possible.

I suspected so. So it sounds like separating the coder infrastructure entails separating default instances as well. In my opinion that's not bad at all. Users can opt-in to the default instances by importing com.spotify.scio.generic.auto._ from scio-generic for instance.

jto commented 4 years ago

Every coder depends on Beam directly or indirectly. The Scio Coder instances don't have any serialization logic in them and all of them defer to Beam eventually. The only thing we could isolate would be the Coder trait itself, but it would be pretty useless since you would basically have no way to actually implement a Coder. I'm really not sure having coders not depend on Beam is even possible, let alone the derivation.

One thing we could do would be to get rid of beam-sdks-java-io-google-cloud-platform from scio-core's dependencies. It should not be a dep in the first place anyway. Doing so would remove most of the dependencies (beam core has few deps) but would not fix the issue completely.

An easier option would be to have coders in their own module to isolate its deps but again, it only partially fixes the issue.

jgogstad commented 4 years ago

So one solution could be to create a scio specific algebra for coders, express all coders in this algebra and then keep a mapping from the custom algebra to Beam coders in scio-core.

Just reasoning on a high level here, and I realize that might mean rewrite the whole coder infrastructure. Please feel free to close this issue if it's irrelevant, unreasonable or otherwise not applicable to the project. I see that the org.apache.beam.sdk.coders.Coder hierarchy is quite big, so maybe it's not worth the effort.

spotify / scio

Request: Separate `Coder` infrastructure into contained artifact #3018