Open jgogstad opened 4 years ago
Hi @jgogstad Yes we are aware of this pattern and we actually leverage it internally. We could however, improve our docs in advocating for this a little bit more. I think @jto has something worked out already.
Decoupling Coder
from scio-core
is an interesting idea, but there are a few places where we rely on Beam
, so not sure if it will be possible. (I mean it would be possible to decouple but we would still bring some Beam deps)
I'll have a look and I'll get back to you.
Since the primary goal here is to improve compilation time, and caching Coder seems to be providing a good benefit in a lot of cases, I am wondering what if we have this as a opt-in
feature, where the user can explicitly choose to cache the Coders for certain types. This will be an advanced feature though.
So we can write
@WithCachedCoder
case class A(a:Int, b: Int)
The coder derivation macro works like it does today, however it returns a previously cached tree only if the WithCachedCoder
annotation is present for that type.
or (not sure if this is possible) this annotation can be a macro which adds the implicit
coder instance in the companion object during compilation. ( I understand this can be very tricky especially which fall back cases)
Since the primary goal here is to improve compilation time, and caching Coder seems to be providing a good benefit in a lot of cases, I am wondering what if we have this as a opt-in feature, where the user can explicitly choose to cache the Coders for certain types. This will be an advanced feature though.
This primary concern in this issue is actually making those cached coders available and ergonomic. Given an API module with case classes, it's much more ergonomic to have coder instances on the default implicit search path (companion object, package object, …). Macro or not, it's currently impossible without adding 100+ jars the APIs dependencies (transitively via scio-core).
A workaround with the current limitations is to add another API module, have scio-core in that module's dependencies, and then use newtypes or similar to get the coders on companion objects.
Decoupling Coder from scio-core is an interesting idea, but there are a few places where we rely on Beam, so not sure if it will be possible.
I suspected so. So it sounds like separating the coder infrastructure entails separating default instances as well. In my opinion that's not bad at all. Users can opt-in to the default instances by importing com.spotify.scio.generic.auto._
from scio-generic
for instance.
Every coder depends on Beam directly or indirectly. The Scio Coder
instances don't have any serialization logic in them and all of them defer to Beam eventually.
The only thing we could isolate would be the Coder
trait itself, but it would be pretty useless since you would basically have no way to actually implement a Coder
. I'm really not sure having coders not depend on Beam is even possible, let alone the derivation.
One thing we could do would be to get rid of beam-sdks-java-io-google-cloud-platform
from scio-core
's dependencies. It should not be a dep in the first place anyway. Doing so would remove most of the dependencies (beam core has few deps) but would not fix the issue completely.
An easier option would be to have coders in their own module to isolate its deps but again, it only partially fixes the issue.
So one solution could be to create a scio specific algebra for coders, express all coders in this algebra and then keep a mapping from the custom algebra to Beam coders in scio-core
.
Just reasoning on a high level here, and I realize that might mean rewrite the whole coder infrastructure. Please feel free to close this issue if it's irrelevant, unreasonable or otherwise not applicable to the project. I see that the org.apache.beam.sdk.coders.Coder
hierarchy is quite big, so maybe it's not worth the effort.
The auto derivation of
Coder
instances has a large impact on client's compile times (discussed in https://github.com/spotify/scio/issues/2811, https://github.com/spotify/scio/issues/1898). One way to work around this is to explicitly cache derived coders in the default implicit scope for the types in question, so for exampleAnd then do the same for
Artist
,Boiler
andCar
.We've reduced compilation times with minutes by using this technique. You'll recognize that this pattern is very similar to the "semiauto" pattern that we find in libraries such as circe and pureconfig.
We have data models that are shared between modules in a separate sbt module, this module is typically quite contained and it only brings in circe, pureconfig or whatever library it ships encoder/decoders for.
Adding
scio
to such API modules is problematic because you then put that enormous dependency graph on all other upstream modules. Would it be possible to refactor theCoder
mechanism to a separate module that brings with it a minimal set of dependencies?Completely independent of this, having a pattern of "opt-in" for auto or semi auto derivation would be good I think.