spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0
2.55k stars 514 forks source link

Move scio-avro to magnolify-avro #4540

Open shnapz opened 1 year ago

shnapz commented 1 year ago

Currently scio-avro uses com.spotify.scio.avro.types.AvroType[T] that implements "case class <-> GenericRecord" codec using scala.reflect.macros._ in com.spotify.scio.avro.types.TypeProvider. This was implemented in 2017-2019.

clairemcginty commented 1 year ago

I was curious about performance differences between Magnolify and Scio Avro macros, so I set up a quick JMH test to compute average runtimes of Avro macro operations.

all jmh tests were run with 10 iterations, 10 warmup iterations, 1 fork, 1 thread on my m1 mac, java 11.

Method magnolify.avro.AvroType com.spotify.scio.avro.types.AvroType
to (CC => GenericRecord) 1,872.553 ns/op 241,657.438 ns/op
from (GenericRecord => CC) 848.955 ns/op 1,037.100 ns/op
schema 2.820 ns/op 2,994.886 ns/op

The Scio AvroType#to figure is not a typo 😬 would appreciate a second pair of eyes on the jmh test setup to rule out user code error, but yeah - scary, and extra motivation for us to migrate!

IMO we should deprecate Scio's AvroType in the next release and add read/write methods to scio-avro that support the type bound T : magnolify.AvroType. It would be ideal if we could overload data.saveAsTypedAvroFile/sc.typedAvroFile to have a separate signature supporting T <: HasAvroAnnotation and T: magnolify.AvroType, so that it would be an easier migration on the user side, but I'm not sure if the compiler will complain.

clairemcginty commented 1 year ago

As a followup, I was curious about BQ macro performance for Magnolify vs vanilla Scio, so I ran those too (same run parameters as for avro):

Method magnolify.bigquery.TableRowType com.spotify.scio.bigquery.types.BigQueryType
to (CC => TableRow) 19,622.473 ns/op 33,009.115 ns/op
from (TableRow => CC) 15,935.708 ns/op 15,213.949 ns/op
schema 2.808 ns/op 778.948 ns/op

the numbers here suggest that there may be some room for improvement on the Magnolify side 🤔