typelevel / frameless

Expressive types for Spark.
Apache License 2.0

Use Avro as schema in TypedDataSet #282

Open SemanticBeeng opened 6 years ago

SemanticBeeng commented 6 years ago

Would it make sense to introduce support for using an Avro schema with TypedDataset?

The current code defines the schema based on the Spark SQL "language": https://github.com/typelevel/frameless/blob/576eb675dbd121453679a57ae7117e4fb53d9212/dataset/src/main/scala/frameless/TypedDatasetForwarded.scala#L43-L44

On the other hand, frameless uses a schema based on Scala types to define datasets.

Using something like avro4s, the Avro schema can be derived from those types.
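
For illustration, a minimal sketch assuming avro4s is on the classpath (the Stock case class is hypothetical here and mirrors the record shown in the parquet-tools output below):

import com.sksamuel.avro4s.AvroSchema
import org.apache.avro.Schema

// hypothetical record type, matching the Stock record shown below
case class Stock(symbol: String, date: String, open: Double, high: Double,
                 low: Double, close: Double, volume: Int, adjClose: Double)

// avro4s derives the Avro schema from the case class
val stockSchema: Schema = AvroSchema[Stock]
println(stockSchema.toString(true)) // the Avro record schema, pretty-printed as JSON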

It is quite useful to be able to use Avro as the schema in Parquet files, for example: https://dzone.com/articles/understanding-how-parquet

$ export HADOOP_CLASSPATH=parquet-avro-1.4.3.jar:parquet-column-1.4.3.jar:parquet-common-1.4.3.jar:parquet-encoding-1.4.3.jar:parquet-format-2.0.0.jar:parquet-generator-1.4.3.jar:parquet-hadoop-1.4.3.jar:parquet-hive-bundle-1.4.3.jar:parquet-jackson-1.4.3.jar:parquet-tools-1.4.3.jar

$ hadoop parquet.tools.Main meta stocks.parquet
creator:     parquet-mr (build 3f25ad97f209e7653e9f816508252f850abd635f)
extra:       avro.schema = {"type":"record","name":"Stock","namespace" [more]...

file schema: hip.ch5.avro.gen.Stock
--------------------------------------------------------------------------------
symbol:      REQUIRED BINARY O:UTF8 R:0 D:0
date:        REQUIRED BINARY O:UTF8 R:0 D:0
open:        REQUIRED DOUBLE R:0 D:0
high:        REQUIRED DOUBLE R:0 D:0
low:         REQUIRED DOUBLE R:0 D:0
close:       REQUIRED DOUBLE R:0 D:0
volume:      REQUIRED INT32 R:0 D:0
adjClose:    REQUIRED DOUBLE R:0 D:0

See also "Write Avro records to a Parquet file.": https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L34

In spark-bigquery there is already a schema converter that could be used to map to and from the Spark SQL-based schema.

See "convert between sparkSQL schemas to avro data schema" https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131

Somewhat related to https://github.com/typelevel/frameless/issues/280.

imarios commented 6 years ago

Everything around shapeless and implicit derivation works around case classes, so we can't really take them out of the picture. The right way to approach this is to keep case classes as the main way to define a schema, but have different ways in which these case classes are created.

There are ways (typically code generators) that take a schema defined externally (say, avro or protocol buffers) and convert it to a case class.

The flow would be similar to this:

.avro --> codeGen --> case class --> TypedDataset schema.
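
Concretely, a sketch (the Stock case class is a stand-in for whatever the code generator would emit from the .avsc):

import frameless.TypedDataset
import org.apache.spark.sql.SparkSession

// hypothetical case class emitted by the Avro code generator
case class Stock(symbol: String, date: String, open: Double, high: Double,
                 low: Double, close: Double, volume: Int, adjClose: Double)

implicit val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("avro-typeddataset").getOrCreate()

// the generated case class itself becomes the TypedDataset schema
val stocks: TypedDataset[Stock] = TypedDataset.create(Seq(
  Stock("AAPL", "2018-01-02", 170.16, 172.30, 169.26, 172.26, 25048048, 172.26)))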

@SemanticBeeng Is this what you had in mind?

SemanticBeeng commented 6 years ago

"shapeless and implicit (type class) derivation"

Yes, this is central, ofc.

"right way to approach this is to keep the case classes as the main way to define a schema"

Yes, indeed, that is the main design principle. It already seems to be at the basis of frameless - I have not seen it stated explicitly, though. And the direct dependency on the Spark schema seems to raise a question about it, so it is worth confirming.

"code generators"

Yes, avro4s does that very nicely.

But, on top of this, I am suggesting:

  1. Putting the Avro schema at a somewhat higher level than the Spark schema in the design of frameless.
  2. Providing first-class support for writing the Avro schema into Parquet files - see again the point above about it being "useful to be able to use avro as schema in parquet files"; a sketch follows this list.
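
A sketch of point 2, combining avro4s with parquet-avro's AvroParquetWriter (linked earlier), so that the derived Avro schema ends up embedded in the Parquet footer as in the parquet-tools output above:

import com.sksamuel.avro4s.{AvroSchema, RecordFormat}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

case class Stock(symbol: String, date: String, open: Double, high: Double,
                 low: Double, close: Double, volume: Int, adjClose: Double)

// RecordFormat converts case class instances to Avro GenericRecords
val format = RecordFormat[Stock]

val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("stocks.parquet"))
  .withSchema(AvroSchema[Stock]) // stored as the avro.schema "extra" entry in the footer
  .build()

writer.write(format.to(Stock("AAPL", "2018-01-02", 170.16, 172.30, 169.26, 172.26, 25048048, 172.26)))
writer.close()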

We know Avro is an accepted way to define microservice APIs (schemas) and that it handles schema evolution well.

Sometimes the integration is at the data level (not the API level), but the same principles, techniques and benefits apply if we use Avro for data-level integration.

Think of an analytics platform interfacing with a data lake: would it not be much better to depend on an Avro-based data schema than a Spark-specific one?

There will be components in the analytics area that do not "know" Spark but still need to validate the schema they depend on against the data itself.

Thoughts?

imarios commented 6 years ago

@SemanticBeeng sorry for taking so long to reply. I am on board with making the proper changes, if they need to happen, so that the code generated by avro4s is compatible with Frameless schemas (which for now are just simple case classes). I think there was a similar ask from @codeexplorer regarding ScalaPB.

SemanticBeeng commented 6 years ago

The gist of my suggestion was about design, less about implementation.

"Putting the avro schema at a bit higher level than spark schema in the design of frameless."

At this time frameless is bound to the Spark schema, but it does not have to be, it seems to me. It could use a data schema expressed at a higher level, closer to the domain, given that there are conversions between the higher-level schemas (Avro, Protobuf) and Spark's.

Did you get a chance to see this? "convert between sparkSQL schemas to avro data schema" https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131

SemanticBeeng commented 6 years ago

From an implementation point of view alone, avro4s and ScalaPB are quite different: ScalaPB generates very complex classes, mostly because the workings of protobuf require it.

Because of that, a few people consider it a poor way to define the type system behind one's domain models.

Please review

  1. protoless https://github.com/julien-lafont/protoless

    "ScalaPB, a protocol buffers compiler for scala, was the only serious library to work with protobuf in Scala, but it comes with: " "Two step code generation (protobuf -> java, java -> scala)" "And if you want to map your own model, you need a third wrapping level." "Heavy builder interface" "Custom lenses library"

  2. strictify https://github.com/cakesolutions/strictify

    "Because Protbuf is not your typesystem"

SemanticBeeng commented 5 years ago

Been thinking about this more.

Any interest in making frameless about more than just Spark, i.e. working across frameworks?

Petastorm "enables single / distributed training and evaluation of deep learning models directly from Apache Parquet format". Its Unischema unifies the data schema across frameworks: Spark StructType, TensorFlow tf.DType, NumPy numpy.dtype.

https://twitter.com/semanticbeeng/status/1142400720324431873 Source https://eng.uber.com/petastorm/, https://github.com/uber/petastorm

Or somehow make frameless typed schema work with Unischema.

This would advance the "beyond data pipelines" agenda and get away from the current "stringly typed" technologies like Airflow and the like, back towards functional programming, so that this kind of thinking would become mainstream.

At the very least, it would bring the data schema under strongly typed control.

Related : https://twitter.com/semanticbeeng/status/1139789288856571904