Open SemanticBeeng opened 6 years ago
Everything around shapeless and implicit derivation works around case classes, so we can't really take that out of the picture. The right way to approach this is to keep the case classes as the main way to define a schema, but have different ways to get to how these case classes are created. There are ways (typically code generators) that take a schema defined externally (say, avro or protocol buffers) and convert it to a case class.
The flow would look similar to this:
.avro --> codeGen --> case class --> TypedDataset schema.
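For concreteness, a minimal sketch of the last two steps, assuming a hypothetical Person case class that a generator could have emitted (note the exact implicit that TypedDataset.create needs varies a bit across frameless versions):

```scala
import frameless.TypedDataset
import org.apache.spark.sql.{SQLContext, SparkSession}

// Hypothetical output of the codeGen step (.avro --> case class).
case class Person(name: String, age: Int)

object Demo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
  implicit val sqlContext: SQLContext = spark.sqlContext

  // The TypedDataset schema is derived from Person via shapeless/TypedEncoder.
  val people: TypedDataset[Person] =
    TypedDataset.create(Seq(Person("Ada", 36), Person("Alan", 41)))

  // The underlying Spark schema falls out of the case class definition.
  people.dataset.printSchema()
}
```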
@SemanticBeeng Is this what you had in mind?
"shapeless and implicit (type class) derivation"
Yes, this is central, ofc.
"right way to approach this is to keep the case classes as the main way to define a schema"
Yes, indeed, main design principle.
This is already at the basis of frameless
, it seems - have not see it stated explicitly, though.
And the direct dependency on the spark
schema seems to raise a question it, so, worth confirming.
"code generators"
Yes, avro4s
does that very nicely.
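For reference, here is the derivation direction avro4s supports out of the box: an Avro schema obtained from a case class (Person is just an illustrative type):

```scala
import com.sksamuel.avro4s.AvroSchema

// Hypothetical domain type used for illustration.
case class Person(name: String, age: Int)

// avro4s derives the Avro schema from the case class at compile time.
val personSchema: org.apache.avro.Schema = AvroSchema[Person]

// Prints the familiar Avro JSON record definition for Person.
println(personSchema.toString(true))
```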
But, on top of this, I am suggesting:
1. avro schema at a bit higher level than spark schema in the design of frameless.
2. avro schema in parquet files - please see again from "useful to be able to use avro as schema in parquet files".
We know avro is an accepted way to define microservice APIs (schemas) and that it does schema evolution well. Sometimes the integration is at the data level (not the API level): the same principles, techniques and benefits apply if we use avro for data-level integration.
Think of an analytics platform interfacing with a data lake: would it not be much better to depend on an avro-based data schema than on a spark-specific one? There will be components in the analytics area that do not "know" spark but still need to validate, in the data itself, the schema they depend on.
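To make that concrete, a small sketch of what such a spark-unaware component could do with only the plain Avro Java library (the Person schema is a made-up example):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Hypothetical Person schema, parsed from plain Avro JSON.
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Person","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}
    |]}""".stripMargin)

val record: GenericRecord = new GenericData.Record(schema)
record.put("name", "Ada")
record.put("age", 36)

// Schema validation with the plain Avro library: no Spark on the classpath.
assert(GenericData.get.validate(schema, record))
```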
Thoughts?
@SemanticBeeng sorry for taking long to reply. I am on board with making the proper changes, if they need to happen, so that the code generated by avro4s is compatible with Frameless schemas (which for now are just simple case classes). I think there was a similar ask by @codeexplorer regarding scalaPB.
The gist of my suggestion was about design and less about implementation.
"Putting the avro schema at a bit higher level than spark schema in the design of frameless."
At this time frameless is bound to the spark schema, but it does not have to be, it seems to me. It could use a data schema expressed at a higher level, closer to the domain, given that there are conversions between the higher-level schemas (avro, protobuf) and spark's.
Did you get a chance to see this? "convert between sparkSQL schemas to avro data schema" https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131
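A sketch of the direction this enables, assuming avro4s plus a SchemaConverters like the one linked above (from the databricks spark-avro lineage) on the classpath:

```scala
import com.databricks.spark.avro.SchemaConverters
import com.sksamuel.avro4s.AvroSchema
import org.apache.spark.sql.types.StructType

case class Person(name: String, age: Int)

// Avro schema derived from the domain type (avro4s)...
val avroSchema: org.apache.avro.Schema = AvroSchema[Person]

// ...converted to a Spark schema only at the point where Spark needs it.
val sparkSchema: StructType =
  SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
```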
From an implementation point of view alone, avro4s and scalapb are quite different: scalapb generates very complex classes, mostly because the protobuf workings require it. Because of that, a few people consider it a poor way to define the type system behind one's domain models.
Please review:
protoless https://github.com/julien-lafont/protoless
"ScalaPB, a protocol buffers compiler for scala, was the only serious library to work with protobuf in Scala, but it comes with:"
- "Two step code generation (protobuf -> java, java -> scala)" "And if you want to map your own model, you need a third wrapping level."
- "Heavy builder interface"
- "Custom lenses library"
strictify https://github.com/cakesolutions/strictify
"Because Protobuf is not your typesystem"
Been thinking about this more.
Any interest in making frameless about more than spark only, i.e. cross-framework?
Petastorm "enables single / distributed training and evaluation of deep learning models directly from Apache Parquet format".
Unischema: unifies #DataFabric schema over frameworks: #ApacheSpark StructType, Tensorflow tf.DType, #Numpy numpy.dtype
https://twitter.com/semanticbeeng/status/1142400720324431873
Source: https://eng.uber.com/petastorm/, https://github.com/uber/petastorm
Or somehow make the frameless typed schema work with Unischema.
This would advance the "beyond data pipeline" agenda, get away from current "stringly typed" technologies like Airflow and the like, and bring us back into functional programming, so that this kind of thinking becomes mainstream.
Capturing data pipeline errors functionally with Writer Monads
Functional programming and Spark: do they mix? (broken atm)
At least bring the data schema under strongly typed control.
Related: https://twitter.com/semanticbeeng/status/1139789288856571904
Would it make sense to introduce support for avro schema for TypedDataSet?
The current code defines schema based on the SparkSQL "language": https://github.com/typelevel/frameless/blob/576eb675dbd121453679a57ae7117e4fb53d9212/dataset/src/main/scala/frameless/TypedDatasetForwarded.scala#L43-L44
On the other hand, frameless uses a Scala-types-based "schema" to define data sets.
Using something like avro4s, the avro schema can be derived from types.
It is quite useful to be able to use avro as schema in parquet files, for example: https://dzone.com/articles/understanding-how-parquet
See also "Write Avro records to a Parquet file.": https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L34
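Putting those two pieces together, a hedged sketch: an avro4s-derived schema fed to parquet-avro's AvroParquetWriter (Person is a hypothetical type):

```scala
import com.sksamuel.avro4s.AvroSchema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

case class Person(name: String, age: Int)

// Schema derived from the type (avro4s), then used as the parquet file schema.
val schema = AvroSchema[Person]

val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("/tmp/people.parquet"))
  .withSchema(schema)
  .build()

val record: GenericRecord = new GenericData.Record(schema)
record.put("name", "Ada")
record.put("age", 36)

writer.write(record)
writer.close()
```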
In spark-bigquery there is already a schema converter that could be used to map to and from the SparkSql-based schema. See "convert between sparkSQL schemas to avro data schema": https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131
Somewhat related to https://github.com/typelevel/frameless/issues/280.