typelevel / frameless

Expressive types for Spark.
Apache License 2.0

Add TypedEncoder for shapeless Record. #777

Open tribbloid opened 11 months ago

tribbloid commented 11 months ago

Since RecordEncoder is already converting any product type into shapeless Record:

class RecordEncoder[F, G <: HList, H <: HList](
    implicit
    i0: LabelledGeneric.Aux[F, G],
    i1: DropUnitValues.Aux[G, H],
    i2: IsHCons[H],
    fields: Lazy[RecordEncoderFields[H]],
    newInstanceExprs: Lazy[NewInstanceExprs[G]],
    classTag: ClassTag[F])
    extends TypedEncoder[F] {
...

The only thing required is to break it into two stages, so that the intermediate HList/Record representation can serve as a more flexible type-level schema. It could even approximate the capability of the abandoned TypedDataFrame.
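To make the two-stage idea concrete, here is a minimal sketch that assumes shapeless 2.3.x on Scala 2. It is not the frameless implementation: `RecordToMap`, `ProductToMap` and `encode` are hypothetical names, and a `Map[String, Any]` stands in for a real Spark encoder. Stage 1 turns a case class into a labelled record, and stage 2 is defined on the record type alone, so it could in principle be reused by anything that exposes the same record shape.

```scala
import shapeless._
import shapeless.labelled.FieldType

// Stage 2: defined only on the record (an HList of labelled fields).
// Hypothetical stand-in for an encoder; not part of frameless.
trait RecordToMap[R <: HList] {
  def apply(r: R): Map[String, Any]
}

object RecordToMap {
  implicit val hnil: RecordToMap[HNil] =
    new RecordToMap[HNil] {
      def apply(r: HNil): Map[String, Any] = Map.empty
    }

  implicit def hcons[K <: Symbol, V, T <: HList](
      implicit
      key: Witness.Aux[K],  // recovers the field label at runtime
      tail: RecordToMap[T]
  ): RecordToMap[FieldType[K, V] :: T] =
    new RecordToMap[FieldType[K, V] :: T] {
      def apply(r: FieldType[K, V] :: T): Map[String, Any] =
        tail(r.tail) + (key.value.name -> r.head)
    }
}

// Stage 1: product type -> record, then delegate to stage 2.
trait ProductToMap[F] {
  def apply(value: F): Map[String, Any]
}

object ProductToMap {
  implicit def viaRecord[F, G <: HList](
      implicit
      gen: LabelledGeneric.Aux[F, G],
      rec: RecordToMap[G]
  ): ProductToMap[F] =
    new ProductToMap[F] {
      def apply(value: F): Map[String, Any] = rec(gen.to(value))
    }
}

case class Person(name: String, age: Int)

object TwoStageSketch {
  def encode[F](value: F)(implicit enc: ProductToMap[F]): Map[String, Any] =
    enc(value)

  // Goes through both stages: Person -> record -> Map
  val encoded: Map[String, Any] = encode(Person("Ada", 36))
}
```

Because `RecordToMap` never mentions the original case class, the same instance would serve any type whose record representation matches, which is the flexibility the two-stage split is after.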

I also realised that i0 through i2 are not used in the function body. i2 is important for rejecting HNil, but are i0 and i1 necessary?
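As a small illustration of the role an otherwise unused `IsHCons` plays, the sketch below (assuming shapeless 2.3.x on Scala 2) defines a hypothetical marker type class, `NonEmptyProduct`, that can only be derived for products with at least one field. It does not use frameless's `DropUnitValues`; plain `Generic` stands in for the first step.

```scala
import shapeless._
import shapeless.ops.hlist.IsHCons
import shapeless.test.illTyped

// Hypothetical marker type class; not part of frameless.
trait NonEmptyProduct[F]

object NonEmptyProduct {
  implicit def derive[F, G <: HList](
      implicit
      gen: Generic.Aux[F, G],  // never called, but it is what fixes G for a given F
      nonEmpty: IsHCons[G]     // never called either; no instance exists for HNil
  ): NonEmptyProduct[F] = new NonEmptyProduct[F] {}
}

case class Person(name: String, age: Int)
case class Empty()

object IsHConsSketch {
  // Resolves: the representation of Person is String :: Int :: HNil
  val ok: NonEmptyProduct[Person] = implicitly[NonEmptyProduct[Person]]

  // The representation of Empty is HNil, so derivation must fail to compile;
  // illTyped turns that expectation into a compile-time check.
  illTyped("implicitly[NonEmptyProduct[Empty]]")
}
```

In this sketch neither implicit is referenced in the body, yet removing `gen` would leave `G` undetermined and removing `nonEmpty` would let `Empty` through. Whether the same reasoning applies to each of i0 and i1 in RecordEncoder depends on how the derivation entry point is written.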

pomadchin commented 11 months ago

Hey there; indeed in RecordEncoder those are not required. However these implicits are necessary for the TypedEncoder.usingDerivation function.

I don't think they are bad here; at the very least they work as a sanity check for us. But any improvement PRs are very much welcome.

I didn't quite follow the part about shapeless.Record and the two stages; usually the idea is to hide shapeless inside and not let it leak into the user API. But shoot a PR, I'd be happy to help you get it merged :+1:

tribbloid commented 11 months ago

@pomadchin voila, adding an experimental PR that adopts 2-stage RecordEncoder derivation.

The 1st stage is now also used for TypedRow[T <: HList], which can be seen as a successor of the abandoned TypedDataFrame. See the new test for an example usage:

https://github.com/typelevel/frameless/pull/778/commits/0baf604bff735d40c68994e89b9af63de573b30b#diff-dd83f3b1d1a249804b5620473177ce6034efbc5f36b45a9b1ef01283cafd50f9R93

tribbloid commented 11 months ago

It is only an experiment and will need some serious cleanup (particularly the scalafmt part) and an API revision before it becomes a feature.

Ideally, I would like to see schema-changing transformations like withColumnReplaced and withColumn yield a:

TypedDataFrame = TypedDataSet[TypedRow[H]], which preserves both the generics and the labels of the source case class.

Instead of the old:

TypedDataSet[TupleX], which degrades all column names into _1, _2, etc.
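The difference in labels can be seen at the type level without Spark. The following sketch assumes shapeless 2.3.x on Scala 2 and uses hypothetical helpers (`Labels`, `ProductLabels`, `labelsOf`) that are not part of frameless; it only reads the field labels that `LabelledGeneric` exposes for a case class and for a tuple.

```scala
import shapeless._
import shapeless.labelled.FieldType

// Collects the labels carried by a record type. Hypothetical helper.
trait Labels[R <: HList] {
  def names: List[String]
}

object Labels {
  implicit val hnil: Labels[HNil] =
    new Labels[HNil] { def names: List[String] = Nil }

  implicit def hcons[K <: Symbol, V, T <: HList](
      implicit
      key: Witness.Aux[K],
      tail: Labels[T]
  ): Labels[FieldType[K, V] :: T] =
    new Labels[FieldType[K, V] :: T] {
      def names: List[String] = key.value.name :: tail.names
    }
}

// Labels of any product type, obtained through its record representation.
trait ProductLabels[F] {
  def names: List[String]
}

object ProductLabels {
  implicit def viaRecord[F, G <: HList](
      implicit
      gen: LabelledGeneric.Aux[F, G],
      labels: Labels[G]
  ): ProductLabels[F] =
    new ProductLabels[F] { def names: List[String] = labels.names }
}

case class Person(name: String, age: Int)

object LabelsSketch {
  def labelsOf[F](implicit l: ProductLabels[F]): List[String] = l.names

  val fromCaseClass: List[String] = labelsOf[Person]        // List("name", "age")
  val fromTuple: List[String]     = labelsOf[(String, Int)] // List("_1", "_2")
}
```

A record-based row type keeps the first kind of labels through a transformation, while a tuple result can only ever offer the second.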