typelevel / frameless

Expressive types for Spark.
Apache License 2.0

[feature] DatasetT #704

Open jpassaro opened 1 year ago

jpassaro commented 1 year ago

I'm reading about this library and think I'm going to use it in my next Spark project. I'm really motivated by the ability to reduce needless runtime errors that should be detectable at compile time, and equally by wanting an ergonomic error channel for true runtime errors.

What I see gives me confidence that I can accomplish that using the cats integration with typed datasets. There's one thing that could make it a lot more ergonomic: one of the biggest places I tend to get runtime errors is at the read/write boundary, say when I'm trying to read a table that doesn't exist, or to read/write where the schema on disk is incompatible with the one I expect. I can obviously handle this with the existing TypedDataset API after wrapping the IO boundaries in Sync[F].delay, but it would be nice to cover not only Dataset manipulation but also Dataset creation and the subsequent manipulation in a single type-safe DSL.
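For context, here is roughly what that wrapping looks like today (a minimal sketch; Event is a made-up case class, the path is just an example, and I'm assuming frameless's TypedDataset.createUnsafe to go from the raw DataFrame to a TypedDataset):

```scala
import cats.effect.Sync
import frameless.{TypedDataset, TypedEncoder}
import org.apache.spark.sql.SparkSession

// Hypothetical record type, purely for illustration.
case class Event(id: Long, name: String)

// The read is the IO boundary: the table may be missing or the on-disk schema
// may not match, so the whole thing is suspended in F and failures surface in
// F's error channel instead of blowing up at construction time.
def readEvents[F[_]: Sync](spark: SparkSession)(
    implicit enc: TypedEncoder[Event]): F[TypedDataset[Event]] =
  Sync[F].delay {
    TypedDataset.createUnsafe[Event](spark.read.parquet("/data/events"))
  }
```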

To that end, two more-or-less isomorphic ideas come to mind. Both expect, at a minimum, evidence of Monad[F] (maybe only FlatMap[F] for the first one) and TypedEncoder[A].

1) additional syntax for F[TypedDataset[A]] that adds all the TypedDataset methods, also wrapped in F[_].

2) an OptionT-like data class wrapping F[TypedDataset[A]]. Naming can be debated, but for the sake of presentation, call it DatasetT. It has a default constructor

def apply[F[_]: FlatMap: Ask[*[_], SparkSession], A: TypedEncoder](f: SparkSession => F[TypedDataset[A]])

and various syntactically or situationally preferable variations.
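To make (2) a bit more concrete, here is a rough sketch of what I have in mind (nothing here exists in frameless; DatasetT and mapDs are placeholder names, and I've spelled the constraints out as an explicit implicit list instead of using kind-projector):

```scala
import cats.{FlatMap, Functor}
import cats.mtl.Ask
import frameless.{TypedDataset, TypedEncoder}
import org.apache.spark.sql.SparkSession

// Rough sketch only: an OptionT-style wrapper around F[TypedDataset[A]].
final case class DatasetT[F[_], A](value: F[TypedDataset[A]]) {

  // Lift a plan-level transformation into the wrapper, threading F through.
  def mapDs[B](f: TypedDataset[A] => TypedDataset[B])(
      implicit F: Functor[F]): DatasetT[F, B] =
    DatasetT(F.map(value)(f))
}

object DatasetT {

  // The proposed default constructor: the SparkSession is pulled from the
  // context via Ask, and the caller's function builds the dataset in F.
  // The TypedEncoder evidence mirrors the constraints mentioned above.
  def apply[F[_], A](f: SparkSession => F[TypedDataset[A]])(
      implicit F: FlatMap[F],
      S: Ask[F, SparkSession],
      enc: TypedEncoder[A]): DatasetT[F, A] =
    new DatasetT(F.flatMap(S.ask)(f))
}
```

Option (1) would essentially be the same combinators exposed as extension methods directly on F[TypedDataset[A]].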

Have either of these patterns been considered? Is there any reason they wouldn't make sense to adopt?

Assuming not, I'll try writing it in a new project and, assuming it proves itself, will create a PR.

pomadchin commented 1 year ago

Hey @jpassaro, I'd be happy to see what you come up with! Passing the SparkSession through the context is definitely a good idea and works nicely!

I'll just add that DataFrame / Dataset is itself a DSL and represents an execution description. Ideally, operations on it are not effectful until a reduce/action is invoked explicitly. In reality, some operations may still be effectful (e.g. DataFrame metadata interactions), so making a nice API there can be challenging.
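E.g., roughly (just a sketch with cats-effect, not an existing frameless API): the transformations stay pure plan-building, and only the action needs to be suspended.

```scala
import cats.effect.Sync
import frameless.TypedDataset

// Transformations only build the execution plan; nothing runs on the
// cluster here, so they can stay pure (filter/select/etc. would go here).
def transform[A](ds: TypedDataset[A]): TypedDataset[A] = ds

// The action is the effect boundary: this is what actually touches the
// cluster and is what you'd want suspended in F.
def materialize[F[_]: Sync, A](ds: TypedDataset[A]): F[Seq[A]] =
  Sync[F].delay(ds.dataset.collect().toSeq)
```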