tweag / sparkle

Haskell on Apache Spark.
BSD 3-Clause "New" or "Revised" License

Safe Spark SQL modules #160

Closed buonuomo closed 2 years ago

buonuomo commented 3 years ago

This is my rewrite of all the SQL modules in sparkle for the safe interface. Currently everything compiles, but I still need to test a few things, so it's not ready to merge yet.

buonuomo commented 3 years ago

Something I realized a week or two ago while converting the dataframe example program to use the safe interface: sometimes we do have RDDs (or Datasets) of reference types, most notably an RDD Row or Dataset Row. In that case we can't collect from the RDD, as there's no Reify instance for Row (or for any safe reference type, I believe). The missing Reify instance is a good thing: otherwise we'd be able to get an Ur Row, which is not safe. But it does reveal the need for a separate class of collection functions that return linearly bound reference types.

As it stands, there are not too many such functions in sparkle, so simply adding analogs like collectJ for each one will probably accomplish what we need for now, with minimal conceptual overhead and a fixed amount of boilerplate per relevant function. At first glance, a type-class solution might seem able to eliminate these near-duplicate functions, but I suspect that doing so would make the relevant type signatures noticeably harder to read.

I'll publish a draft of this solution (along with the example dataframe program using it) soon, in order to demonstrate my proposed fix.
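To make the distinction concrete, here is a minimal, self-contained sketch. It is not sparkle's actual API: `Ur`, `RDD`, `Row`, and `Reify` are simplified mocks defined in the block itself, the linear IO monad is left out, and the signatures of `collect` and `collectJ` are assumptions chosen only to show the idea. It needs GHC 9.0 or later for `LinearTypes`.

```haskell
{-# LANGUAGE GADTSyntax #-}
{-# LANGUAGE LinearTypes #-}

module Main where

-- Unrestricted wrapper with the same shape as Ur in linear-base:
-- a value inside Ur may be used any number of times.
data Ur a where
  Ur :: a -> Ur a

-- Mock of a JVM-backed row (hypothetical; the real type wraps a JVM reference).
newtype Row = Row String

-- Mock RDD. The field is unrestricted (plain arrow in GADT syntax),
-- which keeps the mock implementations below trivial.
data RDD a where
  RDD :: [a] -> RDD a

-- Hypothetical stand-in for Reify: types that can be copied into
-- ordinary Haskell values. There is deliberately no instance for Row.
class Reify a
instance Reify Int

-- Copies the elements out, so the result can be shared freely.
collect :: Reify a => RDD a %1 -> Ur [a]
collect (RDD xs) = Ur xs

-- Returns the elements themselves. The result is not wrapped in Ur,
-- so a linear caller must consume each element exactly once.
collectJ :: RDD a %1 -> [a]
collectJ (RDD xs) = xs

-- Mock of reading a row and releasing its reference.
rowToString :: Row %1 -> String
rowToString (Row s) = s

-- Consumes every row exactly once.
showRows :: [Row] %1 -> [String]
showRows [] = []
showRows (r : rs) = rowToString r : showRows rs

main :: IO ()
main = do
  -- Values: xs is unrestricted, so it can be used twice here.
  case collect (RDD [1, 2, 3 :: Int]) of
    Ur xs -> print (sum xs + length xs)
  -- References: `collect (RDD [Row "alice"])` would be rejected,
  -- because there is no Reify Row instance; collectJ is used instead.
  mapM_ putStrLn (showRows (collectJ (RDD [Row "alice", Row "bob"])))
```

In this mock, the missing `Reify Row` instance is what rules out an `Ur Row`, while `collectJ` still gives access to the rows under a linear discipline.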

facundominguez commented 3 years ago

Sounds good, Noah! Thanks for the update.

buonuomo commented 3 years ago

Okay, I've finally added collectJ and its companions. In addition, I've ported the dataframe example program over to the safe interface, and it appears to work fine. However, I've run into a hurdle in porting over rdd-ops: we don't have any safe Reify or Reflect instances for streams, so reduce doesn't work. At the moment, I'm not sure if writing these instances will simply involve wrapping the unsafe instances, or if it will require porting the entirety of jvm-streaming to use linear types.

facundominguez commented 3 years ago

> At the moment, I'm not sure if writing these instances will simply involve wrapping the unsafe instances, or if it will require porting the entirety of jvm-streaming to use linear types.

I think wrapping the unsafe instances could be a fair stopgap. The alternative is porting jvm-streaming, which is small but depends on jvm-batching, so both packages would need to be ported (not a lot of effort, but still some).
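As a rough illustration of what "wrapping" could look like, the sketch below gives an existing non-linear function a linear type by coercion. Everything in it is an assumption made for illustration: `JStream`, `unsafeReifyStream`, and `reifyStream` are mocks, not the jvm-streaming API, and `toLinear` is defined locally with `unsafeCoerce` rather than taken from a library.

```haskell
{-# LANGUAGE LinearTypes #-}

module Main where

import Unsafe.Coerce (unsafeCoerce)

-- Mock JVM reference to a stream of Ints (hypothetical stand-in).
newtype JStream = JStream [Int]

-- Mock of an existing unsafe reify. Its type does not stop a caller
-- from using the reference again after it has been consumed.
unsafeReifyStream :: JStream -> [Int]
unsafeReifyStream (JStream xs) = xs

-- Give a non-linear function a linear type. The compiler no longer
-- checks the claim: the author of the wrapper must make sure the
-- wrapped function really uses its argument exactly once.
toLinear :: (a -> b) -> (a %1 -> b)
toLinear = unsafeCoerce

-- Safe-looking wrapper: callers in linear code can pass the stream
-- reference only once.
reifyStream :: JStream %1 -> [Int]
reifyStream = toLinear unsafeReifyStream

main :: IO ()
main = print (sum (reifyStream (JStream [1, 2, 3])))
```

The trade-off is the one described above: the wrapper is quick to write, but its linearity is asserted rather than checked, whereas a real port would have the compiler verify it.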

facundominguez commented 2 years ago

After another look, I think this is good enough to merge as is. We can address the missing bits in subsequent PRs.