Closed buonuomo closed 2 years ago
Something I realized a week or two ago while converting the dataframe example program to use the safe interface here: sometimes we do have `RDD`s (or `Dataset`s) of reference types, most notably an `RDD Row` or `Dataset Row`. In such a case, we're not able to `collect` from such an `RDD`, as there's no `Reify` instance for `Row` (or any safe reference types, I believe). The fact that there's no `Reify` instance is a good thing, as otherwise we'd be able to get an `Ur Row`, which is not safe, but this does reveal the need for a separate class of collection functions that return linearly bound reference types.
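To make the distinction concrete, here is a minimal toy model of the two shapes of collection function. Everything in it is a stand-in invented for illustration (`Ur`, `RDD`, `Row`, the marker-style `Reify` class, and both signatures); it is not sparkle's real API, and the real functions would run in a linear IO monad rather than being pure.

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE LinearTypes #-}
module Main where

-- Toy stand-ins only; none of these are sparkle's real definitions.

-- | An unrestricted value: once wrapped in 'Ur', it may be used freely.
data Ur a where
  Ur :: a -> Ur a

-- | Stand-in for an RDD. The payload field is unrestricted so the toy
-- functions below can move it out of a linearly consumed 'RDD'.
data RDD a where
  RDD :: [a] -> RDD a

-- | Stand-in for a JVM reference type that must stay linearly bound.
newtype Row = Row String

-- | Marker for types that are safe to copy into plain Haskell values.
class Reify a
instance Reify Int
-- Deliberately no 'Reify Row' instance: it would let callers obtain 'Ur Row'.

-- | Hypothetical shape of the existing collection function.
collect :: Reify a => RDD a %1 -> Ur [a]
collect (RDD xs) = Ur xs

-- | Hypothetical shape of the analog for reference types: no 'Reify'
-- constraint, and the result is bound linearly instead of wrapped in 'Ur'.
collectJ :: RDD a %1 -> [a]
collectJ (RDD xs) = xs

-- | Consume a single 'Row' exactly once.
rowLabel :: Row %1 -> String
rowLabel (Row s) = s

-- | Consume every collected 'Row' exactly once.
rowLabels :: [Row] %1 -> [String]
rowLabels [] = []
rowLabels (r : rs) = rowLabel r : rowLabels rs

main :: IO ()
main = do
  case collect (RDD [1, 2, 3 :: Int]) of
    Ur xs -> print xs
  print (rowLabels (collectJ (RDD [Row "a", Row "b"])))
  -- collect (RDD [Row "a"]) would be rejected: no instance for 'Reify Row'.
```

The point of the sketch is only the difference in result types: `collect` hands back an `Ur`, so it needs the `Reify` constraint, while the `collectJ`-style function keeps its result under linear control.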
As it stands, there are not too many such functions in `sparkle`, so simply adding analogs like `collectJ` for each one will probably accomplish what we need for now with minimal conceptual overhead and a fixed amount of boilerplate per relevant function. At first glance, it seems like a type-class solution might be able to eliminate these near-duplicate functions, but I suspect that doing so would adversely affect the comprehensibility of the relevant type signatures in nontrivial ways.
I'll publish a draft of this solution (along with the example dataframe program using it) soon, in order to demonstrate my proposed fix.
Sounds good, Noah! Thanks for the update.
Okay, I've finally added `collectJ` and its companions. In addition, I've ported the dataframe example program over to the safe interface, and it appears to work fine. However, I've run into a hurdle in porting over rdd-ops, which is that we don't have any safe reify or reflect instances for streams, meaning that `reduce` doesn't work. At the moment, I'm not sure if writing these instances will simply involve wrapping the unsafe instances, or if it will require porting the entirety of `jvm-streaming` to use linear types.
> At the moment, I'm not sure if writing these instances will simply involve wrapping the unsafe instances, or if it will require porting the entirety of `jvm-streaming` to use linear types.
I think wrapping the unsafe instances could be a fair stopgap. Not wrapping will require porting `jvm-streaming`, which is small, but it depends on `jvm-batching`, and it will take some effort to port the two (not a lot of effort, but still some).
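For reference, the wrapping stopgap could look roughly like the toy sketch below. The names `JIterator`, `unsafeReifyStream`, and `reifyStream` are hypothetical stand-ins, not the real `jvm-streaming` API, and `toLinear` is defined locally with `unsafeCoerce` (linear-base exposes a similar helper in `Unsafe.Linear`). The wrapper only changes the type; nothing checks that the wrapped function really uses its argument once.

```haskell
{-# LANGUAGE LinearTypes #-}
module Main where

import Unsafe.Coerce (unsafeCoerce)

-- Toy stand-ins only; not the real jvm-streaming or sparkle API.

-- | Stand-in for a JVM iterator reference.
newtype JIterator = JIterator [Int]

-- | Stand-in for an existing non-linear ("unsafe") conversion.
unsafeReifyStream :: JIterator -> [Int]
unsafeReifyStream (JIterator xs) = xs

-- | Reinterpret an ordinary function as a linear one. The type checker
-- cannot verify the claim; whoever writes the wrapper vouches that the
-- wrapped function uses its argument exactly once.
toLinear :: (a -> b) -> (a %1 -> b)
toLinear = unsafeCoerce

-- | Linear wrapper around the unsafe conversion.
reifyStream :: JIterator %1 -> [Int]
reifyStream = toLinear unsafeReifyStream

main :: IO ()
main = print (sum (reifyStream (JIterator [1, 2, 3])))
```

The trade-off is the one described above: this is cheap to write, but the safety argument lives in a comment rather than in the types, which is what a full port of the two packages would fix.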
After another look, I think this is good enough to merge as is. We can address the missing bits in subsequent PRs.
This is my rewrite of all the `SQL` modules in sparkle for the safe interface. Currently everything compiles, but I still need to test a few things, so it's not ready to merge yet.