tweag / sparkle

Haskell on Apache Spark.
BSD 3-Clause "New" or "Revised" License

Batch marshaling of values #124

Closed facundominguez closed 6 years ago

facundominguez commented 6 years ago

Values of composite types like tuples and arrays require multiple calls to reify or reflect to marshal each value between Haskell and Java. When we have RDDs or Datasets containing these types, it is often cheaper to marshal multiple values together.

To be concrete, suppose we have rdd :: RDD (Double, Double) and want to evaluate RDD.map (closure (static (\(x, y) -> (y, x)))) rdd. Marshaling a pair requires one reify call and one reflect call per component, so with n elements in the rdd we make 4*n calls in total.

If we instead marshal a batch of pairs, we can do one reify and one reflect call to marshal an array with the first components of all the pairs in the batch, and similarly for the second components. Thus we need only 2 reify calls and 2 reflect calls per batch. If b is the number of pairs in a batch, we make 4*n/b calls.
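
A minimal sketch of the layout and the call counts described above. This is not sparkle or inline-java code: plain Haskell lists stand in for Java arrays, no JNI calls are made, and the names chunksOf, toColumns, fromColumns, callsPerElement and callsBatched are made up for illustration. When b does not divide n, the last batch is smaller, so the count is 4 times the number of batches rather than exactly 4*n/b.

```haskell
-- Sketch only: plain lists stand in for Java arrays; nothing is marshaled.

-- | Split a list into batches of at most b elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf b xs
  | b <= 0    = error "chunksOf: batch size must be positive"
  | null xs   = []
  | otherwise = let (batch, rest) = splitAt b xs in batch : chunksOf b rest

-- | Columnar layout of a batch of pairs: one array per component.
toColumns :: [(a, b)] -> ([a], [b])
toColumns = unzip

-- | Rebuild the pairs from the two component arrays.
fromColumns :: ([a], [b]) -> [(a, b)]
fromColumns (xs, ys) = zip xs ys

-- | Calls when every pair is marshaled on its own:
-- 2 reify calls plus 2 reflect calls per pair.
callsPerElement :: Int -> Int
callsPerElement n = 4 * n

-- | Calls when n pairs are marshaled in batches of size b:
-- 2 reify calls plus 2 reflect calls per batch.
callsBatched :: Int -> Int -> Int
callsBatched b n = 4 * length (chunksOf b [1 .. n])

main :: IO ()
main = do
  let pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)] :: [(Double, Double)]
  print (toColumns pairs)
  print (fromColumns (toColumns pairs) == pairs)
  print (callsPerElement 1000, callsBatched 100 1000)
```

For 1000 pairs and a batch size of 100, this counts 4000 calls without batching and 40 with it.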

We also have this implemented in a private fork, waiting to be ported. It is implemented as a composable scheme, meaning that if one has a way to batch the components, then one can batch the composite type.
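
One way such a composable scheme could look, as a rough sketch rather than the code in the fork: a type class with an associated batch representation, where the instance for pairs is built from the instances of its components. The class BatchMarshal, the type family Batch and the methods toBatch and fromBatch are hypothetical names chosen for this example, and lists again stand in for Java arrays.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- | Hypothetical class: how to turn many values of type a into one
-- batched representation and back. Not the inline-java API.
class BatchMarshal a where
  type Batch a :: Type
  toBatch   :: [a] -> Batch a
  fromBatch :: Batch a -> [a]

-- | Base case: a batch of Doubles is a single array
-- (a list here, a primitive Java array in a real implementation).
instance BatchMarshal Double where
  type Batch Double = [Double]
  toBatch   = id
  fromBatch = id

-- | Composite case: if both components can be batched, so can the pair.
-- The batch of pairs is a pair of batches, one per component.
-- fromBatch assumes both component batches have the same length.
instance (BatchMarshal a, BatchMarshal b) => BatchMarshal (a, b) where
  type Batch (a, b) = (Batch a, Batch b)
  toBatch ps = let (xs, ys) = unzip ps in (toBatch xs, toBatch ys)
  fromBatch (bx, by) = zip (fromBatch bx) (fromBatch by)

main :: IO ()
main = do
  -- Nested pairs get a batch representation without a dedicated instance.
  let xs = [((1.0, 2.0), 3.0), ((4.0, 5.0), 6.0)] :: [((Double, Double), Double)]
  print (toBatch xs)
  print (fromBatch (toBatch xs) == xs)
```

The point of the design is that the number of arrays to marshal depends only on the shape of the type (three arrays for the nested pair above), not on how many values are in the batch.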

facundominguez commented 6 years ago

Work in progress: https://github.com/tweag/inline-java/compare/fd/batching

facundominguez commented 6 years ago

The next steps are:

facundominguez commented 6 years ago

I'm going to invest some time in fixing distributed-closure to address one of the TODOs in https://github.com/tweag/inline-java/pull/109.