Utilities for custom partitioning

radanalyticsio / silex

something to help you spark

Apache License 2.0

65 stars 13 forks source link

Utilities for custom partitioning #31

Open rnowling opened 9 years ago

rnowling commented 9 years ago

Custom partitioning can be make certain operations easier (e.g., grouping data to control mapping between data and files). We should evaluate the space of how custom partitioning can be used and provide utilities to make this easier. Maybe good to define high level tasks that need customer partitioning and present interfaces for those as well.

erikerlandson commented 9 years ago

With respect to Spark, you can define custom partitioners, like I did here: https://github.com/roofmonkey/cambio/blob/eje/common/src/main/scala/com/redhat/et/cambio/common/trees.scala#L1469

I'm not sure if there is any added abstraction you can layer on top of Partitioner that would make it easier.

Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically. My intuition is that this might be preferable, as you sit on top of any lower-level optimizations that the spark community comes up with.

willb commented 9 years ago

Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically.

This is what I've suggested in the past as well.

erikerlandson commented 9 years ago

Potential efficiency benefits aside, manipulating keys is an easier way to think about the problem, and it is totally general, since there is no restriction on key data types except an ordering relation.

Since the key space is unrestricted, there are going to be all manner of possible formalisms you can embed within that space.

willb commented 9 years ago

@erikerlandson Agreed; the issue AIUI is that we'd need to guarantee (for the use cases that @rnowling immediately cares about, at least) that each partition would have exactly one key. So the naïve solution is a little clunky: count the number of keys, repartition to that number, and make a HashPartitioner that assigns each key to a unique partition. I think the easiest way to do the second step is rdd.intersect(rdd, partitioner) but there might be a better option.

erikerlandson commented 9 years ago

Well, maybe the cut example is informative for that. You can set up a key space, and instantiate a partitioner that maps each key to its own partition.

erikerlandson commented 9 years ago

In that case, the map would map each key to a unique parition, and numPartitions would just return the size of that map

erikerlandson commented 9 years ago

Maybe a formalism built around RDDs that specifically map each key to unique partition, using a partitioner built around the ideas above, combined with interesting abstractions for manipulating they keys themselves, which is where the interesting action would be, and where the programming effort would focus.