Open rnowling opened 9 years ago
With respect to Spark, you can define custom partitioners, like I did here: https://github.com/roofmonkey/cambio/blob/eje/common/src/main/scala/com/redhat/et/cambio/common/trees.scala#L1469
I'm not sure if there is any added abstraction you can layer on top of Partitioner
that would make it easier.
Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically. My intuition is that this might be preferable, as you sit on top of any lower-level optimizations that the spark community comes up with.
Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically.
This is what I've suggested in the past as well.
Potential efficiency benefits aside, manipulating keys is an easier way to think about the problem, and it is totally general, since there is no restriction on key data types except an ordering relation.
Since the key space is unrestricted, there are going to be all manner of possible formalisms you can embed within that space.
@erikerlandson Agreed; the issue AIUI is that we'd need to guarantee (for the use cases that @rnowling immediately cares about, at least) that each partition would have exactly one key. So the naïve solution is a little clunky: count the number of keys, repartition to that number, and make a HashPartitioner
that assigns each key to a unique partition. I think the easiest way to do the second step is rdd.intersect(rdd, partitioner)
but there might be a better option.
Well, maybe the cut
example is informative for that. You can set up a key space, and instantiate a partitioner that maps each key to its own partition.
In that case, the map would map each key to a unique parition, and numPartitions
would just return the size of that map
Maybe a formalism built around RDDs that specifically map each key to unique partition, using a partitioner built around the ideas above, combined with interesting abstractions for manipulating they keys themselves, which is where the interesting action would be, and where the programming effort would focus.
Custom partitioning can be make certain operations easier (e.g., grouping data to control mapping between data and files). We should evaluate the space of how custom partitioning can be used and provide utilities to make this easier. Maybe good to define high level tasks that need customer partitioning and present interfaces for those as well.