Right now, the most expensive transformation in our application is the use of reduceByKey, which induces a network shuffle and subsequent repartition of the RDD. If we can reduce or even eliminate this network penalty we should see substantial improvement. We may be able to do this with a custom partitioner.
Example: http://stackoverflow.com/questions/30677095/pyspark-repartitioning-rdd-elements
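As a minimal sketch of the idea (the key format and partition count here are hypothetical, not from the application above): in PySpark a custom partitioner is just a Python function mapping each key to a partition index, passed to partitionBy. If the same function is later passed to reduceByKey, every occurrence of a key is already co-located, so the reduce should not need a network shuffle.

```python
NUM_PARTITIONS = 8  # hypothetical partition count for this sketch

def first_char_partitioner(key):
    """Map a key to a partition index by its first character.

    Keys that share a first character always land in the same
    partition, so a reduceByKey over data pre-partitioned with this
    function finds all values for a key in one place.
    """
    return ord(str(key)[0]) % NUM_PARTITIONS

# Sketch of PySpark usage (assumes an existing SparkContext `sc` and a
# hypothetical pair RDD `raw_pairs`):
#
#   pairs = sc.parallelize(raw_pairs)
#   partitioned = pairs.partitionBy(NUM_PARTITIONS, first_char_partitioner)
#   # Passing the same partitioner again lets reduceByKey reuse the
#   # existing layout instead of reshuffling:
#   counts = partitioned.reduceByKey(
#       lambda a, b: a + b, NUM_PARTITIONS, first_char_partitioner)
```

The essential design point is determinism: the partitioner must send a given key to the same partition on every call, and both partitionBy and reduceByKey must agree on the function and the partition count, or Spark will shuffle anyway.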