Right now, the most expensive transformation in our application is the use of reduceByKey, which induces a network shuffle and subsequent repartition of the RDD. If we can reduce or even eliminate this network penalty we should see substantial improvement. We may be able to do this with a custom partitioner.
Example: http://stackoverflow.com/questions/30677095/pyspark-repartitioning-rdd-elements
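As a minimal sketch of the idea (the key format and partition count here are hypothetical, not from the application above): in PySpark a custom partitioner is just a Python function mapping each key to a partition index, passed to partitionBy. If the same function is later passed to reduceByKey, every occurrence of a key is already co-located, so the reduce should not need a network shuffle.

```python
NUM_PARTITIONS = 8  # hypothetical partition count for this sketch

def first_char_partitioner(key):
    """Map a key to a partition index by its first character.

    Keys that share a first character always land in the same
    partition, so a reduceByKey over data pre-partitioned with this
    function finds all values for a key in one place.
    """
    return ord(str(key)[0]) % NUM_PARTITIONS

# Sketch of PySpark usage (assumes an existing SparkContext `sc` and a
# hypothetical pair RDD `raw_pairs`):
#
#   pairs = sc.parallelize(raw_pairs)
#   partitioned = pairs.partitionBy(NUM_PARTITIONS, first_char_partitioner)
#   # Passing the same partitioner again lets reduceByKey reuse the
#   # existing layout instead of reshuffling:
#   counts = partitioned.reduceByKey(
#       lambda a, b: a + b, NUM_PARTITIONS, first_char_partitioner)
```

The essential design point is determinism: the partitioner must send a given key to the same partition on every call, and both partitionBy and reduceByKey must agree on the function and the partition count, or Spark will shuffle anyway.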