quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Custom Spark Partitioner #60

Open magsol opened 8 years ago

magsol commented 8 years ago

Right now, the most expensive transformation in our application is reduceByKey, which induces a network shuffle and a subsequent repartition of the RDD. If we can reduce or even eliminate this network penalty, we should see a substantial performance improvement. We may be able to do this with a custom partitioner; see the sketch below the example link.

Example: http://stackoverflow.com/questions/30677095/pyspark-repartitioning-rdd-elements
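A minimal sketch of the idea, assuming integer keys; the `key_partitioner` function, partition count, and toy data are illustrative and not taken from dr1dl-pyspark. The pattern is: partition the RDD once with a custom partition function, then pass the same `partitionFunc` to `reduceByKey` so the aggregation stays partition-local and Spark can skip the extra shuffle.

```python
from pyspark import SparkContext

sc = SparkContext(appName="custom-partitioner-sketch")

NUM_PARTITIONS = 8

def key_partitioner(key):
    # Deterministically map each integer key to a partition so that all
    # records sharing a key are co-located up front. (Illustrative only;
    # a real partitioner would exploit the structure of our row indices.)
    return int(key) % NUM_PARTITIONS

# Toy (key, value) pairs standing in for the RDD we currently reduce by key.
pairs = sc.parallelize([(i % 16, float(i)) for i in range(1000)])

# Shuffle once with the custom partitioner and keep that layout around.
partitioned = pairs.partitionBy(NUM_PARTITIONS, key_partitioner).cache()

# Reuse the same partitionFunc in reduceByKey. Because the RDD already
# carries an equal partitioner, the reduction stays partition-local and
# no second network shuffle is performed.
sums = partitioned.reduceByKey(lambda a, b: a + b,
                               numPartitions=NUM_PARTITIONS,
                               partitionFunc=key_partitioner)

print(sums.take(5))
sc.stop()
```

In PySpark, reduceByKey is built on combineByKey/partitionBy, and partitionBy returns the RDD unchanged when it already has an equal partitioner (same number of partitions and the same partitionFunc object), which is what avoids the repeated shuffle. The initial partitionBy still shuffles once, so the payoff comes from reusing that layout across the repeated reductions in our iteration loop.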