soundcloud / spark-pagerank

PageRank in Spark
https://soundcloud.github.io/spark-pagerank
MIT License

Super nodes cause poor join performance due to partition skew #38

Open joshdevins opened 7 years ago

joshdevins commented 7 years ago

When the graph contains super nodes with hundreds of thousands or millions of out-edges, the PageRank join performs very poorly: the partitions are skewed, and the tasks that hold the super nodes' edges become stragglers.
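To make the skew concrete, here is a minimal sketch in plain Python (no Spark). It assumes edges are partitioned by source vertex, uses `src % num_partitions` as a stand-in for a hash partitioner, and uses a made-up graph; `partition_sizes` is a hypothetical helper, not part of spark-pagerank.

```python
from collections import Counter


def partition_sizes(edges, num_partitions):
    """Count how many edges land in each partition when keyed by source vertex."""
    sizes = Counter()
    for src, _dst in edges:
        # Stand-in for hash partitioning on the join key (assumption).
        sizes[src % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


# Hypothetical graph: vertex 0 is a super node with 100,000 out-edges,
# vertices 1..999 each have 10 out-edges.
edges = [(0, d) for d in range(1, 100_001)]
edges += [(s, (s + k) % 1000) for s in range(1, 1000) for k in range(1, 11)]

sizes = partition_sizes(edges, num_partitions=8)
print(sizes)
```

All of a super node's edges share one key, so they end up in the same partition no matter how many partitions there are. The task that processes that partition is the straggler.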

joshdevins commented 7 years ago

A super node does not need to be very large to cause this; even around 100k out-edges can be enough. Consider a broadcast join (although the broadcast dataset can be huge) or a "duplicate" join, where the vertices dataset is artificially duplicated to add randomness and allow for uniformly distributed partitions.
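A rough sketch of the "duplicate" join idea (often called a salted join), again in plain Python rather than Spark. The names `salted_join` and `NUM_SALTS`, the replication factor, and the sample data are all assumptions for illustration, not code from this repository. Each vertex record is replicated once per salt value, and each edge picks a random salt, so the edges of a super node are spread over `NUM_SALTS` distinct join keys.

```python
import random
from collections import Counter

NUM_SALTS = 16  # hypothetical replication factor


def salted_join(edges, vertex_values, num_salts=NUM_SALTS, seed=0):
    """Join edges to vertex values on the source vertex using salted keys.

    Returns the joined rows and the number of edges per salted key.
    """
    rng = random.Random(seed)
    # Replicate every vertex record once per salt value.
    replicated = {
        (vertex, salt): value
        for vertex, value in vertex_values.items()
        for salt in range(num_salts)
    }
    joined = []
    load = Counter()
    for src, dst in edges:
        key = (src, rng.randrange(num_salts))  # random salt per edge
        load[key] += 1
        joined.append((src, dst, replicated[key]))
    return joined, load


# Hypothetical data: vertex 0 is a super node with 100,000 out-edges.
edges = [(0, d) for d in range(1, 100_001)] + [(1, 0), (2, 0)]
vertex_values = {0: 0.5, 1: 0.25, 2: 0.25}

joined, load = salted_join(edges, vertex_values)
print(len(joined), max(load.values()))
```

The join result is the same as an unsalted join, but no single key carries more than roughly 1/`NUM_SALTS` of the super node's edges. The cost is that the vertices dataset grows by a factor of `NUM_SALTS`, so in practice one might only replicate the vertices known to be super nodes.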

joshdevins commented 7 years ago

Ideas: