Open joshdevins opened 7 years ago
When the graph contains super nodes with hundreds of thousands or millions of out-edges, the PageRank join operation performs very poorly: the partitions are skewed, which leads to straggler tasks.
Ideas:
Super nodes in this case might not even be very big; as few as 100k out-edges may be enough to cause the skew. Consider a broadcast join (although the broadcast side can be huge) or a "duplicate" join, where the vertices dataset is artificially duplicated to add randomness and allow for uniformly distributed partitions.
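
A minimal sketch of the "duplicate" join idea, in plain Python rather than the project's actual join code. It assumes edges are `(src, dst)` pairs and ranks are a `{vertex: rank}` mapping. The names `plain_join`, `salted_join`, `largest_key_group` and the `num_salts` parameter are hypothetical and only serve the illustration. Each edge gets a random salt appended to its join key, and every vertex row is copied once per salt value, so the edges of a super node are spread over `num_salts` distinct keys instead of one.

```python
import random
from collections import Counter


def plain_join(edges, ranks):
    """Join each (src, dst) edge with the rank of src, keyed by src alone."""
    return [(src, dst, ranks[src]) for src, dst in edges if src in ranks]


def salted_join(edges, ranks, num_salts, seed=0):
    """Same join, but keyed by (src, salt) to spread out hot keys.

    Assumption: num_salts is a tuning knob chosen by the caller.
    """
    rng = random.Random(seed)
    # Edge side: attach a random salt to every source key.
    salted_edges = [((src, rng.randrange(num_salts)), dst) for src, dst in edges]
    # Vertex side: duplicate every vertex once per possible salt value,
    # so that each salted edge key still finds its matching rank.
    salted_ranks = {
        (vertex, salt): rank
        for vertex, rank in ranks.items()
        for salt in range(num_salts)
    }
    return [
        (key, dst, salted_ranks[key])
        for key, dst in salted_edges
        if key in salted_ranks
    ]


def largest_key_group(join_keys):
    """Size of the biggest group of rows sharing one join key."""
    return max(Counter(join_keys).values())


if __name__ == "__main__":
    # Vertex 0 is a super node with 10,000 out-edges; 100 other vertices
    # have a single out-edge each.
    edges = [(0, dst) for dst in range(1, 10_001)]
    edges += [(v, 0) for v in range(1, 101)]
    ranks = {v: 1.0 for v in range(10_001)}

    plain = plain_join(edges, ranks)
    salted = salted_join(edges, ranks, num_salts=10)
    print("largest key group, plain: ", largest_key_group(r[0] for r in plain))
    print("largest key group, salted:", largest_key_group(r[0] for r in salted))
```

The trade-off is that the vertex side grows by a factor of `num_salts`. In a distributed setting one would likely salt only the known super nodes rather than every vertex, which this sketch does not attempt.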