thu-pacman / GeminiGraph

A computation-centric distributed graph processing system.
Apache License 2.0

Consistency check for partition boundaries failed #34

Open JohnMalliotakis opened 2 years ago

JohnMalliotakis commented 2 years ago

Hi, I'm trying to run PageRank across two MPI hosts (MPICH v4.0.2), each with two NUMA nodes. The input graph is very large (>4B vertices), so I converted Gemini to use uint64_t vertex IDs. Everything seems to work fine until the locality-aware chunking phase and the subsequent computation of partition offsets.
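For context, the conversion essentially boils down to widening the vertex ID typedef and the matching MPI datatypes. A rough sketch of what I changed is below; the exact typedef names and file location (core/type.hpp) are how my fork does it and may differ slightly from upstream:

```cpp
// core/type.hpp (sketch of the type change in my fork)
#include <cstdint>

typedef uint64_t VertexId;  // originally a 32-bit type, too small for >4B vertices
typedef uint64_t EdgeId;

// Note: every MPI call that transfers VertexId values also needs its MPI
// datatype widened accordingly (e.g. to MPI_UINT64_T) to stay consistent.
```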

At that point Gemini fails on the assertion at line 854 in core/graph.hpp, and indeed by adding some debug prints I can see that the two machines have computed different partition offsets for NUMA node 1.

I was able to avoid this failure by adding an extra MPI_Allreduce call, using MPI_MAX, just before the MPI_Allreduce that sets up the global_partition_offset array; it stores the maximum of the computed partition offsets directly back into the local partition offset array. However, I'm not sure this is entirely correct.
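A minimal sketch of the workaround, placed just before the existing consistency check; the variable names (partition_offset, partitions) and the MPI datatype here reflect my uint64_t fork rather than the upstream code:

```cpp
#include <mpi.h>
#include <cstdint>

// Workaround sketch: force every host to agree on the partition boundaries
// by taking the element-wise maximum across hosts, in place, before the
// reduction that fills global_partition_offset (and its assertions) runs.
// partition_offset is assumed to be the per-host array of (partitions + 1)
// boundaries computed during the locality-aware chunking phase.
void synchronize_partition_offsets(uint64_t *partition_offset, int partitions) {
  MPI_Allreduce(MPI_IN_PLACE, partition_offset, partitions + 1,
                MPI_UINT64_T, MPI_MAX, MPI_COMM_WORLD);
}
```

With this in place both machines end up with identical partition offsets, so the assertion passes, but I don't know whether taking the maximum is actually the right way to reconcile the diverging values or whether it just hides the underlying inconsistency.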

Any ideas on a possible cause of the issue? And is my workaround correct?

Note: I have forked the repo so you can take a look at my modifications.