thu-pacman / GeminiGraph

A computation-centric distributed graph processing system.
Apache License 2.0

Distributed algorithm segmentation fault (using SLURM + MVAPICH2) #5

Closed · sunrise2575 closed this issue 6 years ago

sunrise2575 commented 6 years ago

Hi,

I'm using MVAPICH2 2.3b and SLURM 17.11.4 (with MUNGE, MariaDB, OpenSSL, PAM, etc.) on CentOS 7.4.

I use three nodes (an Intel Xeon CPU cluster): sun07 is the master, which runs both the slurmctld and slurmd daemons, and sun08 and sun09 are slaves, which only run the slurmd daemon.

MVAPICH2 was compiled and installed with ./configure --with-pmi=pmi2 --with-pm=slurm, as described in the MVAPICH2 user guide (http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.pdf).

I tested the Linux command hostname and mpi_hello.c (https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_hello.c).

[heeyong@sun07 ~] srun -n 3 hostname
srun: job 88 queued and waiting for resources
srun: job 88 has been allocated resources
sun07
sun08
sun09
[heeyong@sun07 ~] srun -n 3 ./mpi_hello
srun: job 89 queued and waiting for resources
srun: job 89 has been allocated resources
Hello from task 0 on sun07!
MASTER: Number of MPI tasks is: 3
Hello from task 2 on sun09!
Hello from task 1 on sun08!

Both tests worked perfectly in the multi-node setup.

However, if I run a command like this:

[heeyong@sun07 gemini]$ srun -n 3 ./toolkits/pagerank /path/to/twitter-2010.binedgelist 41652230 20
srun: job 95 queued and waiting for resources
srun: job 95 has been allocated resources
[sun09:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: sun09: task 2: Segmentation fault (core dumped)
srun: error: sun08: task 1: Segmentation fault (core dumped)

and after that there is no further output.

I'm sure that I compiled and executed Gemini with the same MVAPICH2 version, as you mentioned in https://github.com/thu-pacman/GeminiGraph/issues/2

Finally, I traced it manually (using printf) and found that the program gets stuck around lines 1082~1102 of core/graph.hpp.

What is the problem...?

coolerzxw commented 6 years ago

Hi @sunrise2575 , is the input file properly encoded in the binary edge list format (i.e. each edge containing two integers in binary)? By the way, you can first try running the program using a single process (i.e. without "srun -n [N]") and see if it behaves normally.
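
For example, a quick sanity check could be to read the file back and confirm that its size is an exact multiple of two integers per edge, and that the first few edges match your text edge list. This is only a rough sketch, not code from Gemini; the file name and the assumption of unweighted edges stored as two 32-bit integers each are mine:

```cpp
// check_binedgelist.cpp -- hypothetical helper, not part of Gemini.
// Assumes an unweighted binary edge list: each edge is two 32-bit integers.
#include <cstdint>
#include <cstdio>
#include <sys/stat.h>

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <binary_edge_list>\n", argv[0]);
    return 1;
  }
  struct stat st;
  if (stat(argv[1], &st) != 0) { perror("stat"); return 1; }
  const size_t bytes_per_edge = 2 * sizeof(uint32_t);
  if (st.st_size % bytes_per_edge != 0) {
    fprintf(stderr, "file size %lld is not a multiple of %zu bytes per edge\n",
            (long long)st.st_size, bytes_per_edge);
    return 1;
  }
  printf("%lld edges\n", (long long)(st.st_size / bytes_per_edge));
  FILE *f = fopen(argv[1], "rb");
  if (!f) { perror("fopen"); return 1; }
  uint32_t e[2];
  // print the first few edges so they can be compared against the text edge list
  for (int i = 0; i < 5 && fread(e, sizeof(uint32_t), 2, f) == 2; i++) {
    printf("%u -> %u\n", (unsigned)e[0], (unsigned)e[1]);
  }
  fclose(f);
  return 0;
}
```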

sunrise2575 commented 6 years ago

Software

Hardware

I downloaded:

I decompressed both files to text edge lists using this code (https://github.com/btootoonchi/Anchored_2-Core).

I converted the text edge lists to binary edge lists using my own program, written with reference to GraphChi's conversion code (https://github.com/GraphChi/graphchi-cpp/blob/master/src/preprocessing/conversions.hpp).
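
Roughly, the converter does something like the following (a simplified sketch of my own program, not the exact code; it assumes one "src dst" pair per line in the text file and writes each edge as two 32-bit integers):

```cpp
// txt2bin.cpp -- simplified sketch of my text-to-binary edge list converter.
#include <cstdint>
#include <cstdio>

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s <in.txt> <out.binedgelist>\n", argv[0]);
    return 1;
  }
  FILE *in = fopen(argv[1], "r");
  FILE *out = fopen(argv[2], "wb");
  if (!in || !out) { perror("fopen"); return 1; }
  unsigned long src, dst;
  uint32_t edge[2];
  while (fscanf(in, "%lu %lu", &src, &dst) == 2) {
    edge[0] = (uint32_t)src;
    edge[1] = (uint32_t)dst;
    fwrite(edge, sizeof(uint32_t), 2, out);  // native (little-endian) byte order on x86
  }
  fclose(in);
  fclose(out);
  return 0;
}
```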

^Csrun: interrupt (one more within 1 sec to abort)
srun: step:141.0 task 1: running
srun: step:141.0 tasks 0,2: exited abnormally
^Csrun: sending Ctrl-C to job 141.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Because Gemini gave no further output for 10 minutes, I killed the process manually.
Using top, I figured out that after the segmentation fault appears,
it constantly uses exactly 2 cores at 100% utilization on each node.

**Conclusion**
- the 1-node test works
- the 3-node test fails
- the binary edge list itself is not the problem

**Plus**
I tried a very small test using this text edge list:

0 1
1 2
2 3
3 4
4 0


which has five vertices forming a circular topology.
It works in the 1-node test, but the 3-node test again fails with a segmentation fault.

Please help me.

coolerzxw commented 6 years ago

"By using top, I figured out that after showing segmentation fault, it constantly uses exactly 2 cores with 100% utilization ... per each node." According to the above behavior and your MPI environment being MVAPICH2, I think the problem might be that MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided) does not return MPI_THREAD_MULTIPLE as required. You can try passing "MV2_ENABLE_AFFINITY=0" to run the command (i.e. before srun) according to this discussion.

sunrise2575 commented 6 years ago

MV2_ENABLE_AFFINITY=0 solves the whole problem. Thanks so much!