Open abouteiller opened 5 years ago
Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Cannot reproduce the segfault, but can reproduce a deadlock.
The action plan is to produce an updated Docker image with v4.0.1ulfm2.1rc1 as soon as possible and verify whether it fixes the issue.
Original report by Luanzheng Guo (Bitbucket: [Luanzheng Guo](https://bitbucket.org/Luanzheng Guo)).
I am learning ULFM and playing with these tutorial examples.
When I run these tutorial examples (10.respawn, 11.respawn_reorder, 12.buddycr, and the jacobi example) in the Docker image with different numbers of MPI processes (in particular 4, 7, 8, 15, and 16), the execution sometimes hangs and prints the following error message.
For example, when I ran 10.respawn with 15 MPI processes 10 times, I saw the following error message 4 times. You can easily reproduce this error by running the same command multiple times.
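The repeated runs described above could be automated with a small helper like the one below. This is only a sketch, not part of the tutorial: the function name `count_hangs` and the 60-second timeout are assumptions chosen for illustration. It treats any run that exceeds the timeout as a hang.

```python
import subprocess
import sys


def count_hangs(cmd, runs=10, timeout_s=60.0):
    """Run `cmd` `runs` times and return how many runs exceeded `timeout_s`.

    `cmd` is the full command line as a list of strings. A run that does not
    finish within `timeout_s` seconds is counted as a hang and killed.
    Note: only the direct child process is killed; processes spawned by a
    launcher may need separate cleanup.
    """
    hangs = 0
    for i in range(runs):
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
            print(f"run {i + 1}/{runs}: no exit after {timeout_s}s", file=sys.stderr)
    return hangs
```

To use it, pass the same launcher command line used for the manual runs (for instance, the one that starts 10.respawn with 15 processes) as the `cmd` list. The timeout should be comfortably longer than a normal successful run.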
After debugging with gdb, my guess is that the problem comes from the MPIX_Comm_agree function. Or could there be something wrong with the Docker image?
Thanks!