ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Bugs in running SC18 tutorial code in the provided docker container #49

Open abouteiller opened 5 years ago

abouteiller commented 5 years ago

Original report by Luanzheng Guo (Bitbucket: [Luanzheng Guo](https://bitbucket.org/Luanzheng Guo)).


I am learning ULFM and playing with these tutorial examples.

I ran these tutorial examples (10.respawn, 11.respawn_reorder, 12.buddycr, and the jacobi example) in the docker image with different numbers of MPI processes (in particular 4, 7, 8, 15, and 16). Sometimes the execution hangs and prints the following error message.

For example, when I ran 10.respawn with 15 MPI processes 10 times, I saw the following error message in 4 of the runs. You can easily reproduce this error by running the same command multiple times.
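Since the failure is intermittent, a small harness that repeats the run and counts hangs can make the failure rate easier to report. The following is only a minimal sketch: the function name `count_outcomes`, the 60-second timeout, and the idea that a run exceeding the timeout counts as a hang are all assumptions for illustration, not part of the tutorial.

```python
import subprocess


def count_outcomes(cmd, runs=10, timeout_s=60):
    """Run `cmd` `runs` times and classify each run as ok, failed, or hung.

    A run is counted as "hung" if it does not finish within `timeout_s`
    seconds (an assumed threshold; pick one well above the normal runtime).
    """
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(runs):
        try:
            # subprocess.run kills the child process when the timeout expires.
            result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            counts["hung"] += 1
            continue
        # A non-zero exit status (crash, abort) is counted as a failure.
        counts["ok" if result.returncode == 0 else "failed"] += 1
    return counts
```

For the case described above, it could be called as `count_outcomes(["mpirun", "-np", "15", "./10.respawn"])`, assuming the binary is in the current directory and that any fault-tolerance options the ULFM build needs are added to the command list. Killing `mpirun` on timeout may leave stray ranks behind, so it can be worth checking for leftover processes between runs.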

After debugging with gdb, I suspect the problem may come from the MPIX_Comm_agree function. Or could there be something wrong with the docker image?

Thanks!

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Cannot reproduce the segfault, but can reproduce a deadlock.

The action plan is to produce an updated docker image with v4.0.1ulfm2.1rc1 ASAP and verify whether it fixes the issue.