ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Titan: ugni init error #28

Closed by abouteiller 6 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


An initialization error occurs only on Titan.

Running mpirun -n $nc-1 someprogram works as intended, but mpirun -n $nc someprogram prints the following errors and then fails:

ugniinit[ 0] IMB-MPI1-u2d[0x7582ed]
ugniinit[ 1] IMB-MPI1-u2d[0x74d372]
ugniinit[ 2] IMB-MPI1-u2d[0x45527f]
ugniinit[ 3] IMB-MPI1-u2d[0x455162]
ugniinit[ 4] IMB-MPI1-u2d[0x4401a4]
ugniinit[ 5] IMB-MPI1-u2d[0x44fbf5]
ugniinit[ 6] IMB-MPI1-u2d[0x40a061]
ugniinit[ 7] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2aaaae2d9c36]
ugniinit[ 8] IMB-MPI1-u2d[0x409ee9]
[nid17794][[25667,1],4][../../../../../opal/mca/btl/ugni/btl_ugni_component.c:497:mca_btl_ugni_component_init] Failed to initialize uGNI module @ ../../../../../opal/mca/btl/ugni/btl_ugni_component.c:497
[nid17821][[25667,1],16][../../../../../opal/mca/btl/ugni/btl_ugni_component.c:497:mca_btl_ugni_component_init] Failed to initialize uGNI module @ ../../../../../opal/mca/btl/ugni/btl_ugni_component.c:497
[nid17821:26742] *** An error occurred in MPI_Bcast
[nid17821:26742] *** reported by process [1682112513,30]
[nid17821:26742] *** on communicator MPI_COMM_WORLD
[nid17821:26742] *** MPI_ERR_PROC_FAILED: Process Failure
[nid17821:26742] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nid17821:26742] ***    and potentially your MPI job)
[titan-batch7:05614] 31 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan-batch7:05614] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Disabling the failure detector (-mca mpi_ft_detector false) does not prevent the error.

The error does not occur on Cori, where everything works as expected.

abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


This is caused by spawning more processes per node than there are physical cores (by default, one process is placed per hardware thread, HT). When the number of processes per node does not exceed the number of physical cores, behavior is nominal.
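
For anyone hitting the same thing, here is a minimal sketch of how the rank count could be capped at the number of physical cores before launching. The helper names max_ranks and check_ranks, and the values NODES=2 and CORES_PER_NODE=16, are hypothetical and only for illustration; substitute the figures for your own allocation. The script prints the mpirun command line rather than running it, and using --map-by ppr:N:node to limit ranks per node is an assumption about a reasonable way to apply the fix, not something verified on Titan.

```shell
#!/bin/sh
# Sketch only: keep the MPI rank count within the physical cores available.
# NODES and CORES_PER_NODE are hypothetical example values.

# Print the largest rank count that places at most one rank per physical core.
max_ranks() {
    nodes=$1
    cores_per_node=$2
    echo $(( nodes * cores_per_node ))
}

# Return 0 if the requested rank count fits on the physical cores, 1 otherwise.
check_ranks() {
    requested=$1
    limit=$(max_ranks "$2" "$3")
    if [ "$requested" -gt "$limit" ]; then
        echo "oversubscribed: $requested ranks > $limit physical cores" >&2
        return 1
    fi
    return 0
}

NODES=2
CORES_PER_NODE=16
NP=$(max_ranks "$NODES" "$CORES_PER_NODE")

# Print (do not execute) a command line limited to one rank per physical core.
echo "mpirun -n $NP --map-by ppr:$CORES_PER_NODE:node someprogram"
```

Calling check_ranks with the requested count first (for example, check_ranks "$nc" "$NODES" "$CORES_PER_NODE") gives an early, readable failure instead of the uGNI initialization error above.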