Closed by abouteiller 6 years ago
Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
This is due to spawning more processes per node than there are physical cores (by default, 1 process per HT). When the right number of processes is spawned per node, behavior is nominal.
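For anyone hitting the same symptom, here is a minimal sketch of one way to derive a per-node process count from physical cores rather than hardware threads. It assumes a Linux node where `lscpu` is available; the helper name `count_physical_cores` is hypothetical, and whether this matches the node layout on Titan has not been checked.

```shell
#!/bin/sh
# Hypothetical helper: count physical cores from `lscpu -p=Core,Socket`
# style input on stdin. Lines starting with '#' are header comments; every
# other line is "core_id,socket_id" for one logical CPU, so the number of
# unique lines is the number of physical cores.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Assumed usage on a compute node (not run automatically here):
#   nc=$(lscpu -p=Core,Socket | count_physical_cores)
#   mpirun -n "$nc" someprogram
```

Counting unique core/socket pairs means two hardware threads that share a core are counted once, which is the distinction the diagnosis above turns on.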
Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
There is an initialization error that happens only on Titan.
mpirun -n $nc-1 someprogram
will work as intended.
mpirun -n $nc someprogram
will spit some errors before malfunctioning.
Disabling the failure detector
-mca mpi_ft_detector false
does not stop the error.
The error is not present on the Cori machine where all things work as expected.