Iallreduce gives an incorrect result for algorithms 1 and 3

donners-atos commented 2 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

repo_rev=v4.1.1-30-g535358e937

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

installed from the hpcx-2.9.0 package

Please describe the system on which you are running

Operating system/version: rhel 8.3
Computer hardware: Dual-socket Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 256 GB
Network type: 1 node, shared memory (it also happens with >1000 processes on nodes connected with InfiniBand)

Details of the problem

OMPI_MCA_coll_libnbc_iallreduce_algorithm=1 or 3 gives incorrect results from 4 processes. Here's a small program that reproduces the problem.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]){
  double tmp[2],expected[2];
  int i,rank, mpisize;
  MPI_Request req;

  MPI_Init(&argc, &argv);
  rank=-1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &mpisize);
  tmp[0]=(double)rank;
  tmp[1]=0;
  if (rank!=0) tmp[1]=1.0/(double)rank;
  MPI_Iallreduce(MPI_IN_PLACE, tmp, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

  MPI_Wait(&req, MPI_STATUS_IGNORE);
  printf("tmp: %5d %f %f\n",rank, tmp[0],tmp[1]);
  expected[0]=0.0;
  expected[1]=0.0;
  for (i=0;i<mpisize;i++) {expected[0]+=(double)i; if(i!=0)expected[1]+=1.0/(double)i;}
  if (rank==0) printf("Expected: %f %f\n",expected[0],expected[1]);
  MPI_Finalize();
}

The program prints the expected result and the actual result for each process. With OMPI_MCA_coll_libnbc_iallreduce_algorithm=1 or 3 this gives incorrect results from 4 processes:

$ OMPI_MCA_coll_libnbc_iallreduce_algorithm=0 mpirun -n 4 ./a.out 
tmp:     1 6.000000 1.833333
tmp:     3 6.000000 1.833333
tmp:     0 6.000000 1.833333
Expected: 6.000000 1.833333
tmp:     2 6.000000 1.833333
$ OMPI_MCA_coll_libnbc_iallreduce_algorithm=1 mpirun -n 4 ./a.out 
tmp:     2 24.000000 0.000000
tmp:     0 24.000000 0.000000
Expected: 6.000000 1.833333
tmp:     1 24.000000 0.000000
tmp:     3 24.000000 0.000000
$ OMPI_MCA_coll_libnbc_iallreduce_algorithm=2 mpirun -n 4 ./a.out 
tmp:     3 6.000000 1.833333
tmp:     1 6.000000 1.833333
tmp:     2 6.000000 1.833333
tmp:     0 6.000000 1.833333
Expected: 6.000000 1.833333
$ OMPI_MCA_coll_libnbc_iallreduce_algorithm=3 mpirun -n 4 ./a.out 
tmp:     1 24.000000 0.000000
tmp:     3 24.000000 0.000000
tmp:     0 24.000000 0.000000
Expected: 6.000000 1.833333
tmp:     2 24.000000 0.000000
$ OMPI_MCA_coll_libnbc_iallreduce_algorithm=4 mpirun -n 4 ./a.out 
tmp:     2 6.000000 1.833333
tmp:     0 6.000000 1.833333
Expected: 6.000000 1.833333
tmp:     1 6.000000 1.833333
tmp:     3 6.000000 1.833333

ggouaillardet commented 2 years ago

@donners-atos I am not sure this is a bug:

Open MPI philosophy is a. it should work out of the box b. give the user what they explicitly request, regardless it is smart or not

In this case, the test a. passes with the default algorithm b. fails when an algorithm known not to support MPI_IN_PLACE is explicitly requested

@bosilca @jsquyres any thoughts?

donners-atos commented 2 years ago

ok, thanks. How can I know that these algorithms do not support MPI_IN_PLACE?

bosilca commented 2 years ago

Unfortunately there is no way. In OMPI design, each collective algorithm can decide to refuse to complete the collective, in which case it will fall back to a different one (according to registration priorities), with the worst case scenario of falling back to one of the basic algorithms in base or basic.

When you force a specific algorithm no such fallback exists, but if the selected algorithm refuses to complete the operations we should return the error up to the MPI API. I was under the impression that the code was doing this, but apparently it is not the case (maybe just for the non-blocking collectives).

jsquyres commented 2 years ago

@bosilca I made PRs for v5.0.x and v4.1.x, but ff03f43 doesn't apply cleanly to the v4.0.x tree -- there's a conflict. Could you have a look?

jsquyres commented 2 years ago

@donners-atos This appears to now be fixed. Can you check the latest v4.0.x and/or 4.1.x nightly snapshots?

https://www.open-mpi.org/nightly/v4.0.x/ https://www.open-mpi.org/nightly/v4.1.x/

donners-atos commented 2 years ago

thank you for the fix. I'll test it in the coming weeks.

bwbarrett commented 2 years ago

@donners-atos is this still an issue for you? If we don't hear back, I'll close the bug next week.

open-mpi / ompi