pmodels / mpich

Official MPICH Repository
http://www.mpich.org

ch4/ucx: MPI_Win_allocate causing NWChem problems again #6110

Closed by jeffhammond 2 years ago

jeffhammond commented 2 years ago

See https://github.com/nwchemgit/nwchem/issues/633#issuecomment-1211754575 for the initial report. This is very similar to what I saw that led to https://github.com/pmodels/mpich/commit/bad898f9df13b5060cbf43ee4acdb3b7b4c9a0f7.

I don't know when I'll have time to bisect this. Can somebody look at RMA and see if there were any nontrivial changes to shared-memory atomics recently?

jeffhammond commented 2 years ago

It was working as of v3.4.1 but is broken in 4.x branches currently tested by Debian: https://github.com/nwchemgit/nwchem/issues/633#issuecomment-1211931953.

hzhou commented 2 years ago

https://github.com/nwchemgit/nwchem/issues/633#issuecomment-1211754575 mentions this is a regression between v4.0.1 and v4.0.2 --

$ gitlog v4.0.1..v4.0.2
64a07944d5 03/29 17:13 Merge pull request #5916 from raffenet/4.0.2-version
43055d9d0c 03/29 09:56 Update version info for 4.0.2
7762795d55 03/29 09:41 Merge pull request #5868 from raffenet/4.0.x-fixes
99914087a5 03/25 12:53 modules: update yaksa
e85aae41c3 03/28 14:27 mtest: Add GPU flags to libmtest_cxx.la library
f172287e21 03/10 18:33 mpl: fix level zero properties initialization
dbcd7dda57 03/18 15:06 init: add explicit MPIR_pmi_barrier in init
338c3be330 03/07 11:56 romio_synchronized_sync: move barrier after fsync
b1ada30eab 03/15 22:24 ch4/posix: workaround for inter-process mutex on FreeBSD
a880832132 03/01 16:22 f08: add elemental to eq/neq operator functions
a9c1d2e693 02/08 17:52 test: permanently xfail collective length checks
f70e811780 02/16 14:58 mtest: Encapsulate GPU dependencies
ec8d02ce92 11/08 15:45 mtest: Convert utilities to libtool convenience libraries
dd93a149e8 02/15 13:19 ch3: Use MPIR_pmi_spawn_multiple
265273c9c3 02/16 10:31 mpir: Add pmi2 support in MPIR_pmi_spawn_multiple
9980487ef0 02/15 17:58 ch3: Define max jobid length in a single place
bc87c98e8e 02/15 17:57 ch3/sock: Remove unused header include
27605f9cbf 02/15 14:04 pm/hydra: Fix appnum retrieval in pmi2 server
d06952dc89 02/15 13:18 pmi2/simple: Fix PMI2_Nameserv_lookup
f67095200a 02/22 16:19 ch4: Workaround Intel compiler TLS bug on macOS
010c98527c 01/19 17:36 op: replace predefined datatype on input
a1f6564fd1 01/23 01:49 mpl: Use CPU affinity functions with standard names

The likely offending commit is "b1ada30eab 03/15 22:24 ch4/posix: workaround for inter-process mutex on FreeBSD" from https://github.com/pmodels/mpich/pull/5894

EDIT: Hmm, that commit only affects FreeBSD and shouldn't affect ucx build. I am at a loss.
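The bisect Jeff mentions could be run mechanically over this narrowed range. A sketch, not from the thread; the reproducer script name and its contents are assumptions (it would need to rebuild MPICH with ch4:ucx and run the failing NWChem/ARMCI-MPI case, exiting nonzero on failure):

```shell
# Hypothetical: bisect MPICH between the last-good and first-bad tags.
# run-nwchem-repro.sh is an assumed script that does roughly:
#   ./autogen.sh && ./configure --with-device=ch4:ucx && make -j install
#   then runs the failing NWChem test, returning nonzero on failure.
git bisect start v4.0.2 v4.0.1      # first-bad, then last-good
git bisect run ./run-nwchem-repro.sh
git bisect reset
```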

hzhou commented 2 years ago

@jeffhammond I suspect this may be an issue with armci-mpi. Does it assume MPI_Win_allocate allocates shared memory intra-node? ch4:ucx does not have shm enabled.

jeffhammond commented 2 years ago

It's not ARMCI-MPI. The code has not changed significantly in the past 8 years, and has been tested literally thousands of times with NWChem against every RMA implementation out there, usually in tortuous single-node circumstances. ARMCI-MPI works fine with many other versions of MPICH. The bug is in MPICH:Ch4:UCX or UCX.

jeffhammond commented 2 years ago

ARMCI-MPI does not assume MPI_Win_allocate (MWA) allocates shared memory, because there is no standard-compliant way to determine that. I proposed a solution to that for the MPI standard.

jeffhammond commented 2 years ago

This, and the corresponding difference with free, are the only places where ARMCI-MPI behaves differently when MWA is used. None of the communication code depends on this in any way.

  if (ARMCII_GLOBAL_STATE.use_win_allocate == 0) {

      if (local_size == 0) {
        alloc_slices[alloc_me].base = NULL;
      } else {
        MPI_Alloc_mem(local_size, alloc_shm_info, &(alloc_slices[alloc_me].base));
        ARMCII_Assert(alloc_slices[alloc_me].base != NULL);
      }
      MPI_Win_create(alloc_slices[alloc_me].base, (MPI_Aint) local_size, 1, MPI_INFO_NULL, group->comm, &mreg->window);
  }
  else if (ARMCII_GLOBAL_STATE.use_win_allocate == 1) {

      /* give hint to CASPER to avoid extra work for lock permission */
      if (alloc_shm_info == MPI_INFO_NULL)
          MPI_Info_create(&alloc_shm_info);
      MPI_Info_set(alloc_shm_info, "epochs_used", "lockall");

      MPI_Win_allocate( (MPI_Aint) local_size, 1, alloc_shm_info, group->comm, &(alloc_slices[alloc_me].base), &mreg->window);

      if (local_size == 0) {
        /* TODO: Is this necessary?  Is it a good idea anymore? */
        alloc_slices[alloc_me].base = NULL;
      } else {
        ARMCII_Assert(alloc_slices[alloc_me].base != NULL);
      }
  }
raffenet commented 2 years ago

Just to confirm @jeffhammond, I believe #6140 fixed this issue?

jeffhammond commented 2 years ago

That's what I've been told by Edo, who is authoritative on all NWChem issues.

raffenet commented 2 years ago

> That's what I've been told by Edo, who is authoritative on all NWChem issues.

👍