open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Is it possible for the target to service MPI_Get's in parallel? #9508

frizwi opened 3 years ago

frizwi commented 3 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Using Open MPI v4.1.1 on Linux

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from a source tarball, with no configure flags other than --prefix.


Details of the problem

It's more of a question: with point-to-point communication, either non-blocking sends or blocking sends issued from separate threads can be used to parallelise transfers, but is there any way to set up/build/configure the Open MPI library to do the same at the target?

I have a master process and many workers; each worker performs an MPI_Get with the master process as the target, using passive synchronisation. The issue is that as the number of workers increases, so does the walltime for the transfers, so I'd like the target to service the Gets in parallel threads - is that possible? With point-to-point comms I could use non-blocking sends, or blocking sends inside parallel OpenMP sections, to achieve this, but it's not clear whether I can do anything similar via RMA. Is there some trick that can be done with active synchronisation? The only other thing I can think of is to have the master issue MPI_Puts in parallel - will that work?

Thanks in advance for any suggestions, or please point me to some doc if available. The MPI spec doesn't really address my issue, so I'm hoping to get an understanding of how Open MPI actually implements RMA in order to fine-tune my application.
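
To make the pattern concrete, each worker currently does roughly the following per step (a minimal sketch; lpatm, szS and dtype are the per-worker buffer, element count and indexed datatype from my real code, and win is the window exposed by the master):

/* Worker-side passive-target get from the master (rank 0) */
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
MPI_Get(lpatm, szS, MPI_DOUBLE,
        0 /* target rank */, 0 /* target disp */, 1, dtype, win);
MPI_Win_unlock(0, win);   /* the transfer is complete after the unlock */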

devreal commented 3 years ago

Both main implementations in OMPI will eventually have parallel transfers (assuming they are large enough that they can actually overlap). The specifics depend on your platform. If your network supports RDMA (IB, Cray Aries) then the hardware will most likely handle transfers in parallel. If not, then there may be a serialization in the starting of transfers by a thread at the target. Either way, there is nothing in the MPI standard that gives you control over it and the design of RMA is such that puts and gets from different origins are independent and can thus happen in parallel.

Some questions regarding the problem at hand: what size are your transfers? What network are you using? Are you building with UCX support? What does the master process do while the workers are getting their data? Can you rule out that the slow-down is caused by limited bandwidth?

frizwi commented 3 years ago

The master array size is on the order of 150,000 doubles, which is divvied up more or less evenly amongst the workers. Indices are not necessarily consecutive, so I create MPI_Datatypes using the array mappings on each of the origins. Yes, good point, I'm not actually building with UCX support. For this test case I'm only using a small number of workers, 2/4/8, and only on a 4 x CPU single node, so I was assuming that it would all be handled efficiently via shared mem. But maybe I should still build with UCX and KNEM too. I will try this, but it's probably best to move to our HPC cluster (IB) as that's really the aim here - to scale across many nodes.

The master basically does file i/o while the workers are doing the computations.

Thanks for the info, it's good to know that with suitable hardware and Open MPI configurations, parallel transfers can be had from the library without explicit handling in the application.

devreal commented 3 years ago

Another aspect that is of importance: how do you allocate the window? Using MPI_Win_allocate is typically the most efficient option as it allows the MPI implementation to manage the memory (and allocate shared memory through mmap or sysv if possible). You wouldn't need knem or xpmem for shared memory support. Things are more difficult with MPI_Win_create (OMPI might use xpmem if available, not sure though) and dynamic windows are the worst in my experience.
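
For illustration, a minimal sketch of the allocated-window variant (MPI_Win_allocate is collective; each rank passes the size it wants to expose, which could be 0 on ranks that expose nothing):

/* Sketch only: let MPI allocate the window memory (and map it as shared
 * memory on a single node, if possible) instead of exposing malloc'd memory. */
double *patm = NULL;
MPI_Win  win;
MPI_Win_allocate(sz * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &patm, &win);
/* ... master fills patm, workers MPI_Get from rank 0 as before ... */
MPI_Win_free(&win);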

If you're running on a single node and you're using MPI_Win_allocate, OMPI should use shared memory to perform the gets without requiring the target to get involved in the transfer. You can force OMPI to use the shared memory-only implementation by passing --mca osc sm to mpirun. If that fails for some reason, something is off with your setup or you're not using allocated windows and might fall back to a less efficient implementation. If it works and scaling is still not what you expect it's likely a bandwidth issue.
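
For example (a sketch; the process count and application name are placeholders):

mpirun --mca osc sm -np <nprocs> ./your_app <args>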

When you move to an IB-based cluster, I highly recommend you compile OMPI against UCX. Let us know if you're seeing performance issues there, too. In that case, more information on the datatype layout and/or a small reproducer would be helpful.

frizwi commented 3 years ago

Using MPI_Win_create to create windows saves modifying hundreds of arrays in existing code that uses malloc. The application needs to work in non-MPI mode too, but I'll add wrapper functions if it turns out to be a big performance gain.

Also a follow-up question: on the worker ranks I'm doing something like err = MPI_Win_create(lpatm, szS*sizeof(double), sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win); where lpatm is the receive buffer for each worker and is used in the MPI_Get call. Strictly speaking, I don't need to expose the worker's memory, so I could just pass in NULL. However, is it doing any harm? I.e., is there any issue with it being copied into a public area of memory even though it's never referenced by any other process?

If it helps, I'll try and simplify my code into a standalone example and post it here with some timing stats

devreal commented 3 years ago

Using MPI_Win_create to create windows saves modifying hundreds of arrays in existing code that uses malloc. The application needs to work in non-MPI mode too, but I'll add wrapper functions if it turns out to be a big performance gain.

That is a general problem with MPI RMA, it's not trivial to integrate with existing allocation structures. MPI_Win_create will certainly work with IB networks but shared memory is a little more challenging. I glanced at the rdma one-sided communication implementation in OMPI (enable with --mca osc rdma) and see traces of xpmem so you might want to try to build with xpmem support (requires the xpmem kernel module to work).

Also a follow-up question: on the worker ranks I'm doing something like err = MPI_Win_create(lpatm, szS*sizeof(double), sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win); where lpatm is the receive buffer for each worker and is used in the MPI_Get call.

It shouldn't do any harm, but the standard is not clear on whether passing NULL is permitted, and it certainly does not mandate that the origin buffer be part of a window. By using a window location as the origin buffer of a get or put you might save on registering the memory with the NIC, because it is likely already registered. Whether or not that is significant is hard to predict.
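
If you want to avoid the question entirely, one conservative option (a sketch only, nothing above requires it) is to expose a zero-size window on the workers and keep lpatm as a plain local buffer:

/* Workers expose an empty window; a 1-element dummy is used as the base
 * to sidestep the question of whether a NULL base is portable. */
static double dummy;
MPI_Win win;
MPI_Win_create(&dummy, 0 /* size in bytes */, sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);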

frizwi commented 3 years ago

Okay, thanks very much, this is certainly helpful info.

Here is some example code:

#define NSTEPS 5

/*
 * Entry function
 */
int main(int argc, char **argv)
{
  int infid, wpfid, acfid;
  char *infname, *wpfname;
  int isMaster;
  int err;
  int mpi_rank = 0;
  int mpi_size = 0;
  MPI_Win win = MPI_WIN_NULL;
  int gsz = 0;

  /* Initialise MPI */
  MPI_Init(&argc, &argv);  // xxx threaded, how?
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

  isMaster = (mpi_rank == 0);

  /* Check input args */
  if (argc != 3) {
    printf("\n");
    printf("Usage %s <input file> <win_map>\n", argv[0]);
    printf("\n");
    exit(1);
  }

  /* Cache input file name */
  infname = argv[1];
  wpfname = argv[2];

  if (isMaster) {
    geometry_t geom;
    double *patm;
    double *lons, *lats;
    size_t sz;
    int patm_id, lons_id, lats_id;
    int n, s;
    double *wtimes = d_alloc_1d(mpi_size);
    double *ltimes = d_alloc_1d(mpi_size);

    printf("\n");
    printf("========\n");
    printf("EMS sim2\n");
    printf("========\n");
    printf("\n");

    /* Initialise */
    memset(&geom, 0, sizeof(geometry_t));

    /* Open input file */
    err = nc_open(infname, NC_NOWRITE, &infid);
    NC_ERR(err);

    /* This is the whole sparse array */
    geom.b2_t = getDimLen(infid, "nMesh2_face");

    /* Global arrays */
    sz = geom.b2_t;
    patm = d_alloc_1d(sz+1);
    lons = d_alloc_1d(sz+1);
    lats = d_alloc_1d(sz+1);

    gsz = (int)sz;
    printf("Master gsz = %d\n", gsz);
    MPI_Bcast(&gsz, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Get the spatial info */
    err = nc_get_var_double(infid, getVarid(infid, "Mesh2_face_x"), lons);
    NC_ERR(err);
    err = nc_get_var_double(infid, getVarid(infid, "Mesh2_face_y"), lats);
    NC_ERR(err);

    /* Initialise RMA window */
    err = MPI_Win_create(patm, sz*sizeof(double), sizeof(double),
             MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_ERR(err);

    /* 
     * Simulate steps
     * 
     * In the real application, data would be read out of a netCDF file
     */
    for (s=0; s<NSTEPS; s++) {
      double dt = 1.0;

      for (n=1; n<=sz; n++)
        patm[n] = n + s*0.12;

      printf("Master step %d\n", s);

      /* Sync with slaves */
      MPI_Barrier(MPI_COMM_WORLD);

      /* Get timings */
      MPI_Gather(&dt, 1, MPI_DOUBLE, ltimes, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      for (n=1; n<mpi_size; n++)
        wtimes[n] += ltimes[n];
    }
    fflush(stdout);
    sleep(1);

    printf("\nTiming stats:\n");
    /* Report timing - xxx better to do reduce? */
    for (n=1; n<mpi_size; n++)
      printf("%03d time = %f\n", n, wtimes[n]);

  } else {
    /*
     * Slave code
     */
    int nw = mpi_rank;
    geometry_t window;
    size_t start[2], count[2];
    int szS;
    double *lpatm = NULL;
    MPI_Datatype dtype;
    int s, n;
    size_t indexp = nw;
    FILE *f;
    char buf[256];

    /* Initialise */
    memset(&window, 0, sizeof(geometry_t));

    /* Open win_map file */
    err = nc_open(wpfname, NC_NOWRITE, &wpfid);
    NC_ERR(err);

    /* get b2_t, wsa info */
    err = nc_get_var1_int(wpfid, getVarid(wpfid, "b2_t_w"),
                &indexp, &window.b2_t);
    NC_ERR(err);

    /* Allocate local array for this window */
    szS = window.b2_t;
    lpatm = d_alloc_1d(szS+1);
    window.wsa = i_alloc_1d(szS+1);

    /* Get window local to global array mapping */
    start[0] = nw; start[1] = 1;
    count[0] = 1;  count[1] = szS;
    err = nc_get_vara_int(wpfid, getVarid(wpfid, "wsa"),
                start, count, window.wsa);
    NC_ERR(err);

    MPI_Bcast(&gsz, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("%d:Slave szS = %d\n", mpi_rank, szS);
    /* Reset aux/ghost cells */
    for (n=0; n<szS; n++) {
      if (window.wsa[n] >= gsz) {
        // printf("Ooops! wn=%d, window.wsa[%d]=%d\n", nw, n, window.wsa[n]);
        window.wsa[n] = 1;
      }
    }

    /* Initialise RMA window */
    err = MPI_Win_create(lpatm, szS*sizeof(double), sizeof(double),
             MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_ERR(err);

    /* Set up mapping */
    err = MPI_Type_create_indexed_block(szS, 1, window.wsa, MPI_DOUBLE, &dtype);
    MPI_ERR(err);
    err = MPI_Type_commit(&dtype);
    MPI_ERR(err);

    // sleep(5);
    //printf("Slave(%d) ready\n", nw);

    /* sleep to simulate loop and barrier */
    for (s=0; s<NSTEPS; s++) {
      double t0, t1, dt;

      /* Sync with master */
      MPI_Barrier(MPI_COMM_WORLD);

      //printf("(%d) before = %f\n", nw, lpatm[10]);

      t0 = MPI_Wtime();
      err = MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
      MPI_ERR(err);

      err = MPI_Get(lpatm, szS, MPI_DOUBLE, 0, 0, 1, dtype, win);
      MPI_ERR(err);

      err = MPI_Win_unlock(0, win);
      MPI_ERR(err);
      t1 = MPI_Wtime();
      dt = t1-t0;

      MPI_Gather(&dt, 1, MPI_DOUBLE, NULL, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      /* test */
      //      for (n=0; n<szS; n++)
      //printf("(%d) after  = %f\n", nw, lpatm[10]);
    }

  }

  /* Handy for debugging */
  //  sleep(10);
  // MPI_Barrier(MPI_COMM_WORLD);

  MPI_Finalize();

  return(0);
}

And if I run it with 2 workers:

% mpirun -np 3 ./modelrun tile0b.nc ../compas-mpi/model/tests/setile/tile0_win_map_2W.nc.bak
========
EMS sim2
========
Master gsz = 145230
1:Slave szS = 72606
2:Slave szS = 72624
Master step 0
Master step 1
Master step 2
Master step 3
Master step 4

Timing stats:
001 time = 0.323801
002 time = 0.324102

But with 4 workers, it takes longer!

% mpirun -np 5 ./modelrun tile0b.nc ../compas-mpi/model/tests/setile/tile0_win_map_4W.nc

========
EMS sim2
========

Master gsz = 145230
2:Slave szS = 36320
4:Slave szS = 36307
1:Slave szS = 36291
3:Slave szS = 36312
Master step 0
Master step 1
Master step 2
Master step 3
Master step 4

Timing stats:
001 time = 0.593371
002 time = 0.556528
003 time = 0.592080
004 time = 0.602115

Note: This is still Open MPI built without UCX or xpmem, so are these timings what you might expect?

devreal commented 3 years ago

Mhh, when breaking it down to (3 operations (lock, get, unlock) x 5 iterations x N workers) I end up with ~10 ms per operation in both cases, which seems quite slow. I wanted to test your example both with and without xpmem, but it doesn't seem to be self-contained, so I couldn't compile it, unfortunately. Have you had a chance to run with UCX and/or xpmem?

frizwi commented 3 years ago

You mean the above code is not self-contained? I've attached all the missing source and a Makefile (may need to tweak MPI paths)

ems-sim2-master@0a10f9c0f0b.zip

I tried UCX but the numbers are pretty much the same, probably because UCX wanted KNEM/XPMEM which I haven't installed yet. Do I get xpmem from Google Code?

Also, what about the mpirun command - do I need processor binding or something? Does Open MPI use [p]threads or OpenMP? If the latter, should I be setting OMP_NUM_THREADS or some other affinity env var?

yosefe commented 3 years ago

Do I get xpmem from Google Code?

The original xpmem was written by @hjelmn at https://github.com/hjelmn/xpmem. In addition, the UCX community maintains a fork of xpmem at https://github.com/openucx/xpmem. I think they have almost the same code now (except maybe adaptations for new kernels).

frizwi commented 3 years ago

Here are the data files needed to run my code (the 3rd argument is bogus, not used): data.tar.gz

Thanks @yosefe, I can't get autoconf to work, but it looks like the Open MPI on our HPC cluster is built with KNEM, so I will try that.

frizwi commented 3 years ago

Okay, I have tried running the code on a single node of our HPC cluster using Open MPI 4.1.1, which is built with KNEM.

Here are the results:

[ffr599@gadi-cpu-clx-0971 ems-sim2]$ mpirun -np 3 ./modelrun in.nc win_map_2W.nc 
Master gsz = 145230
1:Slave szS = 72606
2:Slave szS = 72624
Master step 0
Master step 1
Master step 2
Master step 3
Master step 4

Timing stats:
001 time = 0.092372
002 time = 0.092501

[ffr599@gadi-cpu-clx-0971 ems-sim2]$ mpirun -np 5 ./modelrun in.nc win_map_4W.nc 
Master gsz = 145230
4:Slave szS = 36307
1:Slave szS = 36291
2:Slave szS = 36320
3:Slave szS = 36312
Master step 0
Master step 1
Master step 2
Master step 3
Master step 4

Timing stats:
001 time = 0.132498
002 time = 0.123287
003 time = 0.131582
004 time = 0.133267

[ffr599@gadi-cpu-clx-0971 ems-sim2]$ mpirun -np 9 ./modelrun in.nc win_map_8W.nc 
Master gsz = 145230
2:Slave szS = 18157
3:Slave szS = 18163
8:Slave szS = 18137
1:Slave szS = 18159
4:Slave szS = 18154
5:Slave szS = 18150
6:Slave szS = 18158
7:Slave szS = 18152
Master step 0
Master step 1
Master step 2
Master step 3
Master step 4

Timing stats:
001 time = 0.177935
002 time = 0.177614
003 time = 0.179040
004 time = 0.178401
005 time = 0.174531
006 time = 0.178017
007 time = 0.173686
008 time = 0.167491

The absolute timing is not as important as the lack of scaling. As the number of processes increases, the size of each MPI_Get decreases (by factors of 2, 4 and 8), so I would expect the time for the transfers to also decrease, but that is clearly not the case. The computation scales well, but the overall walltime for the application doesn't improve due to the increase in transfer times. Am I not timing this properly, or do I have some other conceptual issue?

bosilca commented 3 years ago

Let me try to put into words what your application is doing. A set of worker processes accesses the memory of the master process to retrieve some values. Each of these processes obtains a lock on the window prior to its access, gets the private value it needs, and then unlocks the window, handing it over to some other process. Thus, you have a totally sequential access pattern - not because of the gets, but because the gets are protected by a lock.

Think about the same pattern between threads, where each thread accesses a memory location protected by a shared mutex. You can't really make a more sequential process. Thus, your code as is has zero parallelism in the part you have provided here so far.

frizwi commented 3 years ago

Thanks, yes, that makes complete sense. If the accesses from the worker processes were staggered so that only one worker was ever obtaining the lock, only then could we expect the Gets to take less time as the size decreases. I guess I was expecting the SHARED lock to service all requests concurrently.

I agree with your threads analogy, except that in a shared memory environment I don't need locks: each thread can access the global memory directly, as the integrity of the data is guaranteed by some other mechanism in an outer loop.

So my question then is: do I even need the LOCK here? In this case I could explicitly use shared mem, but in general my workers will be spread right across the cluster.

Do Fences provide better parallelism?
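
For context, by fences I mean the active-target pattern, roughly (a sketch only, not something I've timed):

/* Fence-based (active-target) alternative: fences are collective over the window */
MPI_Win_fence(0, win);
MPI_Get(lpatm, szS, MPI_DOUBLE, 0, 0, 1, dtype, win);
MPI_Win_fence(0, win);   /* all gets are complete after this fence */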


devreal commented 3 years ago

I haven't had time to test with the updated code. But: I don't think your processes are serialized, since it's a shared lock. Nevertheless, shared locks don't come for free either. You can move the shared lock out of the loop and use MPI_Win_flush to wait for completion of the get. That will save you some overhead in the MPI library (especially on non-shared-memory, non-RDMA platforms). Can you give that a try and see whether it changes the outcome?
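
Roughly like this (a sketch adapted from the worker loop you posted; error handling omitted):

/* Take the shared lock on the master (rank 0) once, flush once per step */
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
for (s = 0; s < NSTEPS; s++) {
  MPI_Barrier(MPI_COMM_WORLD);                          /* sync with master */
  MPI_Get(lpatm, szS, MPI_DOUBLE, 0, 0, 1, dtype, win);
  MPI_Win_flush(0, win);                                /* get is now complete */
}
MPI_Win_unlock(0, win);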

frizwi commented 3 years ago

Hmmm... I tried MPI_Win_flush as you suggested, but it didn't make any difference, probably because the lock is still serialising the transfers.

The code/data packages above may not be complete, here's the link for the complete code + data: https://cloudstor.aarnet.edu.au/plus/s/wKf0fFEkiW6b0Ox

Also, I'm open to any suggestions for this. Basically, in my application the master will always be ahead of the workers in time, so what I'm looking for is a method whereby the master does what it needs to, places/flushes/pushes the data somewhere, and the workers come and collect it at some point in the future. The master is guaranteed not to touch the data until all the workers have grabbed it. Depending on the load balancing, the workers may be sync'd.

In non-RMA, the master would use non-blocking sends and move on, thus achieving the desired outcome in terms of parallelism. What's the equivalent in RMA? Point-to-point is less desirable as it requires a stronger synchronisation between the master & workers as well as the knowledge of the offsets each worker requires, so more code.

Thanks for all your help

bosilca commented 3 years ago

There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark with all types of synchronization (fence, flush, lock/unlock), and while there is some variability, the performance is consistently a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over to TCP is about 20 times faster (mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw).

devreal commented 3 years ago

@frizwi I ran your example on an IB system and get the following runtimes:

2W:

001 time = 0.056773
002 time = 0.055926

4W:

001 time = 0.041799
002 time = 0.038467
003 time = 0.041928
004 time = 0.042871

8W:

Timing stats:
001 time = 0.024796
002 time = 0.025860
003 time = 0.026318
004 time = 0.025359
005 time = 0.024309
006 time = 0.025429
007 time = 0.024707
008 time = 0.023455

That seems to be what you expect, right? Looking at your example, I would encourage you to move the shared locks out of the main loop and use MPI_Rget+MPI_Wait instead, as that saves a bit of overhead.

Bottom line: MPI_Win_create in combination with shared memory only works well under certain circumstances (e.g., when using xpmem) but it should work as expected on networks supporting RDMA, like IB. It's unfortunate that MPI currently won't tell you what it does under the hood and whether it can serve your gets efficiently or not...

frizwi commented 3 years ago

Thanks @devreal, so it sounds like it is possible to get the scaling that I'm looking for, as per your post above. The bad news is that I'm unable to reproduce the scaling on my cluster. I've attached the output of ompi_info; if you could post yours, I could try to get it built the same way: ompi-info.gadi.txt

Are you running it all on one node, or distributed across the cluster? I was doing my experiments in an interactive job on a single node.

In any case, I've made the changes to use MPI_Rget+MPI_Wait as per your suggestion, like so (relevant section on the worker):

/* Assuming this lock is shared */
MPI_Request req;   /* completion request for MPI_Rget */
err = MPI_Win_lock_all(0, win);
MPI_ERR(err);

/* step loop, with barrier to sync with the master */
for (s=0; s<NSTEPS; s++) {
  double t0, t1, dt;

  /* Sync with master */
  MPI_Barrier(MPI_COMM_WORLD);

  t0 = MPI_Wtime();

  err = MPI_Rget(lpatm, szS, MPI_DOUBLE, 0, 0, 1, dtype, win, &req);
  MPI_ERR(err);

  err = MPI_Wait(&req, MPI_STATUS_IGNORE);
  MPI_ERR(err);

  t1 = MPI_Wtime();
  dt = t1-t0;

  MPI_Gather(&dt, 1, MPI_DOUBLE, NULL, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
} // for NSTEPS
err = MPI_Win_unlock_all(win);
MPI_ERR(err);

Is this what you mean?

So the window is only ever locked once for the duration of each worker process, then Rget + Wait take care of the transfers; protection of the target's memory is handled by some other means, in this case MPI_Barrier.

It's evident from your example that there is some parallelism/shared memory transfer going on so that gives me hope that with the right settings/build it can be done!

Just to recap, is this the best approach for the master->worker data transfers w/MPI RMA? It certainly fits in with the workflow that I have in my application.

frizwi commented 3 years ago

On a different cluster, I get these timings: 2W:

001 time = 0.019260
002 time = 0.019432

4W:

001 time = 0.014843
002 time = 0.013651
003 time = 0.014758
004 time = 0.016316

8W:

001 time = 0.010753
002 time = 0.010745
003 time = 0.011163
004 time = 0.010900
005 time = 0.010377
006 time = 0.011214
007 time = 0.010494
008 time = 0.010068

Not as good scaling as yours, but at least it's not de-scaling like on the other (gadi) cluster I reported on about a week ago!

Its ompi_info is here: ompi-info.pearcey.txt

devreal commented 3 years ago

@frizwi Yes, the changes to the code are what I was thinking (although my understanding is that you're only ever reading from the process with rank 0, so taking a shared lock on just that process would be sufficient).

I could recreate the scaling both on shared memory and with a single process per node.

As I said earlier, MPI_Win_create won't work well on shared memory unless you've built either Open MPI or UCX against xpmem [1]. Both my Open MPI and UCX are built with xpmem support. If the xpmem kernel module is not installed on your system, you would have to convince your admin to install it; only then can a process access another process' memory directly (using knem [2] might work as well, but knem doesn't seem to be installed on our systems). On multiple nodes, the IB hardware will take care of accessing memory on a different node. In both cases, you will get the desired behavior of transfers without requiring help from the target process.

[1] https://github.com/hjelmn/xpmem [2] https://knem.gitlabpages.inria.fr/doc/knem-api.html
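
One way to check whether a given build has xpmem support (assuming it was enabled at configure time, so it shows up in the configure line and/or component list) is:

ompi_info | grep -i xpmem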

frizwi commented 3 years ago

@devreal Yes, that's right, only rank 0 will ever write into the memory referenced by that win, and all other ranks will only ever read. In MPI_Win_create, ranks > 0 pass NULL for the buffer. MPI_Win_lock seems to work, as it's not a "real" lock in the mutex sense, otherwise this wouldn't work. I'll only ever call it once, at the start of the application, for each worker rank.

Finally, are there any assert values that would help here?

devreal commented 3 years ago

I cannot think of any info key that would apply here.

I would appreciate it if you could keep us updated on whether performance meets your expectations once you run your application on the target cluster. In the meantime, feel free to close this issue if you feel like it's been solved.

frizwi commented 3 years ago

Thanks @devreal, will keep you posted.

Just one last thing before I close this ticket: I have uploaded window maps for higher numbers of workers here: https://cloudstor.aarnet.edu.au/plus/s/TbRz4TcwEkKJNNc

So I am curious how far we can take the scaling: 2-4-8 seems to scale nicely, as per your post earlier, but how about 40-95, and up to 187? Could you try it on your cluster? Does ~100W get to InfiniBand's theoretical limit?