ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

Begin Step slower on large meshes (Benchmarking Adios) #4254

Open abhishek1297 opened 1 month ago

abhishek1297 commented 1 month ago

We, the Melissa team, are currently trying to shift from our ZeroMQ implementation to ADIOS2 in order to leverage RDMA support. After several issues with ADIOS SST, we were finally able to run it on Jean Zay, which uses an OmniPath network, by setting some MPI variables as follows:

export OMPI_MCA_plm=slurm
export OMPI_MCA_mtl=psm2
export OMPI_MCA_osc=rdma
export OMPI_MCA_pml=ob1
export PSM2_MULTIRAIL=1 # required when the reader is on a different node.

These flags ensure that DataTransport=rdma is set properly and that no runtime errors are thrown.

On Jean Zay, the Open MPI PML is configured with the cm (default) and ob1 options. If we use cm and pass DataTransport=rdma, ADIOS overrides this choice to evpath; with ob1 it keeps rdma. So we know RDMA is being used.
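For context, this is roughly how the SST engine and the RDMA data plane are requested from the Python side (a minimal sketch rather than our actual Melissa code; the IO and stream names are placeholders, and the set_engine/set_parameters/open calls assume the ADIOS2 Python bindings we profile below):

from mpi4py import MPI
from adios2 import Adios
import adios2.bindings as bindings

adios = Adios(comm=MPI.COMM_WORLD)
io = adios.declare_io("mesh_io")              # placeholder IO name
io.set_engine("Sst")
io.set_parameters({"DataTransport": "rdma"})  # overridden to evpath if the fabric cannot be used (see above)
engine = io.open("mesh_stream", bindings.Mode.Read)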

Setup

We are now benchmarking ADIOS's communication latency against our ZeroMQ implementation, using a heat PDE solver that sends a 1000x1000 mesh for 100 timesteps.

We are running this benchmark on 3 nodes.

Problem

Strangely, ADIOS seems slower than the ZeroMQ implementation. The total runtime of the entire program is:

zmq=00:08:46
adios=00:15:09

The post-reception processing is exactly the same for both ZeroMQ and ADIOS. So I did some profiling of the reader's receiving loop and found that the begin_step call is extremely slow in our case when the mesh is as large as 1000x1000, taking most of the execution time.

These are the memory_profiler logs for the functions called within receive(), sorted by their total execution time.

Mon Jul 22 14:02:45 2024    internode/nodes3/1000x1000/rdma_1K_internode/profiles/receive.prof

         449269 function calls (439269 primitive calls) in 906.311 seconds

   Ordered by: internal time
   List reduced from 191 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
=======================================================================
     1000  863.424    0.863  863.426    0.863 engine.py:49(begin_step)
=======================================================================
     1000   30.814    0.031   32.831    0.033 iterative_moments.py:31(increment)
     1000    3.874    0.004    3.876    0.004 engine.py:123(get)
     1000    2.017    0.002    2.017    0.002 iterative_moments.py:48(get_delta)
       10    1.431    0.143    1.431    0.143 io.py:245(open)
       10    1.251    0.125    1.251    0.125 engine.py:32(close)
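(For reference, a table like this can be produced with Python's cProfile/pstats, whose output format matches what is shown above; a minimal sketch, with receive() standing in for our actual receiving loop:)

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
receive()                                    # placeholder for the reader's receiving loop
profiler.disable()
profiler.dump_stats("profiles/receive.prof")

# Top 20 functions by internal time, as in the tables shown here.
stats = pstats.Stats("profiles/receive.prof")
stats.sort_stats("tottime").print_stats(20)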

I am assuming that a major chunk of the time inside begin_step() is due to some MPI collective calls, but it seems absurdly large (0.8 seconds per call) for RDMA. In the SstVerbose output, ADIOS is picking the hfi_* OmniPath interfaces, so we know that it at least chooses RDMA correctly.

I did the same run with a 100x100 mesh, and the begin_step time is negligible.

Mon Jul 22 15:38:01 2024    internode/nodes3/100x100/rdma_100_internode/profiles/receive.prof

         513337 function calls (503337 primitive calls) in 7.644 seconds

   Ordered by: internal time
   List reduced from 191 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    1.455    0.145    1.455    0.146 io.py:245(open)
     1000    1.422    0.001    1.425    0.001 engine.py:123(get)
       10    1.103    0.110    1.103    0.110 engine.py:32(close)
11000/1000    0.427    0.000    0.760    0.001 arrayprint.py:789(recurser)
     1000    0.223    0.000    3.592    0.004 base_server.py:668(get_data)
     1000    0.207    0.000    0.225    0.000 iterative_moments.py:31(increment)
     1000    0.181    0.000    0.519    0.001 arrayprint.py:934(fillFormat)
=======================================================================
     1000    0.152    0.000    0.153    0.000 engine.py:49(begin_step)
=======================================================================

_I am not really sure whether this is hardware-specific, but can you help us investigate what could potentially be a bottleneck inside the begin_step calls?_

I have attached our reader-side code file, base_server.py.

Extra

To calculate the communication time for each timestep, I am doing the following:

engine.begin_step()          # not counted in the communication time
start = MPI.Wtime()
engine.get()                 # read the mesh for this timestep
engine.end_step()
end = MPI.Wtime()
comm_time = end - start

_Should the begin_step call also be included in the communication time?_

The ZeroMQ reception is a bit different: we send partial timesteps and stitch the partial data together before processing it. So the ZeroMQ communication time starts when the first partial timestep is received and ends when the entire timestep has been stitched together.

eisenhauer commented 1 month ago

Hi Abhishek. OK, let's try to sort out what is going on. Probably we should talk through what BeginStep does, and maybe we can figure out what is happening. We're talking about the SST reader-side BeginStep here, I think, so the first thing it does is wait until there is a timestep available for it to read. It looks like your code calls BeginStep with the default timeout parameter, which means that it will block until a new incoming timestep is available.

Upon receipt of timestep metadata by reader rank 0 (or immediately, if it has already arrived), rank 0 then broadcasts the metadata to the other ranks. This is usually not an issue unless your receiver is part of a large collective and the metadata is large.

The last thing that BeginStep does is to "install" the received metadata on each reader rank. This is a pretty quick process in BP5, but it could be expensive if your metadata is complex (that is, if you have a lot of variables, blocks, etc.). Big data doesn't necessarily mean big metadata: you can write one 1K block or one 100GB block and the metadata size is pretty much the same, but if you write 100,000 1K blocks or 100,000 variables, that's a different story.
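To make that concrete, here is a rough writer-side sketch (hypothetical names, and it assumes the define_variable signature of the ADIOS2 Python bindings used elsewhere in this thread): both variants move the same number of bytes per step, but the second produces on the order of 100,000 metadata records instead of one, and it is that record count, not the byte count, that makes the metadata handling in BeginStep expensive.

import numpy as np

# `io` is an adios2 IO object declared earlier (hypothetical).

# Variant A: one large block -> one variable, one metadata record per step.
big = np.zeros((100_000, 256), dtype=np.float64)               # ~200 MB of data
var_big = io.define_variable("field", big, big.shape, [0, 0], big.shape)

# Variant B: the same bytes split across 100,000 tiny variables ->
# ~100,000 metadata records per step; metadata handling now dominates.
small = np.zeros(256, dtype=np.float64)                        # ~2 KB each
vars_small = [
    io.define_variable(f"field_{i}", small, small.shape, [0], small.shape)
    for i in range(100_000)
]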

I haven't gone through your code enough to guess how big your metadata might be, but that's maybe something you can tell me off the top of your head. I'm guessing that at the scales you're probably running at, the MPI broadcast is probably not the problem, so I'd be inclined to think about whether or not BeginStep is waiting for data. It looks like you have a number of input engines and you're calling BeginStep with the default timeout, so when you call engine_dict[sim_id].begin_step(), that call will block until engine_dict[sim_id] has finished whatever it is doing, called end_step(), and sent us metadata. There is a timeout of 0.5 sec commented out there and, based upon the way check_data is used, you might want to uncomment that, or even reduce it to 0 sec. A zero-second timeout essentially does a poll(): metadata arrives asynchronously (it is received by a background network handler thread over TCP, not RDMA, because it's usually smaller than the data), and begin_step() with a zero timeout will simply check whether the metadata has already arrived and return StepStatus.NotReady if it hasn't. You can check again later to see if it has arrived by then, but in the meantime you can check your other incoming connections.
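In case it helps, here is a rough sketch of that polling pattern (engine_dict and process_step are placeholders rather than your actual code, and the begin_step arguments follow the ADIOS2 Python bindings, where a zero timeout returns StepStatus.NotReady instead of blocking):

import adios2.bindings as bindings

active = set(engine_dict)                 # sim_ids whose streams are still open
while active:
    for sim_id in list(active):
        engine = engine_dict[sim_id]
        status = engine.begin_step(bindings.StepMode.Read, 0.0)   # poll, don't block
        if status == bindings.StepStatus.NotReady:
            continue                      # metadata not here yet; check the other engines
        if status == bindings.StepStatus.EndOfStream:
            engine.close()
            active.discard(sim_id)        # this writer has finished
            continue
        process_step(engine)              # placeholder: get() + end_step() + statistics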

I'm not quite sure I understand everything that is going on in get_data() yet, but let's tackle the begin_step() problem first. Does what I've said above make sense? Possibly at the larger scales your writers are taking longer to produce the data, so the reader has to wait more and that shows up in begin_step(), because that's where the waiting happens. It doesn't really consume CPU on the reader, but you might need to tweak the timeout parameter to fix it. At least that's my take given my limited understanding of what you're doing, but if I've misunderstood, let's dig further.

abhishek1297 commented 1 month ago

Hi again, thanks for the detailed answer.

You are right, it is better to have a non-blocking begin_step() on the reader side. But even with timeoutSeconds=0, the overall execution time remains similar. It seems I was so hung up on the reader-side latency issue that I overlooked the execution time on the writer side. Each iteration on the writer side solves Au = F, which should take roughly ~2.5 seconds/timestep, plus a few milliseconds for sending the data.

Here is the code for the writer side.

Summary of the writer-side code:

Below, I am printing the duration of the solver (Au = F) and of the send (begin_step -> put -> end_step) at each iteration.

With ADIOS, the solver time increases to ~4.5 seconds/timestep, even though we are using the same solver as with ZeroMQ. The only thing that changes is that the mesh is sent via ADIOS.

Iteration 0: solver duration=4.626270, send duration=0.002057
Iteration 1: solver duration=4.567420, send duration=0.001293
Iteration 2: solver duration=4.568082, send duration=0.000520
Iteration 3: solver duration=4.547276, send duration=0.001365
Iteration 4: solver duration=4.544529, send duration=0.000505
Iteration 5: solver duration=4.530054, send duration=0.001410

When commenting out only the data-transfer part of the loop (begin_step -> put -> end_step), the time drops by ~0.3-0.5 seconds/timestep.

Iteration 0: solver duration=4.195277
Iteration 1: solver duration=4.234625
Iteration 2: solver duration=4.261027
Iteration 3: solver duration=4.025121
Iteration 4: solver duration=4.312538
Iteration 5: solver duration=4.288451

When commenting out both init(comm) & define_variable and begin_step() -> put() -> end_step(), the solver time goes back to ~2.5 seconds/timestep:

Iteration 0: solver duration=2.670431
Iteration 1: solver duration=2.643374
Iteration 2: solver duration=2.622191
Iteration 3: solver duration=2.622029
Iteration 4: solver duration=2.622040
Iteration 5: solver duration=2.622391

The MPI communicator might have something to do with this, because even after commenting out the mesh-sending part of the loop, the solver time did not decrease as much. So I think something that happens once adios2_mpi_init(comm) is called is causing this slowdown.

There was another issue, probably also because of the communicator. I had posted an issue about a problem we were seeing when meshes become large: on bigger meshes, the same conjgrad Fortran subroutine (from the file above) was deadlocking on MPI_Send with ADIOS. I was able to resolve it by replacing the send with a buffered MPI_Bsend. I first tried a non-blocking MPI_Isend, but I saw garbage (NaN) values being sent via ADIOS, so I used a buffered send instead. However, this was all happening on the OmniPath network, which has been nothing short of a headache to run with, so I cannot be certain about the root cause.