abhishek1297 opened this issue 4 months ago
Hi Abhishek. OK, let's try to sort out what is going on. Probably we should talk through what BeginStep does and maybe we can figure out what is happening. We're talking about SST reader-side BeginStep here I think, so the first thing it does is wait until there is a timestep available for it to read. It looks like your code calls BeginStep with the default timeout parameter, which means that it will block until a new incoming timestep is available. Upon receipt of timestep metadata by reader rank 0 (or immediately if it has already arrived), rank 0 then broadcasts the metadata to the other ranks. This could be an issue for you if your reader is part of a large collective and the metadata is large. The last thing that BeginStep does is to "install" the received metadata on each reader rank. This is a pretty quick process in BP5, but it could be expensive if your metadata is complex. (That is, if you have a lot of variables, blocks, etc. Big data doesn't necessarily mean big metadata. You can write one 1K block or one 100GB block and the metadata size is pretty much the same, but if you write 100,000 1K blocks or 100,000 variables, that's different.)
I haven't gone through your code enough to see if I can guess how big your metadata might be, but that's maybe something you can tell me off the top of your head. I'm guessing that at the scales you're probably running at, the MPI_Broadcast is probably not the problem, so I'd be inclined to think about whether or not BeginStep is waiting for data. It looks like you have a number of input engines and you're calling BeginStep with the default timeout, so when you call engine_dict[sim_id].begin_step(), that call will block until engine_dict[sim_id] has finished whatever it is doing, called end_step() and sent us metadata. There is a timeout of 0.5 sec commented out there, and based upon the way check_data is used, you might want to uncomment that, or even reduce it to 0 sec. Zero seconds will essentially do a poll(). Metadata arrives asynchronously (received by a background network handler thread over TCP, not RDMA, because it's usually smaller than data), and begin_step() with a zero timeout will simply check whether the metadata has already arrived and return StepStatus.NotReady if it hasn't. You can check again later to see if it has arrived by then, but in the meantime you can check your other incoming connections.
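For illustration, here is a minimal sketch of that polling pattern, assuming the Python bindings used in base_server.py (a begin_step() that accepts timeoutSeconds and the StepStatus enum; the exact spelling can differ between ADIOS2 versions):

```python
import adios2

def poll_engine(engine):
    """Non-blocking check for a new step on one SST reader engine."""
    # timeoutSeconds=0 turns begin_step() into a poll: it returns
    # StepStatus.NotReady immediately if no metadata has arrived yet.
    status = engine.begin_step(timeoutSeconds=0.0)

    if status == adios2.StepStatus.NotReady:
        return False   # nothing here yet; go poll the other engines
    if status == adios2.StepStatus.EndOfStream:
        return False   # this writer has closed its stream

    # StepStatus.OK: the metadata is installed, so get() calls are valid.
    # ... schedule the reads here, then close the step:
    engine.end_step()
    return True
```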
I'm not quite sure I understand what all is going on in get_data() yet, but let's tackle the begin_step() problem first. Does what I've said above make sense? Possibly at the larger scales your writers are taking longer to produce the data, so the reader is having to wait more for the data, and that shows up in begin_step() because that's where the waiting happens. That doesn't really consume CPU on the reader, but you might need to tweak the timeout parameter to fix it. At least that's my take given my limited understanding of what you're doing, but if I've misunderstood, then let's dig further.
Hi again, thanks for the detailed answer.
You are right, it is better to have a non-blocking begin_step() on the reader side. But even with timeoutSeconds=0, the overall execution time still remains about the same. It seems I was so hung up on the reader-side latency issue that I overlooked the execution time on the writer side. Each iteration on the writer side solves Au=F, which should roughly take ~2.5 seconds/timestep, plus a few milliseconds for sending the data.
Here is the code for the writer side.

Summary of the writer-side code:
- comm_app = MPI_COMM_WORLD, converted with MPI_Comm_c2f(); the solver code calls Fortran routines.
- The melissa_* functions are just wrappers around the adios2_c API. We initialize Adios with the comm_app communicator.
- Each iteration solves Au=F and sends u.

Below, I am printing the duration of the solver (Au=F) and of the send (begin_step -> put -> end_step) at each iteration.
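To make the measurement concrete, the instrumentation is roughly the following (a Python sketch only; the actual writer is Fortran calling the melissa_*/adios2_c wrappers, and solve_step, engine, var_u and n_iterations are placeholders):

```python
import time

for it in range(n_iterations):
    # Solver phase: Au = F (placeholder for the Fortran conjugate-gradient solve).
    t0 = time.perf_counter()
    u = solve_step(A, F)
    t_solve = time.perf_counter() - t0

    # Send phase: one SST step per iteration.
    t0 = time.perf_counter()
    engine.begin_step()
    engine.put(var_u, u)
    engine.end_step()
    t_send = time.perf_counter() - t0

    print(f"Iteration {it}: solver duration={t_solve:.6f}, send duration={t_send:.6f}")
```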
With Adios, the solver time increases to ~4.5 seconds/timestep, even though we are using the same solver that is used with zmq. The only thing that changes is that the mesh is sent with Adios.
Iteration 0: solver duration=4.626270, send duration=0.002057
Iteration 1: solver duration=4.567420, send duration=0.001293
Iteration 2: solver duration=4.568082, send duration=0.000520
Iteration 3: solver duration=4.547276, send duration=0.001365
Iteration 4: solver duration=4.544529, send duration=0.000505
Iteration 5: solver duration=4.530054, send duration=0.001410
When commenting out only the data-transfer part of the loop (begin_step -> put -> end_step), the time drops by ~0.3-0.5 seconds/timestep:
Iteration 0: solver duration=4.195277
Iteration 1: solver duration=4.234625
Iteration 2: solver duration=4.261027
Iteration 3: solver duration=4.025121
Iteration 4: solver duration=4.312538
Iteration 5: solver duration=4.288451
When commenting out both init(comm) & define_variable as well as begin_step() -> put() -> end_step(), the solver time goes back to ~2.5 seconds/timestep:
Iteration 0: solver duration=2.670431
Iteration 1: solver duration=2.643374
Iteration 2: solver duration=2.622191
Iteration 3: solver duration=2.622029
Iteration 4: solver duration=2.622040
Iteration 5: solver duration=2.622391
The MPI communicator might have something to do with this, because even after commenting out the mesh-sending part of the loop, the solver time did not decrease as much. So I think something that happens after adios2_mpi_init(comm) is called is causing this slowdown.
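In other words, the comparison amounts to something like this (a sketch in Python with mpi4py rather than the actual Fortran/adios2_c code; compute_only_solver_iteration is a placeholder for the Au=F solve):

```python
import time
from mpi4py import MPI
import adios2

comm = MPI.COMM_WORLD

# Variant A: skip ADIOS entirely (baseline, ~2.5 s/iteration above).
# Variant B: initialize ADIOS on the same communicator but never write a step.
use_adios = True
if use_adios:
    ad = adios2.ADIOS(comm)       # class name/spelling varies by ADIOS2 version
    io = ad.DeclareIO("heat")     # declared but never used below

for it in range(6):
    t0 = time.perf_counter()
    compute_only_solver_iteration()   # placeholder for the Au=F solve
    print(f"Iteration {it}: solver duration={time.perf_counter() - t0:.6f}")
```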
There was another issue, probably also because of the communicator. I posted an issue about a problem we were getting when meshes become large: on bigger meshes, the same conjgrad Fortran subroutine (from the file above) was deadlocking on MPI_Send with Adios. I was able to resolve it by replacing it with a buffered MPI_Bsend. I first tried a non-blocking MPI_Isend, but I saw garbage (NaN) values being sent via Adios, so I used a buffered send instead. However, this was happening on the OmniPath network, which has been nothing short of a headache to run with, so I cannot be certain what the root cause is.
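For reference, the buffered-send workaround looks roughly like the mpi4py sketch below (the real code is the Fortran conjgrad routine). One common source of garbage values with MPI_Isend is reusing the send buffer before the request has completed; a buffered send avoids that by copying the message into the attached buffer up front:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

payload = np.full(1000, float(rank), dtype=np.float64)

# Attach a user buffer large enough for the outstanding Bsend plus overhead.
MPI.Attach_buffer(bytearray(payload.nbytes + MPI.BSEND_OVERHEAD))

if rank == 0 and comm.Get_size() > 1:
    # Bsend completes locally once the message is copied into the attached
    # buffer, so it cannot block waiting for a matching receive to be posted.
    comm.Bsend([payload, MPI.DOUBLE], dest=1, tag=0)
elif rank == 1:
    recv = np.empty_like(payload)
    comm.Recv([recv, MPI.DOUBLE], source=0, tag=0)

MPI.Detach_buffer()
```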
We (Melissa) are currently trying to shift from our zeromq implementation to Adios2 in order to leverage RDMA support. After several issues we faced when using Adios SST, we were finally able to run it on Jean-Zay, which has an OmniPath network, by setting some MPI variables as follows,

These flags ensure that DataTransport=rdma is set properly and throws no runtime errors. On Jean-Zay, OpenMPI's PML can be configured with the cm (default) and ob1 options. If we use cm and pass DataTransport=rdma, Adios overrides this choice to evpath. When using ob1, it keeps rdma. So we know RDMA is being used.
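For context, selecting the SST data plane looks roughly like this on the IO object (a sketch with the ADIOS2 Python bindings; the DataTransport parameter is from the SST documentation, and the exact binding spelling may differ by version):

```python
import adios2
from mpi4py import MPI

ad = adios2.ADIOS(MPI.COMM_WORLD)
io = ad.DeclareIO("heat_mesh")
io.SetEngine("SST")

# Request the RDMA data plane; SST may fall back to the evpath (TCP) data
# plane if the fabric cannot be used, which is what we saw with pml=cm.
io.SetParameters({"DataTransport": "RDMA"})

# Setting SstVerbose in the environment shows which interface/data plane SST
# actually selected (e.g. the hfi_* OmniPath devices mentioned below).
```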
Setup

We are now benchmarking adios' communication latency against our zeromq implementation, using a Heat PDE solver that sends a 1000x1000 mesh for 100 timesteps.
We are running this benchmark on 3 nodes.
Problem
Strangely, Adios seems slower than the zmq implementation. The total time for the entire program is,
Whatever post-reception processing is done is exactly the same in both zmq and adios. So, I did some profiling on the reader's receiving loop and found that the begin_step call is extremely slow in our case when the mesh is as large as 1000x1000, and it is taking most of the execution time.

These are the logs of memory_profiler for the functions within the receive() call, sorted by their total execution time.

I am assuming that the major chunk of the time inside begin_step() is due to some MPI collective calls, but it seems absurdly large (0.8 seconds per call) for RDMA. In the SstVerbose outputs, adios is picking the hfi_* OmniPath interfaces, so we know that, at least, it chooses RDMA correctly.

I did the same run but with a 100x100 mesh, and the begin_step time is negligible.

_I am not really sure if this is hardware-specific or not, but can you help us investigate what could potentially be a bottleneck inside begin_step calls?_

I provide our reader-side code in the file base_server.py (sketched below):
- receive(): the receiver loop, going over all currently active sst files until all writers have finished.
- check_data(): calls begin_step()
- get_data(): calls get() and end_step()
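For clarity, the overall shape of that loop is roughly the following (a simplified sketch, not the actual base_server.py; process() and the variable metadata are placeholders, and the binding spelling (get/Get, InquireVariable/inquire_variable) may differ by ADIOS2 version):

```python
import numpy as np
import adios2

def check_data(engine):
    # Wait for a new step from this writer (default blocking timeout).
    # return engine.begin_step(timeoutSeconds=0.5)
    return engine.begin_step()

def get_data(io, engine, var_name, shape):
    # Read the mesh for the step opened by check_data(), then close the step.
    var = io.InquireVariable(var_name)
    data = np.empty(shape, dtype=np.float64)
    engine.get(var, data)
    engine.end_step()
    return data

def receive(engine_dict, io_dict, var_name, shape):
    finished = set()
    while len(finished) < len(engine_dict):
        for sim_id, engine in engine_dict.items():
            if sim_id in finished:
                continue
            status = check_data(engine)
            if status == adios2.StepStatus.OK:
                process(get_data(io_dict[sim_id], engine, var_name, shape))
            elif status == adios2.StepStatus.EndOfStream:
                finished.add(sim_id)
```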
Extra
To calculate the communication time for each timestep I am doing the following,
_Should the begin_step call also be included in the communication time?_

zmq is a bit different in the reception, in that we send partial timesteps and then stitch the partial data together before processing it. So, the zmq communication time starts when the first partial timestep is received and ends when the entire timestep has been stitched together.