ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
573 stars 381 forks source link

prov/efa: cq data not available with FI_PEEK #9227

Closed raffenet closed 1 year ago

raffenet commented 1 year ago

Describe the bug In MPICH, we transmit the source rank of pt2pt messages using CQ data when available. In testing MPICH+efa, we found that an MPI_Probe operation does not return the correct source rank in the MPI_Status object. The cause seems to be that the CQ data is not available for FI_PEEK operations processed in util_srx_peek.

To Reproduce

  1. configure and build MPICH 4.1.2 with libfabric main branch or v1.19.0rc1. EDIT: adding additional configure flags to speed up build time:
    ./configure --with-device=ch4:ofi --with-libfabric=</path/to/install> --with-datatype-engine=dataloop --disable-cxx --disable-fortran
  2. Build and run this MPI reproducer using MPICH+efa. This can be done on a single node with the environment variable MPIR_CVAR_NOLOCAL=1.
    mpicc foo.c
    MPIR_CVAR_NOLOCAL=1 mpiexec -n 2 ./a.out
#include <mpi.h>
#include <assert.h>
#include <stdlib.h>

int main(void) {
  MPI_Init(NULL, NULL);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    MPI_Send(NULL, 0, MPI_DATATYPE_NULL, 0, 0, MPI_COMM_WORLD);
  }

  if (rank == 0) {
    MPI_Status status;
    MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    assert(status.MPI_SOURCE == 1);
    MPI_Recv(NULL, 0, MPI_DATATYPE_NULL, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    assert(status.MPI_SOURCE == 1);
  }

  MPI_Finalize();
  return 0;
}

Expected behavior The above reproducer should complete normally without triggering either assertion.

Output With FI_LOG_LEVEL=debug, we see this output directly before the assert failure:

libfabric:52078:1692287803::efa:ep_data:efa_rdm_msg_generic_recv():926<debug> efa_rdm_msg_generic_recv: iov_len: 0 tag: 0 ignore: 60000000000000 op: 1 flags: 1080000
libfabric:52078:1692287803::core:ep_ctrl:util_srx_peek():620<debug> Message not found
libfabric:52079:1692287803::efa:ep_data:efa_rdm_msg_generic_recv():926<debug> efa_rdm_msg_generic_recv: iov_len: 0 tag: 0 ignore: 6000007fffffff op: 1 flags: 1080000
libfabric:52079:1692287803::core:ep_ctrl:util_srx_peek():620<debug> Message not found
libfabric:52078:1692287803::efa:ep_data:efa_rdm_msg_generic_recv():926<debug> efa_rdm_msg_generic_recv: iov_len: 0 tag: 0 ignore: 60000000000000 op: 1 flags: 1080000
libfabric:52078:1692287803::core:ep_ctrl:util_srx_peek():623<debug> Message found
a.out: foo.c:18: main: Assertion `status.MPI_SOURCE == 1' failed.

Environment: provider: efa, endpoint type: rdm

Additional context cc: @shijin-aws @aingerson

shijin-aws commented 1 year ago

The issue here is the cq data is not written into the cq in util_srx_peek

https://github.com/ofiwg/libfabric/blob/583c1c495a1491cb1874b073845e45ce2636dc9c/prov/util/src/util_srx.c#L638-L640

And currently data is not part of struct fi_peer_rx_entry

To fix this issue we need to add that field so util_srx_peek can retrieve such info and write to cq

aingerson commented 1 year ago

@shijin-aws That seems reasonable to me. Would you like me to add it to the API and the util implementation?

shijin-aws commented 1 year ago

It will be great if you could update that to API and util, I will send you a patch for the efa change(just need to add an argument in get_msg/get_tag call). We can make them in the same PR.

shefty commented 1 year ago

I'd like to back-up on this flow somewhat. MPI_Probe() is used prior to the app posting a receive buffer to get the message. It expects to determine the size of the message buffer that's needed. If we're dealing with a large transfer that uses a rendezvous protocol, the only data that MPI_Probe() will match with is some sort of ready-to-send message. Requiring that the CQ data be available at this time, prior to the actual message being sent, doesn't seem right. This is equivalent to requiring that the first X bytes of data be available and is forcing an implementation, including the wire protocol format.

Yes, we can modify struct fi_peer_rx_entry to include the CQ data. The larger question is should the guarantee be made that CQ data MUST be present and accessible in the first or only packet of a larger transfer?

shijin-aws commented 1 year ago

I agree with Sean on this. However, the man page does say

If a peek request locates a matching message, the operation will complete successfully. The returned completion data will indicate the meta-data associated with the message, such as the message length, completion flags, available CQ data, tag, and source address. The data available is subject to the completion entry format (e.g. struct fi_cq_tagged_entry).

in this sense, MPICH isn't violating the man page?

shefty commented 1 year ago

Correct, MPICH isn't violating the man page, but the man page might be violating common sense. :) I think the data an app should be guaranteed to get using peek is the tag, size, and source address, but no actual data.

I still okay with modifying fi_peer_rx_entry, so we can get the current code to work. I'm less sure we want to enforce that requirement going forward.

shijin-aws commented 1 year ago

@shefty I guess you wouldn't make this as a 1.19.0 blocker?

shefty commented 1 year ago

If MPICH will not work, then, yes, I'd like to have a fix for v1.19, especially if this could be considered a regression from MPICH's perspective. A fix doesn't seem that difficult, so I'd delay v1.19.0 until we have it.

@raffenet, does v1.18.x work?

shijin-aws commented 1 year ago

Yes, I can confirm it is a regression compared to v1.18.x. And it broke after EFA starts using the util srx implementation.

I also tested the reproducer with Open MPI and it failed in the same way

aingerson commented 1 year ago

@shefty @shijin-aws I'm almost done with a fix. I'll open a PR today for it

raffenet commented 1 year ago

If MPICH will not work, then, yes, I'd like to have a fix for v1.19, especially if this could be considered a regression from MPICH's perspective. A fix doesn't seem that difficult, so I'd delay v1.19.0 until we have it.

@raffenet, does v1.18.x work?

Sorry, I was out for a few days. I will spin up an instance and try it today.

For context, the use of CQ data to include source information was added in order to support a larger user tag range for MPI applications. We can encode the source in the tag bits instead. It is supported in the code today, CQ data is just preferred. If we can't use CQ data because of a limitation in FI_PEEK, we'd need a way to know that at init time so the library can adjust.