raffenet closed this issue 1 year ago
The issue here is that the CQ data is not written to the CQ in util_srx_peek.
Currently, the CQ data is not part of struct fi_peer_rx_entry.
To fix this, we need to add that field so util_srx_peek can retrieve the data and write it to the CQ.
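To make the proposal concrete, here is a minimal sketch of the idea, not the actual libfabric code. Every name in it (`sketch_rx_entry`, `sketch_cq_entry`, `SKETCH_REMOTE_CQ_DATA`, `sketch_peek_complete`) is a hypothetical stand-in. The real struct fi_peer_rx_entry and util_srx_peek have different layouts and signatures. It only shows that once the matched entry carries the CQ data, the peek path can copy it into the completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: these are NOT the real libfabric definitions. */
#define SKETCH_REMOTE_CQ_DATA (UINT64_C(1) << 0) /* sender attached CQ data */

struct sketch_rx_entry {   /* plays the role of struct fi_peer_rx_entry */
    uint64_t flags;
    uint64_t tag;
    size_t   msg_size;
    uint64_t cq_data;      /* the field proposed in this thread */
};

struct sketch_cq_entry {   /* plays the role of a tagged CQ entry */
    uint64_t flags;
    size_t   len;
    uint64_t data;
    uint64_t tag;
};

/* Fill the completion for a successful peek from the matched entry. */
static void sketch_peek_complete(const struct sketch_rx_entry *rx,
                                 struct sketch_cq_entry *cqe)
{
    assert(rx != NULL && cqe != NULL);
    cqe->flags = rx->flags;
    cqe->len   = rx->msg_size;
    cqe->tag   = rx->tag;
    /* Report CQ data only when the sender actually attached it. */
    cqe->data  = (rx->flags & SKETCH_REMOTE_CQ_DATA) ? rx->cq_data : 0;
}
```

Without the `cq_data` member on the entry, the last assignment has nothing to read from, which matches the symptom described in this issue.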
@shijin-aws That seems reasonable to me. Would you like me to add it to the API and the util implementation?
It would be great if you could update the API and util. I will send you a patch for the efa change (it just needs an extra argument in the get_msg/get_tag call). We can put them in the same PR.
I'd like to back up on this flow somewhat. MPI_Probe() is used before the app posts a receive buffer to get the message. It expects to determine the size of the message buffer that's needed. If we're dealing with a large transfer that uses a rendezvous protocol, the only data that MPI_Probe() will match with is some sort of ready-to-send message. Requiring that the CQ data be available at this time, prior to the actual message being sent, doesn't seem right. This is equivalent to requiring that the first X bytes of data be available, and it forces an implementation, including the wire protocol format.
Yes, we can modify struct fi_peer_rx_entry to include the CQ data. The larger question is should the guarantee be made that CQ data MUST be present and accessible in the first or only packet of a larger transfer?
I agree with Sean on this. However, the man page does say:
If a peek request locates a matching message, the operation will complete successfully. The returned completion data will indicate the meta-data associated with the message, such as the message length, completion flags, available CQ data, tag, and source address. The data available is subject to the completion entry format (e.g. struct fi_cq_tagged_entry).
So in this sense, MPICH isn't violating the man page?
Correct, MPICH isn't violating the man page, but the man page might be violating common sense. :) I think the data an app should be guaranteed to get using peek is the tag, size, and source address, but no actual data.
I'm still okay with modifying fi_peer_rx_entry, so we can get the current code to work. I'm less sure we want to enforce that requirement going forward.
@shefty I guess you wouldn't make this a 1.19.0 blocker?
If MPICH will not work, then, yes, I'd like to have a fix for v1.19, especially if this could be considered a regression from MPICH's perspective. A fix doesn't seem that difficult, so I'd delay v1.19.0 until we have it.
@raffenet, does v1.18.x work?
Yes, I can confirm it is a regression compared to v1.18.x. It broke after EFA started using the util srx implementation.
I also tested the reproducer with Open MPI and it failed in the same way.
@shefty @shijin-aws I'm almost done with a fix. I'll open a PR for it today.
Sorry, I was out for a few days. I will spin up an instance and try it today.
For context, the use of CQ data to include source information was added in order to support a larger user tag range for MPI applications. We can encode the source in the tag bits instead. It is supported in the code today, CQ data is just preferred. If we can't use CQ data because of a limitation in FI_PEEK, we'd need a way to know that at init time so the library can adjust.
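As an illustration of the tag-bits fallback mentioned above, here is a minimal sketch. The 24/40 bit split and all `sketch_` names are assumptions made up for this example. They are not MPICH's actual tag layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: top 24 bits carry the source rank,
 * the low 40 bits carry the user tag. */
#define SKETCH_RANK_BITS 24
#define SKETCH_TAG_BITS  (64 - SKETCH_RANK_BITS)
#define SKETCH_TAG_MASK  ((UINT64_C(1) << SKETCH_TAG_BITS) - 1)

/* Combine source rank and user tag into one 64-bit match tag. */
static uint64_t sketch_pack_tag(uint32_t src_rank, uint64_t user_tag)
{
    assert(src_rank < (UINT32_C(1) << SKETCH_RANK_BITS));
    assert(user_tag <= SKETCH_TAG_MASK);
    return ((uint64_t)src_rank << SKETCH_TAG_BITS) | user_tag;
}

/* Recover the source rank from a tag returned by a peek or receive. */
static uint32_t sketch_unpack_rank(uint64_t wire_tag)
{
    return (uint32_t)(wire_tag >> SKETCH_TAG_BITS);
}

/* Recover the user-visible tag. */
static uint64_t sketch_unpack_tag(uint64_t wire_tag)
{
    return wire_tag & SKETCH_TAG_MASK;
}
```

Since the tag is among the metadata a peek returns, the source rank survives FI_PEEK in this scheme. The cost is the reduced user tag range, which is why CQ data is preferred when it is usable.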
Describe the bug: In MPICH, we transmit the source rank of pt2pt messages using CQ data when available. In testing MPICH+efa, we found that an MPI_Probe operation does not return the correct source rank in the MPI_Status object. The cause seems to be that the CQ data is not available for FI_PEEK operations processed in util_srx_peek.
To Reproduce: MPIR_CVAR_NOLOCAL=1.
Expected behavior: The above reproducer should complete normally without triggering either assertion.
Output: With FI_LOG_LEVEL=debug, we see this output directly before the assert failure:
Environment: provider: efa, endpoint type: rdm
Additional context: cc @shijin-aws @aingerson