Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
master branch top of tree: commit eca00a7a3b179
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
$ git submodule status
+299d2a489aa53546e1320eb3fd7e8d726f16b251 opal/mca/hwloc/hwloc2/hwloc (dev-3067-g299d2a4)
+ee72a2b65b1b6480753fc12d500c51ebe4fc23aa opal/mca/pmix/pmix4x/openpmix (v1.1.3-2505-gee72a2b)
+545863e6dc055233456116da6dc85be2b307f8e2 prrte (dev-30707-g545863e)
Please describe the system on which you are running
N/A
Details of the problem
The upper 2 bits of an ompi tag encode the synchronous send (ssend) flag and the synchronous send ack flag. Because the mtl_ofi_create_recv_tag_CQD and mtl_ofi_create_recv_tag functions both use ompi_mtl_ofi.sync_proto_mask instead of ompi_mtl_ofi.sync_send when generating their "ignore" masks, the recv tag-matching logic disregards the ack bit as well, so a posted receive can match a tag that has the ack bit set.
This is a problem because ssend is implemented internally as a send followed by a receive. If a user has already posted a receive before the ssend is issued, that user's receive may consume the internal message intended for the ssend's internal receive.
Updating both mtl_ofi_create_recv_tag_CQD and mtl_ofi_create_recv_tag to use ompi_mtl_ofi.sync_send as the ignore mask fixes this.
For example, consider the following:
If run with a debug build, that code will produce the following failed assertion:
Updating mtl_ofi_create_recv_tag_CQD and mtl_ofi_create_recv_tag functions to both use ompi_mtl_ofi.sync_send fixes this: