Closed wzamazon closed 1 year ago
PR https://github.com/open-mpi/ompi/pull/11752 tries to address the issue.
I am writing a C reproducer to demonstrate the issue.
That's why the convertor is the right place to handle all these cases. I really wonder if this is really an optimization or it is just a way to create more headaches.
Main and 5.0.x use the convertor correctly, after some recent refactoring.
The problem was introduced in 2.0.x by the following commit:
commit 56869bff38e264ee91ea68ae2fabfafe9456548e
Author: Jithin Jose <jithin.jose@intel.com>
Date: Thu May 7 14:52:46 2015 -0700
Avoid datatype pack/unpack for contiguous data on homogenous systems.
Signed-off-by: Jithin Jose <jithin.jose@intel.com>
and it impacts releases from 2.0.x to 4.1.x.
The PR has been merged and backported.
Thank you for taking the time to submit an issue!
Background information
MPAS (https://mpas-dev.github.io/) is a weather forecasting application. It currently crashes when using Open MPI's pml/cm + mtl/ofi code path, while the pml/ob1 + btl/tcp code path works.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
built from source
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem
I did some debugging, and I believe it is caused by a bug in pml/cm's heavy send request initialization.
In the function
MCA_PML_CM_SEND_REQUEST_INIT_COMMON
, there is a special branch that handles the case where the MPI datatype is contiguous. Code: https://github.com/open-mpi/ompi/blob/c053cb822770738de5df0a3e998899398d35e728/ompi/mca/pml/cm/pml_cm_sendreq.h#L232
However,
MCA_PML_CM_SEND_REQUEST_INIT_COMMON
is only applied to thin send requests; heavy send requests are initialized by a different function, MCA_PML_CM_HVY_SEND_REQUEST_INIT_COMMON,
and that function lacks the special treatment for datatypes with a contiguous data layout. Code: https://github.com/open-mpi/ompi/blob/c053cb822770738de5df0a3e998899398d35e728/ompi/mca/pml/cm/pml_cm_sendreq.h#LL153C9-L153C48
The critical line is:
Without this condition, a heavy send request sends from the buffer without applying the proper data offset (
datatype->super.true_lb
), so the receiver receives wrong data.