Open a-szegel opened 1 year ago
I thought this sounded vaguely familiar.
As far as I can tell, this was never ported back to v4.0.x:
https://github.com/open-mpi/ompi/commit/dc8ead901ef63ddca6aa2354d1fc5a77c1131580
Can you try applying this to your branch and retrying?
It may not apply cleanly, but on the v4.0.x branch it is here:
https://github.com/open-mpi/ompi/blob/v4.0.x/ompi/mca/osc/rdma/osc_rdma_accumulate.c#L773
Well, it applied more cleanly than I thought; here's a branch to try:
https://github.com/open-mpi/ompi/compare/v4.0.x...awlauria:ompi:rdma_potential_unaligned_mem_v4.0.x
If it works I can open a PR, though v4.0.x is long in the tooth and the RMs may be wary of taking it.
If nothing else it can go into v4.1.x (the fix also didn't get ported there).
Edit - actually looking at the stack traces this will most likely not fix it. But it's possible something similar needs to be done. Sorry. :(
Background information
AWS was looking at AWS MTT issues and noticed that on ARM architectures, the IBM benchmark
ibm/onesided/c_reqops
was failing due to:
I was able to isolate this to a configuration that does not include EFA.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.x
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone:
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem