Closed ggouaillardet closed 5 years ago
Here is a possible fix.
The right fix might be to correctly set the peer flags so that ompi_osc_rdma_peer_local_base(peer) always returns true when peer is the current MPI task.
diff --git a/ompi/mca/osc/rdma/osc_rdma_comm.c b/ompi/mca/osc/rdma/osc_rdma_comm.c
index 1c16676..011b7c2 100644
--- a/ompi/mca/osc/rdma/osc_rdma_comm.c
+++ b/ompi/mca/osc/rdma/osc_rdma_comm.c
@@ -3,8 +3,8 @@
* Copyright (c) 2014-2018 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2016-2018 Intel, Inc. All rights reserved.
- * Copyright (c) 2017 Research Organization for Information Science
- * and Technology (RIST). All rights reserved.
+ * Copyright (c) 2017-2019 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* Copyright (c) 2017 IBM Corporation. All rights reserved.
* $COPYRIGHT$
*
@@ -795,7 +795,7 @@ static inline int ompi_osc_rdma_get_w_req (ompi_osc_rdma_sync_t *sync, void *ori
}
/* optimize self/local communication */
- if (ompi_osc_rdma_peer_local_base (peer)) {
+ if (ompi_osc_rdma_peer_local_base (peer) || (ompi_group_rank(module->win->w_group) == peer->rank)) {
return ompi_osc_rdma_copy_local ((void *) (intptr_t) source_address, source_count, source_datatype,
origin_addr, origin_count, origin_datatype, request);
}
Here is another possible fix:
diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c b/ompi/mca/osc/rdma/osc_rdma_component.c
index d379350..5afac94 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -19,6 +19,8 @@
* Copyright (c) 2016-2017 IBM Corporation. All rights reserved.
* Copyright (c) 2018 Cisco Systems, Inc. All rights reserved
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
+ * Copyright (c) 2019 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -722,7 +724,13 @@ static int allocate_state_shared (ompi_osc_rdma_module_t *module, void **base, s
ompi_osc_module_add_peer (module, peer);
- if (MPI_WIN_FLAVOR_DYNAMIC == module->flavor || 0 == temp[i].size) {
+ if (MPI_WIN_FLAVOR_DYNAMIC == module->flavor) {
+ if (module->use_cpu_atomics && peer_rank == my_rank) {
+ peer->flags |= OMPI_OSC_RDMA_PEER_LOCAL_BASE;
+ }
+ /* nothing more to do */
+ continue;
+ } else if (0 == temp[i].size) {
/* nothing more to do */
continue;
}
@ggouaillardet The second fix is the correct one. Another case that wasn't well tested. Need an MTT test for get from self in the dynamic case.
thanks, I will PR this for master and then v4.0.x, and add the test to the ibm test suite.
The fix has been merged into master and v4.0.x, so closing.
This was initially reported by Bart Janssens at https://www.mail-archive.com/users@lists.open-mpi.org//msg33035.html
The minimal program below crashes when run with more than one MPI task when osc/rdma is used.
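The original minimal program from the report is not included in this capture. The following is a hypothetical sketch of a reproducer, assuming the failure path described above: an MPI_Get targeting the calling rank itself on a window created with MPI_Win_create_dynamic, which is the "get from self in the dynamic case" that the second patch addresses.

```c
/* Hypothetical reproducer sketch -- not the original program from the
 * report. It exercises a get-from-self on a dynamic window, the path
 * that crashed with osc/rdma before the fix. Run with e.g.
 * mpirun -n 2 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Win win;
    MPI_Aint disp;
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* create a dynamic window and attach a local buffer to it */
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int *buf = malloc(sizeof(int));
    *buf = rank + 1;
    MPI_Win_attach(win, buf, sizeof(int));

    /* with a dynamic window, displacements are absolute addresses;
     * since the target is self, no address exchange is needed */
    MPI_Get_address(buf, &disp);

    /* get from self: target rank == own rank */
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    MPI_Get(&value, 1, MPI_INT, rank, disp, 1, MPI_INT, win);
    MPI_Win_unlock(rank, win);

    printf("rank %d got %d\n", rank, value);

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

With the fix applied, each rank should print its own buffer value (rank + 1) instead of crashing.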