open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.19k stars 865 forks source link

master heterogeneous build broken #403

Closed jsquyres closed 9 years ago

jsquyres commented 9 years ago

As reported by @siegmargross in http://www.open-mpi.org/community/lists/users/2015/02/26343.php, --enable-heterogeneous builds on master are broken.

Here's the first error I run into on Linux (Siegmar first runs into problems in pml/cm):

  CC       btl_openib_put.lo
btl_openib_put.c: In function 'mca_btl_openib_put':
btl_openib_put.c:87:27: error: dereferencing pointer to incomplete type
     if ((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
                           ^
btl_openib_put.c:87:52: error: 'OPAL_ARCH_ISBIGENDIAN' undeclared (first use in this function)
     if ((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN)
                                                    ^
btl_openib_put.c:87:52: note: each undeclared identifier is reported only once for each function it appears in
bosilca commented 9 years ago

The "util/arch.h" include is missing.

On Thu, Feb 19, 2015 at 8:52 AM, Jeff Squyres notifications@github.com wrote:

As reported by @siegmargross https://github.com/siegmargross in http://www.open-mpi.org/community/lists/users/2015/02/26343.php, --enable-heterogeneous builds on master are broken.

Here's the first error I run into on Linux (Siegmar first runs into problems in pml/cm):

CC btl_openib_put.lo btl_openib_put.c: In function 'mca_btl_openib_put': btl_openib_put.c:87:27: error: dereferencing pointer to incomplete type if ((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN) ^ btl_openib_put.c:87:52: error: 'OPAL_ARCH_ISBIGENDIAN' undeclared (first use in this function) if ((ep->endpoint_proc->proc_opal->proc_arch & OPAL_ARCH_ISBIGENDIAN) ^ btl_openib_put.c:87:52: note: each undeclared identifier is reported only once for each function it appears in

— Reply to this email directly or view it on GitHub https://github.com/open-mpi/ompi/issues/403.

jsquyres commented 9 years ago

I'm quite sure Mellanox doesn't care about openib any more, but I'll ping @jladd-mlnx and @miked-mellanox just to make sure they're aware.

@hjelmn @hppritcha I think you guys are shepherding the openib BTL these days -- can you have a look?

ggouaillardet commented 9 years ago

i fix the issue related to btl/openib in open-mpi/ompi@f43b5b46ee745e27ead0e0cbe95fff2a2f5c1c9f

there is an other issue with pml/ob1 i am working on FWIW, here is where i am :

diff --git a/ompi/mca/pml/ob1/pml_ob1.c b/ompi/mca/pml/ob1/pml_ob1.c
index c7eaef5..701b46c 100644
--- a/ompi/mca/pml/ob1/pml_ob1.c
+++ b/ompi/mca/pml/ob1/pml_ob1.c
@@ -16,7 +16,9 @@
  * Copyright (c) 2011      Sandia National Laboratories. All rights reserved.
  * Copyright (c) 2011-2015 Los Alamos National Security, LLC. All rights
  *                         reserved.
- * Copyright (c) 2012 Cisco Systems, Inc.  All rights reserved.
+ * Copyright (c) 2012      Cisco Systems, Inc.  All rights reserved.
+ * Copyright (c) 2015      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -661,7 +663,8 @@ int mca_pml_ob1_send_fin( ompi_proc_t* proc,
     mca_pml_ob1_fin_hdr_prepare ((mca_pml_ob1_fin_hdr_t *) fin->des_segments->seg_addr.pval,
                                  0, hdr_frag.lval, status ? status : (int64_t) rdma_size);

-    ob1_hdr_hton(hdr, MCA_PML_OB1_HDR_TYPE_FIN, proc);
+    /* FIXME it compiles ... but does it work ? */
+    ob1_hdr_hton((mca_pml_ob1_fin_hdr_t *)fin, MCA_PML_OB1_HDR_TYPE_FIN, proc);

     /* queue request */
     rc = mca_bml_base_send( bml_btl, fin, MCA_PML_OB1_HDR_TYPE_FIN );
diff --git a/ompi/mca/pml/ob1/pml_ob1_hdr.h b/ompi/mca/pml/ob1/pml_ob1_hdr.h
index e53f4af..9d7d90f 100644
--- a/ompi/mca/pml/ob1/pml_ob1_hdr.h
+++ b/ompi/mca/pml/ob1/pml_ob1_hdr.h
@@ -13,6 +13,8 @@
  * Copyright (c) 2009      IBM Corporation.  All rights reserved.
  * Copyright (c) 2012-2014 Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2015      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -198,17 +200,21 @@ static inline void mca_pml_ob1_rget_hdr_prepare (mca_pml_ob1_rget_hdr_t *hdr, ui
     memcpy (hdr + 1, local_handle, local_handle_size);
 }

+/* FIXME
+ * need to convert (h).hdr_frag
+ * can we simply assume this is a uint64_t ?
+ */
 #define MCA_PML_OB1_RGET_HDR_NTOH(h)                    \
     do {                                                \
         MCA_PML_OB1_RNDV_HDR_NTOH((h).hdr_rndv);        \
-        (h).hdr_seg_cnt = ntohl((h).hdr_seg_cnt);       \
+        (h).hdr_frag.lval = ntoh64((h).hdr_frag.lval);  \
         (h).hdr_src_ptr = ntoh64((h).hdr_src_ptr);      \
     } while (0)

 #define MCA_PML_OB1_RGET_HDR_HTON(h)                    \
     do {                                                \
         MCA_PML_OB1_RNDV_HDR_HTON((h).hdr_rndv);        \
-        (h).hdr_seg_cnt = htonl((h).hdr_seg_cnt);       \
+        (h).hdr_frag.lval = hton64((h).hdr_frag.lval);  \
         (h).hdr_src_ptr = hton64((h).hdr_src_ptr);      \
     } while (0) 

@@ -351,10 +357,16 @@ static inline void mca_pml_ob1_rdma_hdr_prepare (mca_pml_ob1_rdma_hdr_t *hdr, ui
     memcpy (hdr + 1, local_handle, local_handle_size);
 }

+/* FIXME
+ * need to convert (h).hdr_req, (h).hdr_frag and (h).hdr_recv_req
+ * can we simply assume these are uint64_t ?
+ */
 #define MCA_PML_OB1_RDMA_HDR_NTOH(h)                       \
     do {                                                   \
         MCA_PML_OB1_COMMON_HDR_NTOH((h).hdr_common);       \
-        (h).hdr_seg_cnt = ntohl((h).hdr_seg_cnt);          \
+        (h).hdr_req.lval = ntoh64((h).hdr_req.lval);       \
+        (h).hdr_frag.lval = ntoh64((h).hdr_frag.lval);     \
+        (h).hdr_recv_req.lval = ntoh64((h).hdr_recv_req.lval); \
         (h).hdr_rdma_offset = ntoh64((h).hdr_rdma_offset); \
         (h).hdr_dst_ptr = ntoh64((h).hdr_dst_ptr);         \
         (h).hdr_dst_size = ntoh64((h).hdr_dst_size);       \
@@ -363,7 +375,9 @@ static inline void mca_pml_ob1_rdma_hdr_prepare (mca_pml_ob1_rdma_hdr_t *hdr, ui
 #define MCA_PML_OB1_RDMA_HDR_HTON(h)                       \
     do {                                                   \
         MCA_PML_OB1_COMMON_HDR_HTON((h).hdr_common);       \
-        (h).hdr_seg_cnt = htonl((h).hdr_seg_cnt);          \
+        (h).hdr_req.lval = hton64((h).hdr_req.lval);       \
+        (h).hdr_frag.lval = hton64((h).hdr_frag.lval);     \
+        (h).hdr_recv_req.lval = hton64((h).hdr_recv_req.lval); \
         (h).hdr_rdma_offset = hton64((h).hdr_rdma_offset); \
         (h).hdr_dst_ptr = hton64((h).hdr_dst_ptr);         \
         (h).hdr_dst_size = hton64((h).hdr_dst_size);       \
hjelmn commented 9 years ago

@ggouaillardet Looks good to me so far

bosilca commented 9 years ago

@ggouaillardet https://github.com/ggouaillardet The change in pml_ob1.c seems wrong to me. The header that is prepared is ((mca_pml_ob1_fin_hdr_t ) fin->des_segments->seg_addr.pval) and not (mca_pml_ob1_fin_hdr_t )fin. "fin" is a mca_btl_base_descriptor_t which has nothing to do with a header.

On Wed, Mar 4, 2015 at 12:09 PM, Nathan Hjelm notifications@github.com wrote:

@ggouaillardet https://github.com/ggouaillardet Looks good to me so far

— Reply to this email directly or view it on GitHub https://github.com/open-mpi/ompi/issues/403#issuecomment-77199995.

hjelmn commented 9 years ago

@bosilca Ah, yes. Good catch. That line is clearly wrong.

hjelmn commented 9 years ago

@ggouaillardet You are byte-swapping some stuff that shouldn't be byte-swapped. I have a fix ready to go that I can commit.

hjelmn commented 9 years ago

ob1 fixed in d251fa15253a0270b54bfeeb8258e695e38bd65b

bosilca commented 9 years ago

Can you please also remote it from the comment?

Thanks, George.

On Apr 20, 2015, at 11:27 , Nathan Hjelm notifications@github.com wrote:

ob1 fixed in d251fa1 https://github.com/open-mpi/ompi/commit/d251fa15253a0270b54bfeeb8258e695e38bd65b — Reply to this email directly or view it on GitHub https://github.com/open-mpi/ompi/issues/403#issuecomment-94483576.

jsquyres commented 9 years ago

@siegmargross Can you give master a whirl and see if the problem is fixed for you now? Thanks.

jsquyres commented 9 years ago

This issue is now likely a (at least effectively) a duplicate of #540. So we might as well close one of them -- I'm arbitrarily choosing to close the older one, because --enable-heterogeneous builds again on the master; we just need a confirmation that it actually works (which can be done on #540).