ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

GNI: fi_endpoint() segfaults when using only FI_MSG capability #1419

Closed rhalkyard closed 6 years ago

rhalkyard commented 6 years ago

If I use a basic set of hints that selects the gni provider, and set caps = FI_MSG without FI_RMA, I get a segfault when calling fi_endpoint(). The stack trace is shown below.

Adding FI_RMA to caps works around the issue, but as far as I can tell, it should be entirely valid to only request FI_MSG.

  _start@start.S:118
  __libc_start_main@0x2aaace7b86e4
  main@ep_segfault.c:49
  fi_endpoint$$CFE_id_ef787706_main@fi_endpoint.h:156
  gnix_ep_open@gnix_ep.c:2400
  _gnix_ep_nic_init@gnix_ep.c:2183
  _gnix_cm_nic_alloc@gnix_cm_nic.c:645
  gnix_nic_alloc@gnix_nic.c:1233
  _gnix_mbox_allocator_create@gnix_mbox_allocator.c:619
  __create_slab@gnix_mbox_allocator.c:364
  _gnix_get_next_reserved_key@gnix_auth_key.c:141
  _gnix_find_first_zero_bit@gnix_bitmap.c:138

git bisecting on libfabric points the finger at 9316e133 (in particular, the MR mode getting changed in fi_alter_domain_attr()), but I'm not sure whether that change in behavior is the root cause, or whether it's just revealing an underlying provider issue, so I thought I would report it here first – the issue does not show up under the verbs or sockets providers.

A minimal reproducer is below:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>
#include <stdio.h>
#include <assert.h>

int main(int argc, char ** argv) {
  struct fi_info *hints, *fi, *fPtr;
  struct fid_fabric *fab;
  struct fid_domain *dom;
  struct fid_ep *ep;
  struct fi_context2 ctx;
  int ret, i = 0;  /* i must start at 0; it is printed and incremented in the loop below */

  hints = fi_allocinfo();

  hints->caps = FI_MSG; /* FI_MSG | FI_RMA works just fine */
  hints->mode = FI_RX_CQ_DATA | FI_NOTIFY_FLAGS_ONLY | FI_RESTRICTED_COMP | FI_CONTEXT | FI_CONTEXT2;
  hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY;

  ret = fi_getinfo(FI_VERSION(1,5), NULL, NULL, 0, hints, &fi);
  assert(ret == FI_SUCCESS);

  fPtr = fi;
  while (fPtr != NULL) {
    printf("%d: Fabric %s %s Provider %s\n", i, fPtr->domain_attr->name,
           fPtr->fabric_attr->name,
           fPtr->fabric_attr->prov_name);
    i++;
    fPtr = fPtr->next;
  }

  printf("\nUsing Fabric %s %s Provider %s\n", fi->domain_attr->name,
         fi->fabric_attr->name,
         fi->fabric_attr->prov_name);
  puts(fi_tostr(fi, FI_TYPE_INFO));

  ret = fi_fabric(fi->fabric_attr, &fab, &ctx);
  printf("fi_fabric() returned %s\n", fi_strerror(ret));
  assert(ret == FI_SUCCESS);

  ret = fi_domain(fab, fi, &dom, &ctx);
  printf("fi_domain() returned %s\n", fi_strerror(ret));
  assert(ret == FI_SUCCESS);

  ret = fi_endpoint(dom, fi, &ep, &ctx);   /* Segfault here */
  printf("fi_endpoint() returned %s\n", fi_strerror(ret));
  assert(ret == FI_SUCCESS);

  printf("All OK!\n");
  return 0;
}

rhalkyard commented 6 years ago

Addendum: the segfault occurred on a CLE 6.0 system. Running the same code on CLE 5.2, I get no segfault, but fi_endpoint() returns 18446744073709551600, which doesn't quite seem right to me either.

hppritcha commented 6 years ago

Maybe we'll get lucky and ofiwg/libfabric#3469 will help. I'll check that out. Thanks for reporting and providing a test case.

hppritcha commented 6 years ago

I don't see this segfault on cori. Are you using the upstream libfabric or the one from this github repo? Also, which configure options are you using?

rhalkyard commented 6 years ago

Interesting. This is on crystal here at Cray, so it's possible there might be some discrepancy in the underlying software. For the record, my configure options are --enable-gni=yes --enable-debug cross_compiling=yes, and I get the segfault with both this repo's libfabric and the upstream version.

Looks like we did indeed get lucky with ofiwg/libfabric#3469. Applied that patch to this repo's libfabric, and it seems that the segfault goes away. I think we can consider this closed once that PR gets merged.

hppritcha commented 6 years ago

@rhalkyard can you verify with upstream master that this problem is fixed?

rhalkyard commented 6 years ago

The gni provider in upstream master (and here) seems to be broken right now – I get a compile error while building. However, if I check out a few commits back (b86c958a4b615239ff7258c3a601b727455534a4 seems to be the commit that breaks the build), I no longer see the segfault when running my reproducer under the same conditions.