ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

FI_MR_SCALABLE failure #1387

Closed bcernohous closed 6 years ago

bcernohous commented 6 years ago

In the latest github download (or last week’s), I see FI_MR_SCALABLE in gni?

fi_domain_attr:
    domain: 0x0
    name: /sys/class/gni/kgni0
    threading: FI_THREAD_SAFE
    control_progress: FI_PROGRESS_AUTO
    data_progress: FI_PROGRESS_AUTO
    resource_mgmt: FI_RM_ENABLED
    av_type: FI_AV_TABLE
    mr_mode: [ FI_MR_SCALABLE ]

But

PE 0 [unknown] [c1-0c0s12n1] [nid00241] LIBSMA ERROR: "fi_mr_reg(smati_ofi_global.domain, 0, UINT64_MAX, FI_REMOTE_READ | FI_REMOTE_WRITE | FI_WRITE, 0, (uint64_t)NULL, 0, &mr_reg, NULL)" failed with -22: "Invalid argument"

Maybe I'm misinterpreting the bits, but I thought I was specifying version 1.1 on getinfo...

hppritcha commented 6 years ago

we need a fix for this before 1.5 is shipped.

jswaro commented 6 years ago

@bcernohous Where is this being generated from? Can you provide a test case?

bcernohous commented 6 years ago

I'll take a look today with latest commits and simplify a testcase.

jswaro commented 6 years ago

Thanks. I'll get this fixed ASAP.

bcernohous commented 6 years ago

$ cc -o getinfo getinfo.c -I/cray/css/users/libfabric-test/builds/latest-libfabric-CLE-6.X/include/ -L/cray/css/users/libfabric-test/builds/latest-libfabric-CLE-6.X/lib/ -lfabric

$ SHMEM_FI_MINOR=1 aprun -n1 ./getinfo 2>&1 | tee getinfo.1.1.out
$ SHMEM_FI_MINOR=5 aprun -n1 ./getinfo 2>&1 | tee getinfo.1.5.out

$ grep SCAL /cray/css/users/bcernohous/scratch/getinfo.1.*.out
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
/cray/css/users/bcernohous/scratch/getinfo.1.1.out: mr_mode: [ FI_MR_SCALABLE ]
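For readers following along: the runs above select the requested API minor version through the SHMEM_FI_MINOR environment variable. The contents of getinfo.c are not shown in this thread, so the following is only a minimal sketch of how a test might turn such a variable into the version argument. The EX_* macros are local stand-ins that mirror FI_VERSION, FI_MAJOR and FI_MINOR from <rdma/fabric.h> so the snippet compiles without libfabric, and parse_minor and requested_api_version are hypothetical helper names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins mirroring FI_VERSION / FI_MAJOR / FI_MINOR from
 * <rdma/fabric.h>, so this sketch compiles without libfabric. */
#define EX_VERSION(major, minor) (((uint32_t)(major) << 16) | (uint32_t)(minor))
#define EX_MAJOR(version) ((version) >> 16)
#define EX_MINOR(version) ((version) & 0xFFFF)

/* Hypothetical helper: parse a decimal minor version from a string.
 * Returns default_minor when s is NULL, empty, not a plain number,
 * or outside the 16-bit range used by the version encoding. */
static uint32_t parse_minor(const char *s, uint32_t default_minor)
{
    char *end = NULL;
    long v;

    if (s == NULL || *s == '\0')
        return default_minor;
    v = strtol(s, &end, 10);
    if (*end != '\0' || v < 0 || v > 0xFFFF)
        return default_minor;
    return (uint32_t)v;
}

/* Hypothetical helper: build the 1.x API version to request, taking the
 * minor number from the named environment variable (assumed behaviour). */
static uint32_t requested_api_version(const char *env_name, uint32_t default_minor)
{
    return EX_VERSION(1, parse_minor(getenv(env_name), default_minor));
}
```

In a real test the result of requested_api_version("SHMEM_FI_MINOR", 5) would be passed as the first argument of fi_getinfo(version, node, service, flags, hints, &info), using FI_VERSION from the libfabric header instead of the stand-in.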

jswaro commented 6 years ago

Can you provide the git hash associated with your build?

bcernohous commented 6 years ago

I originally was building my own but I reproduced that above with /css/users/libfabric-test/builds/latest-libfabric-CLE-6.X/lib/

bcernohous commented 6 years ago

Updated test. I accidentally commented out the hints in the previous version. getinfo.zip

or /cray/css/users/bcernohous/scratch/getinfo*

bcernohous commented 6 years ago

Hints and results for the latest testcase.

fi_version 1.5, version 1.1, HEADER API 1.5
HINTS: (null)(0.0):fi_info:
    caps: [ FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE ]
    mode: [ FI_LOCAL_MR ]
    addr_format: FI_FORMAT_UNSPEC
    src_addrlen: 0
    dest_addrlen: 0
    src_addr: (null)
    dest_addr: (null)
    handle: (null)
fi_tx_attr:
    caps: [ FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE ]
    mode: [ ]
    op_flags: [ FI_DELIVERY_COMPLETE ]
    msg_order: [ ]
    comp_order: [ ]
    inject_size: 0
    size: 0
    iov_limit: 0
    rma_iov_limit: 0
fi_rx_attr:
    caps: [ FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE ]
    mode: [ ]
    op_flags: [ FI_DELIVERY_COMPLETE ]
    msg_order: [ ]
    comp_order: [ ]
    total_buffered_recv: 0
    size: 0
    iov_limit: 0
fi_ep_attr:
    type: FI_EP_RDM
    protocol: FI_PROTO_UNSPEC
    protocol_version: 0
    max_msg_size: 0
    msg_prefix_size: 0
    max_order_raw_size: 0
    max_order_war_size: 0
    max_order_waw_size: 0
    mem_tag_format: 0x0000000000000000
    tx_ctx_cnt: 0
    rx_ctx_cnt: 0
    auth_key_size: 0
fi_domain_attr:
    domain: 0x0
    name: (null)
    threading: FI_THREAD_UNSPEC
    control_progress: FI_PROGRESS_AUTO
    data_progress: FI_PROGRESS_AUTO
    resource_mgmt: FI_RM_UNSPEC
    av_type: FI_AV_TABLE
    mr_mode: [ ]
    mr_key_size: 0
    cq_data_size: 0
    cq_cnt: 0
    ep_cnt: 0
    tx_ctx_cnt: 0
    rx_ctx_cnt: 0
    max_ep_tx_ctx: 0
    max_ep_rx_ctx: 0
    max_ep_stx_ctx: 0
    max_ep_srx_ctx: 0
    cntr_cnt: 0
    mr_iov_limit: 0
    caps: [ ]
    mode: [ ]
    auth_key_size: 0
    max_err_data: 0
    mr_cnt: 0
fi_fabric_attr:
    name: (null)
    prov_name: (null)
    prov_version: 0.0
    api_version: 0.0

PROVIDER[0]: gni(1.1):fi_info:
    caps: [ FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE ]
    mode: [ FI_LOCAL_MR ]
    addr_format: FI_ADDR_GNI
    src_addrlen: 48
    dest_addrlen: 48
    src_addr: fi_addr_gni://600
    dest_addr: (null)
    handle: (null)
fi_tx_attr:
    caps: [ FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE ]
    mode: [ FI_LOCAL_MR ]
    op_flags: [ FI_DELIVERY_COMPLETE ]
    msg_order: [ FI_ORDER_SAS, FI_ORDER_STRICT ]
    comp_order: [ ]
    inject_size: 64
    size: 500
    iov_limit: 8
    rma_iov_limit: 1
fi_rx_attr:
    caps: [ FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE ]
    mode: [ FI_LOCAL_MR ]
    op_flags: [ FI_DELIVERY_COMPLETE ]
    msg_order: [ FI_ORDER_SAS, FI_ORDER_STRICT ]
    comp_order: [ ]
    total_buffered_recv: 0
    size: 500
    iov_limit: 8
fi_ep_attr:
    type: FI_EP_RDM
    protocol: FI_PROTO_GNI
    protocol_version: 0
    max_msg_size: 4294967295
    msg_prefix_size: 0
    max_order_raw_size: 0
    max_order_war_size: 0
    max_order_waw_size: 0
    mem_tag_format: 0x0000000000000000
    tx_ctx_cnt: 1
    rx_ctx_cnt: 1
    auth_key_size: 0
fi_domain_attr:
    domain: 0x0
    name: /sys/class/gni/kgni0
    threading: FI_THREAD_SAFE
    control_progress: FI_PROGRESS_AUTO
    data_progress: FI_PROGRESS_AUTO
    resource_mgmt: FI_RM_ENABLED
    av_type: FI_AV_TABLE
    mr_mode: [ FI_MR_SCALABLE ]
    mr_key_size: 8
    cq_data_size: 8
    cq_cnt: 1018
    ep_cnt: -1
    tx_ctx_cnt: 122
    rx_ctx_cnt: 123
    max_ep_tx_ctx: 128
    max_ep_rx_ctx: 128
    max_ep_stx_ctx: 0
    max_ep_srx_ctx: 0
    cntr_cnt: 1018
    mr_iov_limit: 1
    caps: [ FI_REMOTE_COMM ]
    mode: [ ]
    auth_key_size: 0
    max_err_data: 0
    mr_cnt: 65535
fi_fabric_attr:
    name: gni
    prov_name: gni
    prov_version: 1.1
    api_version: 1.1

jswaro commented 6 years ago

So here's the problem. Prior to 1.5, FI_MR_UNSPEC was valid for use. However, the provider isn't catching this case properly. We set the returned info's mr_mode to the value from the hints (which is valid for 1.5, but not pre-1.5). As a result, the provider never chooses an mr_mode itself, and the final value ends up incorrect because fi_alter_domain_attr changes the mode to FI_MR_SCALABLE.
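The version-dependent handling described above could be sketched roughly as follows. This is not the actual provider fix, only an illustration under stated assumptions: resolve_mr_mode is a hypothetical helper, the EX_* names are local stand-ins for FI_VERSION and the FI_MR_* values so the snippet compiles without libfabric, and FI_MR_BASIC as the provider's default is an assumption that the thread does not confirm.

```c
#include <stdint.h>

/* Local stand-ins for libfabric definitions, so this sketch compiles
 * without <rdma/fabric.h>. EX_VERSION mirrors FI_VERSION. */
#define EX_VERSION(major, minor) (((uint32_t)(major) << 16) | (uint32_t)(minor))

enum {
    EX_MR_UNSPEC   = 0,      /* stand-in for FI_MR_UNSPEC */
    EX_MR_BASIC    = 1 << 0, /* stand-in for FI_MR_BASIC */
    EX_MR_SCALABLE = 1 << 1  /* stand-in for FI_MR_SCALABLE */
};

/* Hypothetical helper: decide which mr_mode to report in the returned info.
 *
 * For a pre-1.5 API version, an unspecified hint means "provider, pick one",
 * so the provider's own default is returned instead of echoing the hint.
 * Otherwise the hint is copied through unchanged, as the thread says is
 * valid for 1.5. */
static int resolve_mr_mode(uint32_t api_version, int hint_mr_mode,
                           int provider_default)
{
    if (api_version < EX_VERSION(1, 5) && hint_mr_mode == EX_MR_UNSPEC)
        return provider_default;
    return hint_mr_mode;
}
```

With this shape, a 1.1 request with empty mr_mode hints would leave with a mode the provider actually chose, so a later generic adjustment step has no unspecified value to rewrite.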