Open danielap1996 opened 9 months ago
The error returns from fi_getinfo need significant improvement. In general an error like ret=-61 (No data available) means libfabric attempted to enumerate all the NICs, but did not find an acceptable provider which offered an acceptable NIC.
This error often occurs at customer sites and carries no actionable information. Usually, the cause is that the desired provider was not available on the system, or that the desired provider was unable to find an acceptable NIC to offer.
The next step is often to repeat the test with FI_LOG_LEVEL=info. However, some patches in these code paths a couple of years ago (commit f4715e8382bb90e99ba15f1388b3a41f8b9455fd) changed FI_WARN and FI_INFO calls into FI_DBG, so typical non-debug builds lack the key messages about device and provider discovery which are needed to debug this. End users and in-distro libfabric users are therefore typically stuck at this point. They must either resort to provider-specific mechanisms to debug what is happening, or locate the libfabric source and rebuild it with debug enabled, making sure not to change other options. That task is beyond a typical sysadmin using an in-distro libfabric or an ISV-provided MPI or application stack which includes libfabric.
The ideal customer-facing answer would be for provider enumeration to accumulate a set of text messages from each provider. When a provider fails to find an acceptable device, it could provide a more detailed string as to why (probably a list of strings reflecting the NICs it looked at and why it rejected them). Then, if fi_getinfo fails to find any provider, it could output (or return) a detailed message showing which providers it attempted and why each indicated it could not find a device. Such strings may be long. I've implemented logging mechanisms like this in past products. It amounted to retaining a tree of error messages, with a list per provider, outputting the tree only at the higher-level routine where the issue was "realized", and discarding the tree if at least one provider successfully found NICs.
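To make the proposal concrete, here is a minimal sketch of the "accumulate, then report only on total failure" idea. It is written in Python purely for illustration and is not libfabric code; the class name `DiagnosticTree` and its methods are hypothetical.

```python
from collections import defaultdict


class DiagnosticTree:
    """Hypothetical collector: one list of rejection reasons per provider."""

    def __init__(self):
        self._reasons = defaultdict(list)  # provider name -> list of messages
        self._succeeded = set()            # providers that found a device

    def reject(self, provider, reason):
        """Record why a provider turned down a device or a request."""
        self._reasons[provider].append(reason)

    def success(self, provider):
        """Record that a provider found at least one usable device."""
        self._succeeded.add(provider)

    def report(self):
        """Return the accumulated text, or '' if any provider succeeded."""
        if self._succeeded:
            return ""  # discard the tree: the failure was never "realized"
        lines = ["no provider found a usable device:"]
        for provider, reasons in self._reasons.items():
            lines.append(f"  {provider}:")
            lines.extend(f"    - {reason}" for reason in reasons)
        return "\n".join(lines)
```

The top-level routine would call `report()` once, after every provider has been tried, so a successful run prints nothing extra and a failed run prints every provider's reasons together.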
Running the same fabtests test with FI_LOG_LEVEL=info:
$>/opt/fabtests/bin$ ./fi_av_test -g 127.0.1.1 -n 1
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem=<not set>
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ZE not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:845162:1707847821::core:mr:ofi_default_cache_size():78<info> default cache size=1750878720
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:845162:1707847821::psm3:core:fi_psm3_ini():928<info> xxxxxxVM:pid845162: build options: VERSION=305.1010=3.5.1.1, HAVE_PSM3_SRC=1, PSM3_CUDA=0
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_NAME_SERVER=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_TAGGED_RMA=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_str():124<info> xxxxxxVM:pid845162: variable FI_PSM3_UUID=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_DELAY=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_TIMEOUT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_PROG_INTERVAL=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_str():124<info> xxxxxxVM:pid845162: variable FI_PSM3_PROG_AFFINITY=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_INJECT_SIZE=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_LOCK_LEVEL=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_LAZY_CONN=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_CONN_TIMEOUT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_DISCONNECT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_str():124<info> xxxxxxVM:pid845162: variable FI_PSM3_TAG_LAYOUT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_YIELD_MODE=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: psm3 (305.1010)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: usnic (1.0)
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable sar_threshold=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable disable_cma=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable use_dsa_sar=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable use_xpmem=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: shm (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: sm2 (120.0)
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable enable_passthru=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable buffer_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable msg_tx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable msg_rx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable cm_progress_interval=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable cq_eq_fairness=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable data_auto_progress=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_rndv_write=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable def_wait_obj=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable def_tcp_wait_obj=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_rxm (120.0)
libfabric:845162:1707847821::ofi_mrail:core:fi_param_get_():372<info> variable config=<not set>
libfabric:845162:1707847821::ofi_mrail:core:fi_param_get_():372<info> variable addr=<not set>
libfabric:845162:1707847821::ofi_mrail:core:fi_param_get_():372<info> variable addr_strc=<not set>
libfabric:845162:1707847821::ofi_mrail:core:mrail_parse_env_vars():115<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_mrail (120.0)
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable spin_count=<not set>
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable retry=<not set>
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable max_peers=<not set>
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable max_unacked=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_rxd (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: opx (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: udp (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: sockets (120.0)
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: tcp (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_perf (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_trace (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_debug (120.0)
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem=<not set>
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ZE not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_hmem (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_dmabuf_peer_mem (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_noop (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: off_coll (120.0)
libfabric:845162:1707847821::opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821::usnic:fabric:usdf_getinfo():763<trace>
libfabric:845162:1707847821::usnic:fabric:usdf_getinfo():777<warn> failed to usdf_get_devinfo, ret=-19 (No such device)
libfabric:845162:1707847821::usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821::psm3:core:psmx3_getinfo():714<info> xxxxxxVM:pid845162:
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():361<info> xxxxxxVM:pid845162: Unsupported address format
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():363<info> xxxxxxVM:pid845162: Supported: FI_ADDR_PSMX3
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():365<info> xxxxxxVM:pid845162: Supported: FI_ADDR_STR
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():367<info> xxxxxxVM:pid845162: Requested: FI_SOCKADDR
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider psm3 returned -61 (No data available)
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1181<info> Provider ofi_rxm is excluded
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1181<info> Provider ofi_rxm is excluded
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():763<trace>
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;psm3 layering
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxm
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;shm layering
libfabric:845162:1707847821:ofi_rxm:udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;sockets layering
libfabric:845162:1707847821:ofi_rxm:sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():763<trace>
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;psm3 layering
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxm
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;shm layering
libfabric:845162:1707847821:ofi_rxm:udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;sockets layering
libfabric:845162:1707847821:ofi_rxm:sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:usnic:fabric:usdf_getinfo():763<trace>
libfabric:845162:1707847821:ofi_rxd:usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1199<info> Skipping util;psm3 layering
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxm
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1199<info> Skipping util;shm layering
libfabric:845162:1707847821:ofi_rxd:udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxd:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1199<info> Skipping util;sockets layering
libfabric:845162:1707847821:ofi_rxd:sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:845162:1707847821::shm:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider shm returned -22 (Invalid argument)
libfabric:845162:1707847821::udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821::tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821::sockets:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider sockets returned -22 (Invalid argument)
libfabric:845162:1707847821::sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821::ofi_mrail:fabric:mrail_get_core_info():285<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)
libfabric:845162:1707847821::usnic:fabric:usdf_fini():1039<trace>
libfabric:845162:1707847821::psm3:core:psmx3_fini():887<info> xxxxxxVM:pid845162:
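With this much output, the per-provider verdicts are easy to lose. Below is a minimal sketch (not part of libfabric or fabtests) that tallies the `fi_getinfo: provider ... returned ...` lines from a captured log. The function name `summarize` and the regular expression are assumptions based only on the line format shown above.

```python
import re
from collections import Counter

# Matches e.g. "fi_getinfo: provider opx returned -61 (No data available)"
PATTERN = re.compile(r"fi_getinfo: provider (\S+) returned (-?\d+) \(([^)]*)\)")


def summarize(log_text):
    """Count (provider, return code, message) triples found in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            provider = match.group(1)
            code = int(match.group(2))
            message = match.group(3)
            counts[(provider, code, message)] += 1
    return counts
```

Applied to the log above, this separates the providers that returned -61 (opx, usnic, psm3 and the utility providers layered over them) from those that returned -22 (tcp, udp, shm, sm2, sockets). The -22 lines are each preceded by "FI_SOURCE set, but no node or service".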
Hi there @danielap1996 and thanks for opening the issue!
The av test takes in the address to insert (-g) as well as the source address (-s). You'll need both to properly run the test.
In addition, I recommend explicitly setting the provider (-p) that you're hoping to target to make sure the provider you want on your system is working. For example, to run with the tcp provider:
fi_av_test -g 127.0.0.1 -n 1 -p tcp -s 127.0.0.1
Let me know if you're still seeing an issue.
That response was super fast!!! It solved the issue:
$> ./fi_av_test -g 127.0.0.1 -n 1 -p tcp -s 127.0.0.1
Testing AVs on fabric 127.0.0.1/32
Testing with type = FI_AV_MAP
Running av_open_close [Test open and close AVs of varying sizes]...PASS!
Running av_good_sync [Test sync AV insert with good address]...PASS!
Running av_null_fi_addr [Test AV insert without specifying fi_addr]...skipped because: test not valid for AV type FI_AV_MAP
Running av_good_vector_async [Test async AV insert with vector of good addresses]...PASS!
Running av_zero_async [Test async insert AV insert of zero addresses]...PASS!
Running av_good_2vector_async [Test async AV inserts with two address vectors]...PASS!
Running av_insert_stages [Test AV insert at various stages]...PASS!
Testing with invalid address
Running av_bad_sync [Test sync AV insert of bad address]...PASS!
Running av_goodbad_vector_sync [Test sync AV insert of 1 good and 1 bad address]...PASS!
Running av_goodbad_vector_async [Test async AV insert with good and bad address]...PASS!
Running av_goodbad_vector_sync_err [Test AV insert of 1 good, 1 bad address using FI_SYNC_ERR]...skipped because: test not valid for AV type FI_AV_MAP
Testing with type = FI_AV_TABLE
Running av_open_close [Test open and close AVs of varying sizes]...PASS!
Running av_good_sync [Test sync AV insert with good address]...PASS!
Running av_null_fi_addr [Test AV insert without specifying fi_addr]...PASS!
Running av_good_vector_async [Test async AV insert with vector of good addresses]...PASS!
Running av_zero_async [Test async insert AV insert of zero addresses]...PASS!
Running av_good_2vector_async [Test async AV inserts with two address vectors]...PASS!
Running av_insert_stages [Test AV insert at various stages]...PASS!
Testing with invalid address
Running av_bad_sync [Test sync AV insert of bad address]...PASS!
Running av_goodbad_vector_sync [Test sync AV insert of 1 good and 1 bad address]...PASS!
Running av_goodbad_vector_async [Test async AV insert with good and bad address]...PASS!
Running av_goodbad_vector_sync_err [Test AV insert of 1 good, 1 bad address using FI_SYNC_ERR]...PASS!
Summary: all tests passed
Could you please change the test to be a bit more "friendly" to users?
Something like:
if the user doesn't give the server info, take the current server IP from the hostname -i command
if the user gives only the server IP, without a client IP, set the client IP to be the same as the server IP
Things like that.
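A rough sketch of the fallback logic being requested, in Python for brevity. This is not fabtests code: the helper name `default_addresses` and its parameters are hypothetical, and `socket.gethostbyname(socket.gethostname())` is only an approximation of what `hostname -i` prints.

```python
import socket


def default_addresses(dest=None, source=None, resolve_local=None):
    """Fill in missing addresses using the two rules suggested above.

    dest          -- server address (what -g would carry), or None
    source        -- client/source address (what -s would carry), or None
    resolve_local -- callable returning this host's IP; injectable for tests
    """
    if resolve_local is None:
        # Assumption: resolving our own hostname approximates `hostname -i`.
        def resolve_local():
            return socket.gethostbyname(socket.gethostname())

    if dest is None:
        dest = resolve_local()  # rule 1: no server info given
    if source is None:
        source = dest           # rule 2: reuse the server IP as client IP
    return dest, source
```

For example, `default_addresses("127.0.0.1")` yields the same pair of addresses that the working command line passes via -g and -s.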
@danielap1996 Yeah, there are definitely some issues with fabtests with regard to how it handles source addressing. This is because some providers handle it differently, so it's difficult to make a universal solution that is also correct with the API without forcing something that works. I'm going to change your issue title to reflect the request for clarification so we can track it and make sure we address it in the future. Thank you!
Hi, I was trying to run some of the fabtests tests, but they were failing with fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)
This is how I installed libfabric:
This is how I installed fabtests:
test run example:
fi_info -l output: