ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/ucx: fi_rdm_tagged_bw fi_av_insert error #10148

Open miharulidze opened 4 days ago

miharulidze commented 4 days ago

Describe the bug

I'm trying to run the fi_rdm_tagged_bw benchmark using the UCX provider. On the client side I get the following error:

[1720021258.401279] [slimfly2:990044:0] ucp_ep.c:1054 UCX ERROR the parameter params->address must not be NULL
[error] fabtests:common/shared.c:1502: fi_av_insert: number of addresses inserted = 0; number of addresses given = 1

To Reproduce

Server: fi_rdm_tagged_bw -p ucx -e rdm -I 512 -w 100 -W 1 -S 2097152 --pin-core 31

Client: fi_rdm_tagged_bw 192.168.1.11 -p ucx -e rdm -I 512 -w 100 -W 1 -S 2097152 --pin-core 31

Output

Output with FI_LOG_LEVEL=debug:

libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hook=<not set>                
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem=<not set>   
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_CUDA not supported    
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ROCR not supported              
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ZE not supported                 
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_NEURON not supported             
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>  
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor uffd   
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor memhooks
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor cuda     
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor cuda_ipc    
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor rocr 
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor rocr_ipc             
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor xpmem                           
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor ze                   
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor import          
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:990134:1720021525::core:mr:ofi_default_cache_size():83<info> default cache size=1041463168
libfabric:990134:1720021525::core:mr:ofi_monitors_init():306<info> Default memory monitor is: memhooks
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>            
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable sar_threshold=<not set>
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable tx_size=<not set>       
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable rx_size=<not set>       
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable disable_cma=<not set>     
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable use_dsa_sar=<not set>
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable use_xpmem=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: shm (121.0)                                                    
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: sm2 (121.0)
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable enable_passthru=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable buffer_size=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable tx_size=<not set>     
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable rx_size=<not set>     
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable msg_tx_size=<not set>                          
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable msg_rx_size=<not set>                      
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable cm_progress_interval=<not set>         
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable cq_eq_fairness=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable data_auto_progress=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable use_rndv_write=<not set>    
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable def_wait_obj=<not set>  
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable def_tcp_wait_obj=<not set>  
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable detect_hmem_iface=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_rxm (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: verbs (121.0)
libfabric:990134:1720021525::ofi_mrail:core:fi_param_get_():372<info> variable config=<not set>    
libfabric:990134:1720021525::ofi_mrail:core:fi_param_get_():372<info> variable addr_strc=<not set>
libfabric:990134:1720021525::ofi_mrail:core:mrail_parse_env_vars():115<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_mrail (121.0)
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():372<info> variable spin_count=<not set>
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():372<info> variable retry=<not set>
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():372<info> variable max_peers=<not set>
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():386<info> read int var max_unacked=128
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_rxd (121.0)
libfabric:990134:1720021525::efa:fabric:efa_device_construct():67<info> efadv_query_device: Unknown error -95(-95)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ucx (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: udp (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: sockets (121.0)
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: tcp (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_perf (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_trace (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_debug (121.0)
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem=<not set>
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ZE not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_hmem (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_dmabuf_peer_mem (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_noop (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: off_coll (121.0)
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable devices=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():228<info> primary detected device: mlx5_0 
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable inject_limit=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():267<info> used inject size = 1024 
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable config=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable ns_enable=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable ns_port=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable tls=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable ep_flush=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable check_req_leak=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():306<info> Loaded UCX version 1.17.0
libfabric:990134:1720021525::ucx:core:ucx_getinfo():326<warn> fi_getinfo with non-NULL node or service is unsupported
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable enable_spawn=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():356<warn> UCX: spawn support 0 
libfabric:990134:1720021525::core:core:ofi_layering_ok():1289<info> Skipping ucx;ofi_rxm layering
libfabric:990134:1720021525::core:core:ofi_layering_ok():1289<info> Skipping ucx;ofi_rxd layering
libfabric:990134:1720021525::core:core:ofi_layering_ok():1289<info> Skipping ucx;ofi_mrail layering
libfabric:990134:1720021525::ucx:core:ucx_fabric_open():160<info> 
libfabric:990134:1720021525::core:core:fi_fabric_():1577<info> Opened fabric: ucx
libfabric:990134:1720021525::core:core:fi_fabric_():1584<info> Using ucx provider 1.21, path:/home/mkhalilo/Development/pcc/libfabric/build/lib/libfabric.so.1
libfabric:990134:1720021525::ucx:core:ofi_check_rx_attr():865<info> Tx only caps ignored in Rx caps
libfabric:990134:1720021525::ucx:core:ofi_check_tx_attr():963<info> Rx only caps ignored in Tx caps
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:990134:1720021525::ucx:core:ofi_check_rx_attr():865<info> Tx only caps ignored in Rx caps
libfabric:990134:1720021525::ucx:core:ofi_check_tx_attr():963<info> Rx only caps ignored in Tx caps
libfabric:990134:1720021525::ucx:core:ucx_av_insert():151<info> Try to insert address #0, offset=0 (size=1) fi_addr=0x4165b0
[1720021525.665062] [slimfly2:990134:0]          ucp_ep.c:1054 UCX  ERROR the parameter params->address must not be NULL
[error] fabtests:common/shared.c:1502: fi_av_insert: number of addresses inserted = 0; number of addresses given = 1

Environment:

OS: Rocky Linux 9.4
UCX: v1.16.0
libfabric: master

j-xiong commented 4 days ago

The ucx provider doesn't support remote address resolution via fi_getinfo(). Please add the -b option to the command line to enable out-of-band address exchange.
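
For reference, a minimal sketch of what the adjusted invocations might look like, assuming -b is simply appended to the two commands from the report. Nothing is executed here; the script only assembles and prints the command lines. The variable names (SERVER_ADDR, COMMON_OPTS, server_cmd, client_cmd) are illustrative and not part of fabtests. The suggestion seems consistent with the "fi_getinfo with non-NULL node or service is unsupported" warning in the debug log above.

```shell
#!/bin/sh
# Sketch only: build the reproduction commands from the report with -b added.
# The commands are printed, not run, so this works without fabtests installed.

SERVER_ADDR=192.168.1.11   # server address taken from the original report
COMMON_OPTS="-p ucx -e rdm -I 512 -w 100 -W 1 -S 2097152 --pin-core 31"

# -b enables out-of-band address exchange, as suggested in the reply above.
server_cmd="fi_rdm_tagged_bw $COMMON_OPTS -b"
client_cmd="fi_rdm_tagged_bw $SERVER_ADDR $COMMON_OPTS -b"

echo "Server: $server_cmd"
echo "Client: $client_cmd"
```

Both sides need the option, since the address exchange happens between the server and the client before fi_av_insert is called.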