openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

mm ep failed to connect to remote FIFO id : shared memory error; open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory #8511

Open jamesongithub opened 1 year ago

jamesongithub commented 1 year ago

Describe the bug

During an mpirun of the HPL benchmark, UCX errors were encountered that caused the job to fail.

The error message looks like the following:

[1662593980.908446] [slurm-slehpc15-james-hpc-pg0-4:25595:0]        mm_posix.c:207  UCX  ERROR   open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory
[1662593980.908470] [slurm-slehpc15-james-hpc-pg0-4:25595:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc0000008c00063ff: Shared memory error

This looks related to the issues reported in https://github.com/openucx/ucx/issues/4224 and https://github.com/easybuilders/easybuild/issues/756
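When a job prints many lines like the two error lines in this report, it can help to count errors per host and source location before digging in. The following is a minimal sketch, assuming the UCX log layout seen here (timestamp, `[host:pid:tid]`, `file:line`, `UCX`, level); `summarize_ucx_errors` is a hypothetical helper name and not part of UCX.

```shell
# Hypothetical helper (not part of UCX): count UCX ERROR lines per host and
# source location. Reads a saved log on stdin.
summarize_ucx_errors() {
    awk '$4 == "UCX" && $5 == "ERROR" {
             split($2, parts, ":")          # $2 looks like [host:pid:tid]
             host = substr(parts[1], 2)     # drop the leading "["
             count[host " " $3]++           # $3 is file:line, e.g. mm_posix.c:207
         }
         END { for (key in count) print count[key], key }' | sort
}

# Example using the two error lines from this report.
summarize_ucx_errors <<'EOF'
[1662593980.908446] [slurm-slehpc15-james-hpc-pg0-4:25595:0]        mm_posix.c:207  UCX  ERROR   open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory
[1662593980.908470] [slurm-slehpc15-james-hpc-pg0-4:25595:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc0000008c00063ff: Shared memory error
EOF
```

Each output line is a count followed by the host name and the `file:line` that logged the error, which makes it easier to see whether the failures are confined to a few nodes.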

Steps to Reproduce

mpirun --debug-daemons \
    --mca opal_common_ucx_verbose 9 \
    --allow-run-as-root \
    --mca btl ^tcp \
    --mca opal_common_ucx_opal_mem_hooks 1 \
    /shared/home/james/hpl-2.3-dl/bin/xhpl

Setup and versions

 cat /etc/os-*release
NAME="SLE_HPC"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise High Performance Computing 15 SP4"
ID="sle_hpc"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle_hpc:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"
VARIANT_ID="sles-hpc"

uname -a
Linux slurm-slehpc15-james-scheduler 5.14.21-150400.14.7-azure #1 SMP PREEMPT_DYNAMIC Tue Jul 12 09:32:53 UTC 2022 (00ddf73) x86_64 x86_64 x86_64 GNU/Linux
rpm -q rdma-core
rdma-core-38.1-150400.4.6.x86_64
sudo ibstatus
Infiniband device 'mlx5_0' port 1 status:
    default gid:     unknown
    base lid:    0x0
    sm lid:      0x0
    state:       4: ACTIVE
    phys state:  5: LinkUp
    rate:        40 Gb/sec (4X QDR)
    link_layer:  Ethernet

Additional information (depending on the issue)

Report bugs to http://www.open-mpi.org/community/help/

- Output of `ucx_info -d` to show transports and devices recognized by UCX

ucx_info -d
If 'ucx_info' is not a typo you can use command-not-found to lookup the package that contains it, like this:
    cnf ucx_info



- Configure result - config.log
- Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

https://gist.github.com/jamesongithub/bda88d5575aa06bedcf31255dae82b25

jamesongithub commented 1 year ago

@hoopoepg any idea?

hoopoepg commented 1 year ago

Something seems to be wrong with the proc file system. Are containers being used? Try adding the variable UCX_POSIX_USE_PROC_LINK=n to the command line.
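The failing call in the log is an `open()` of `/proc/25599/fd/35`, so one quick check is whether that path is visible from the node where the error was printed. Below is a minimal sketch, assuming a Linux `/proc` and a shell running as the job's user (or root) on that node; `check_proc_fd` is a hypothetical helper name, and the PID and descriptor are taken from the log, so they are only meaningful while that job is still running.

```shell
# Hypothetical helper: report whether /proc/<pid>/fd/<fd> exists as seen
# from the current process (same node, same PID namespace).
check_proc_fd() {
    path="/proc/$1/fd/$2"
    if [ -e "$path" ]; then
        echo "present: $path"
    else
        echo "missing: $path"
        return 1
    fi
}

# PID 25599 and fd 35 come from the error message in the report.
check_proc_fd 25599 35 || echo "peer descriptor is not visible from this shell"
```

A "missing" result for a process that is still alive would be consistent with the peer living in another PID namespace or on another node, though other causes are possible.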

jamesongithub commented 1 year ago

@hoopoepg no containers

I tried adding UCX_POSIX_USE_PROC_LINK=n but didn't see a difference.

Please see log: https://gist.github.com/jamesongithub/ca1c9618f0dd994f6bf8356147111543

hoopoepg commented 1 year ago

OK, it seems the POSIX shm transport failed to access shared memory. Could you try excluding posix from the transports? Add the variable UCX_TLS=^posix to your command line.
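With Open MPI, one common way to pass UCX variables to every rank is mpirun's `-x VAR=value` option. The sketch below only assembles and prints such a command line instead of launching anything; `ucx_mpirun_cmd` is a hypothetical helper, and the application path is the one from the reproduction steps. The values are quoted because `^` can be special in some shells.

```shell
# Hypothetical helper: print an mpirun command line that forwards each
# VAR=value argument to the ranks with Open MPI's -x option.
ucx_mpirun_cmd() {
    app=$1
    shift
    cmd="mpirun"
    for kv in "$@"; do
        cmd="$cmd -x $kv"
    done
    printf '%s\n' "$cmd $app"
}

# Exclude the posix transport, as suggested above, and keep the earlier setting.
ucx_mpirun_cmd /shared/home/james/hpl-2.3-dl/bin/xhpl 'UCX_TLS=^posix' 'UCX_POSIX_USE_PROC_LINK=n'
```

Printing the command first makes it easy to confirm that the variables are spelled as intended before running the real job; the `parser.c` INFO lines in a UCX log then show which `UCX_*` variables each process actually received.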

jamesongithub commented 1 year ago

@hoopoepg

The /proc errors are gone; now there are shmat errors:

[1665173472.252802] [slurm-slehpc15-james-hpc-pg0-12:44314:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655360) failed: Invalid argument
[1665173472.252818] [slurm-slehpc15-james-hpc-pg0-12:44314:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44311] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44313] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[1665173472.252778] [slurm-slehpc15-james-hpc-pg0-12:44312:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655360) failed: Invalid argument
[1665173472.252792] [slurm-slehpc15-james-hpc-pg0-12:44312:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44312] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[1665173472.254258] [slurm-slehpc15-james-hpc-pg0-3:44147:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254273] [slurm-slehpc15-james-hpc-pg0-3:44147:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.252575] [slurm-slehpc15-james-hpc-pg0-12:44296:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655382) failed: Invalid argument
[1665173472.252585] [slurm-slehpc15-james-hpc-pg0-12:44296:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0016: Shared memory error
[slurm-slehpc15-james-hpc-pg0-3:44147] pml_ucx.c:419  Error: ucp_ep_create(proc=121) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44314] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-3:44190] pml_ucx.c:419  Error: ucp_ep_create(proc=121) failed: Shared memory error
[1665173472.252885] [slurm-slehpc15-james-hpc-pg0-12:44313:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655360) failed: Invalid argument
[1665173472.252902] [slurm-slehpc15-james-hpc-pg0-12:44313:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[1665173472.254221] [slurm-slehpc15-james-hpc-pg0-3:44148:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(sysv/memory cma/memory dc_mlx5/mlx5_0:1); 
[slurm-slehpc15-james-hpc-pg0-12:44294] *** An error occurred in MPI_Init
[slurm-slehpc15-james-hpc-pg0-12:44294] *** reported by process [2433024001,506]
[slurm-slehpc15-james-hpc-pg0-12:44294] *** on a NULL communicator
[slurm-slehpc15-james-hpc-pg0-12:44294] *** Unknown error
[slurm-slehpc15-james-hpc-pg0-12:44294] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm-slehpc15-james-hpc-pg0-12:44294] ***    and potentially your MPI job)
[1665173472.254259] [slurm-slehpc15-james-hpc-pg0-3:44150:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(sysv/memory cma/memory dc_mlx5/mlx5_0:1); 
[1665173472.254621] [slurm-slehpc15-james-hpc-pg0-3:44186:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254633] [slurm-slehpc15-james-hpc-pg0-3:44186:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254622] [slurm-slehpc15-james-hpc-pg0-3:44187:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254636] [slurm-slehpc15-james-hpc-pg0-3:44187:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254650] [slurm-slehpc15-james-hpc-pg0-3:44185:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254667] [slurm-slehpc15-james-hpc-pg0-3:44185:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254558] [slurm-slehpc15-james-hpc-pg0-3:44188:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254576] [slurm-slehpc15-james-hpc-pg0-3:44188:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
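In the log above, each `shmat(shmid=...)` failure is followed by an `mm ep failed to connect to remote FIFO id ...` line, and the FIFO ids appear to be the same shmids written in hexadecimal. A small sketch for checking that correspondence; `shmid_to_hex` is a hypothetical helper name.

```shell
# Hypothetical helper: print a decimal SysV shmid in hex so it can be
# compared with the FIFO id reported by mm_ep.c.
shmid_to_hex() {
    printf '0x%x\n' "$1"
}

shmid_to_hex 655360   # prints 0xa0000
shmid_to_hex 655378   # prints 0xa0012
shmid_to_hex 655382   # prints 0xa0016
```

On the node that logged the error, `ipcs -m` lists the SysV segments that are visible there, which may help tell whether a given shmid exists on that node at all.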
hoopoepg commented 1 year ago

It seems there are some restrictions on using shared memory on your system, so UCX can't use this transport at all. To disable it, add the variable UCX_TLS=^sm, which should allow your application to run.

jamesongithub commented 1 year ago

With UCX_TLS=^sm I'm still having issues.

[1665438954.954469] [slurm-slehpc15-james-hpc-pg0-2:26258:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27333] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27333] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:27334] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27334] Process is bound: distance to device is 0.000000
[1665438954.955595] [slurm-slehpc15-james-hpc-pg0-2:26262:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.956864] [slurm-slehpc15-james-hpc-pg0-2:26266:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.958393] [slurm-slehpc15-james-hpc-pg0-2:26264:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:27320] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27320] Process is bound: distance to device is 0.000000
[1665438954.961159] [slurm-slehpc15-james-hpc-pg0-2:26250:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.962243] [slurm-slehpc15-james-hpc-pg0-2:26263:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27331] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27331] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27331] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component vader returned success
[1665438954.964097] [slurm-slehpc15-james-hpc-pg0-2:26265:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27330] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27330] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27330] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27342] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27342] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27342] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27337] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27337] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27337] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component vader returned success
[1665438954.970425] [slurm-slehpc15-james-hpc-pg0-2:26251:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27303] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27303] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27303] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component vader returned success
[1665438954.972518] [slurm-slehpc15-james-hpc-pg0-2:26257:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27326] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27326] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27326] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component vader returned success
[1665438954.974107] [slurm-slehpc15-james-hpc-pg0-2:26260:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.974839] [slurm-slehpc15-james-hpc-pg0-2:26259:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.975534] [slurm-slehpc15-james-hpc-pg0-2:26271:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.977762] [slurm-slehpc15-james-hpc-pg0-2:26253:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27335] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27335] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27335] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27333] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27333] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27333] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27334] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component vader returned success
[1665438954.979615] [slurm-slehpc15-james-hpc-pg0-2:26255:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.980498] [slurm-slehpc15-james-hpc-pg0-2:26254:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.981717] [slurm-slehpc15-james-hpc-pg0-2:26272:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.982201] [slurm-slehpc15-james-hpc-pg0-2:26267:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.982706] [slurm-slehpc15-james-hpc-pg0-2:26268:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.983096] [slurm-slehpc15-james-hpc-pg0-2:26269:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.984676] [slurm-slehpc15-james-hpc-pg0-2:26270:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27320] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component vader returned success
[1665438954.992201] [slurm-slehpc15-james-hpc-pg0-1:27304:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.009533] [slurm-slehpc15-james-hpc-pg0-1:27306:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.078640] [slurm-slehpc15-james-hpc-pg0-1:27308:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.106084] [slurm-slehpc15-james-hpc-pg0-1:27300:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.120960] [slurm-slehpc15-james-hpc-pg0-1:27317:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.133759] [slurm-slehpc15-james-hpc-pg0-1:27315:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.135367] [slurm-slehpc15-james-hpc-pg0-1:27314:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.138056] [slurm-slehpc15-james-hpc-pg0-1:27299:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.152223] [slurm-slehpc15-james-hpc-pg0-1:27309:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.154816] [slurm-slehpc15-james-hpc-pg0-1:27310:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.157326] [slurm-slehpc15-james-hpc-pg0-1:27318:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.166507] [slurm-slehpc15-james-hpc-pg0-1:27302:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.167333] [slurm-slehpc15-james-hpc-pg0-1:27313:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.169634] [slurm-slehpc15-james-hpc-pg0-1:27321:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.172057] [slurm-slehpc15-james-hpc-pg0-1:27327:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.172603] [slurm-slehpc15-james-hpc-pg0-1:27340:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.177517] [slurm-slehpc15-james-hpc-pg0-1:27319:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.178461] [slurm-slehpc15-james-hpc-pg0-1:27341:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.179619] [slurm-slehpc15-james-hpc-pg0-1:27338:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.181082] [slurm-slehpc15-james-hpc-pg0-1:27329:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.182425] [slurm-slehpc15-james-hpc-pg0-1:27322:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.184469] [slurm-slehpc15-james-hpc-pg0-1:27324:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.189443] [slurm-slehpc15-james-hpc-pg0-1:27328:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.190392] [slurm-slehpc15-james-hpc-pg0-1:27325:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.192031] [slurm-slehpc15-james-hpc-pg0-1:27332:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.192613] [slurm-slehpc15-james-hpc-pg0-1:27336:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.194418] [slurm-slehpc15-james-hpc-pg0-1:27323:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.195034] [slurm-slehpc15-james-hpc-pg0-1:27339:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.196726] [slurm-slehpc15-james-hpc-pg0-1:27331:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.200069] [slurm-slehpc15-james-hpc-pg0-1:27330:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.200985] [slurm-slehpc15-james-hpc-pg0-1:27342:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.202910] [slurm-slehpc15-james-hpc-pg0-1:27337:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.203561] [slurm-slehpc15-james-hpc-pg0-1:27333:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.204374] [slurm-slehpc15-james-hpc-pg0-1:27326:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.205513] [slurm-slehpc15-james-hpc-pg0-1:27303:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.205674] [slurm-slehpc15-james-hpc-pg0-1:27335:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.206344] [slurm-slehpc15-james-hpc-pg0-1:27334:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.206871] [slurm-slehpc15-james-hpc-pg0-1:27320:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.216721] [slurm-slehpc15-james-hpc-pg0-1:27314:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216712] [slurm-slehpc15-james-hpc-pg0-1:27316:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216732] [slurm-slehpc15-james-hpc-pg0-1:27301:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216715] [slurm-slehpc15-james-hpc-pg0-1:27304:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216717] [slurm-slehpc15-james-hpc-pg0-1:27306:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216713] [slurm-slehpc15-james-hpc-pg0-1:27311:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216716] [slurm-slehpc15-james-hpc-pg0-1:27312:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216764] [slurm-slehpc15-james-hpc-pg0-1:27308:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27314:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216805] [slurm-slehpc15-james-hpc-pg0-1:27316:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216815] [slurm-slehpc15-james-hpc-pg0-1:27301:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27304:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216782] [slurm-slehpc15-james-hpc-pg0-1:27299:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27306:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216771] [slurm-slehpc15-james-hpc-pg0-1:27305:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216807] [slurm-slehpc15-james-hpc-pg0-1:27311:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216774] [slurm-slehpc15-james-hpc-pg0-1:27300:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216799] [slurm-slehpc15-james-hpc-pg0-1:27317:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216812] [slurm-slehpc15-james-hpc-pg0-1:27310:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216829] [slurm-slehpc15-james-hpc-pg0-1:27318:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216863] [slurm-slehpc15-james-hpc-pg0-1:27308:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216865] [slurm-slehpc15-james-hpc-pg0-1:27299:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216833] [slurm-slehpc15-james-hpc-pg0-1:27309:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216821] [slurm-slehpc15-james-hpc-pg0-1:27302:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216855] [slurm-slehpc15-james-hpc-pg0-1:27305:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216819] [slurm-slehpc15-james-hpc-pg0-1:27315:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216856] [slurm-slehpc15-james-hpc-pg0-1:27300:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216892] [slurm-slehpc15-james-hpc-pg0-1:27310:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216910] [slurm-slehpc15-james-hpc-pg0-1:27318:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216916] [slurm-slehpc15-james-hpc-pg0-1:27309:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216903] [slurm-slehpc15-james-hpc-pg0-1:27302:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216905] [slurm-slehpc15-james-hpc-pg0-1:27315:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[1665438955.216881] [slurm-slehpc15-james-hpc-pg0-1:27317:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216878] [slurm-slehpc15-james-hpc-pg0-1:27313:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216880] [slurm-slehpc15-james-hpc-pg0-1:27321:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216969] [slurm-slehpc15-james-hpc-pg0-1:27340:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216976] [slurm-slehpc15-james-hpc-pg0-1:27329:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216974] [slurm-slehpc15-james-hpc-pg0-1:27322:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216950] [slurm-slehpc15-james-hpc-pg0-1:27336:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216969] [slurm-slehpc15-james-hpc-pg0-1:27324:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27338:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216958] [slurm-slehpc15-james-hpc-pg0-1:27313:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216988] [slurm-slehpc15-james-hpc-pg0-1:27325:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216975] [slurm-slehpc15-james-hpc-pg0-1:27339:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216956] [slurm-slehpc15-james-hpc-pg0-1:27332:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27341:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27341:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27319:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27319:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[slurm-slehpc15-james-hpc-pg0-1:27328] [[21652,1],28] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[slurm-slehpc15-james-hpc-pg0-2:26233] [[21652,1],48] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[1665438955.216943] [slurm-slehpc15-james-hpc-pg0-1:27327:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27327:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216959] [slurm-slehpc15-james-hpc-pg0-1:27321:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217002] [slurm-slehpc15-james-hpc-pg0-1:27312:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217053] [slurm-slehpc15-james-hpc-pg0-1:27340:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217054] [slurm-slehpc15-james-hpc-pg0-1:27329:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217049] [slurm-slehpc15-james-hpc-pg0-1:27322:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217025] [slurm-slehpc15-james-hpc-pg0-1:27336:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217051] [slurm-slehpc15-james-hpc-pg0-1:27324:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217053] [slurm-slehpc15-james-hpc-pg0-1:27338:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217063] [slurm-slehpc15-james-hpc-pg0-1:27331:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217144] [slurm-slehpc15-james-hpc-pg0-1:27331:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217041] [slurm-slehpc15-james-hpc-pg0-1:27337:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217117] [slurm-slehpc15-james-hpc-pg0-1:27337:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217074] [slurm-slehpc15-james-hpc-pg0-1:27325:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217156] [slurm-slehpc15-james-hpc-pg0-1:27335:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217232] [slurm-slehpc15-james-hpc-pg0-1:27335:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217088] [slurm-slehpc15-james-hpc-pg0-1:27303:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217167] [slurm-slehpc15-james-hpc-pg0-1:27303:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217042] [slurm-slehpc15-james-hpc-pg0-1:27342:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217120] [slurm-slehpc15-james-hpc-pg0-1:27342:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217049] [slurm-slehpc15-james-hpc-pg0-1:27339:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217035] [slurm-slehpc15-james-hpc-pg0-1:27332:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217069] [slurm-slehpc15-james-hpc-pg0-1:27323:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217148] [slurm-slehpc15-james-hpc-pg0-1:27323:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217069] [slurm-slehpc15-james-hpc-pg0-1:27330:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217147] [slurm-slehpc15-james-hpc-pg0-1:27330:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217106] [slurm-slehpc15-james-hpc-pg0-1:27320:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217184] [slurm-slehpc15-james-hpc-pg0-1:27320:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217096] [slurm-slehpc15-james-hpc-pg0-1:27326:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217175] [slurm-slehpc15-james-hpc-pg0-1:27326:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217134] [slurm-slehpc15-james-hpc-pg0-1:27307:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217218] [slurm-slehpc15-james-hpc-pg0-1:27307:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217105] [slurm-slehpc15-james-hpc-pg0-1:27334:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217182] [slurm-slehpc15-james-hpc-pg0-1:27334:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217082] [slurm-slehpc15-james-hpc-pg0-1:27333:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217155] [slurm-slehpc15-james-hpc-pg0-1:27333:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26235:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217517] [slurm-slehpc15-james-hpc-pg0-2:26232:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26237:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217518] [slurm-slehpc15-james-hpc-pg0-2:26234:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26234:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217517] [slurm-slehpc15-james-hpc-pg0-2:26231:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26231:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217521] [slurm-slehpc15-james-hpc-pg0-2:26230:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26230:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26236:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26236:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217523] [slurm-slehpc15-james-hpc-pg0-2:26240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26235:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26232:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217615] [slurm-slehpc15-james-hpc-pg0-2:26237:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217598] [slurm-slehpc15-james-hpc-pg0-2:26239:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217617] [slurm-slehpc15-james-hpc-pg0-2:26238:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217699] [slurm-slehpc15-james-hpc-pg0-2:26238:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217625] [slurm-slehpc15-james-hpc-pg0-2:26229:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217706] [slurm-slehpc15-james-hpc-pg0-2:26229:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217625] [slurm-slehpc15-james-hpc-pg0-2:26248:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217704] [slurm-slehpc15-james-hpc-pg0-2:26248:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217618] [slurm-slehpc15-james-hpc-pg0-2:26246:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217695] [slurm-slehpc15-james-hpc-pg0-2:26246:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217634] [slurm-slehpc15-james-hpc-pg0-2:26244:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217720] [slurm-slehpc15-james-hpc-pg0-2:26244:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217615] [slurm-slehpc15-james-hpc-pg0-2:26247:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217576] [slurm-slehpc15-james-hpc-pg0-2:26242:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217659] [slurm-slehpc15-james-hpc-pg0-2:26242:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217623] [slurm-slehpc15-james-hpc-pg0-2:26245:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217705] [slurm-slehpc15-james-hpc-pg0-2:26245:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217603] [slurm-slehpc15-james-hpc-pg0-2:26243:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217682] [slurm-slehpc15-james-hpc-pg0-2:26243:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217712] [slurm-slehpc15-james-hpc-pg0-2:26239:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217760] [slurm-slehpc15-james-hpc-pg0-2:26247:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217786] [slurm-slehpc15-james-hpc-pg0-2:26264:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217792] [slurm-slehpc15-james-hpc-pg0-2:26265:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217894] [slurm-slehpc15-james-hpc-pg0-2:26265:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26256:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26256:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217806] [slurm-slehpc15-james-hpc-pg0-2:26271:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217908] [slurm-slehpc15-james-hpc-pg0-2:26271:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217820] [slurm-slehpc15-james-hpc-pg0-2:26255:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217917] [slurm-slehpc15-james-hpc-pg0-2:26255:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217721] [slurm-slehpc15-james-hpc-pg0-2:26261:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26261:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217917] [slurm-slehpc15-james-hpc-pg0-2:26253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217826] [slurm-slehpc15-james-hpc-pg0-2:26251:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217925] [slurm-slehpc15-james-hpc-pg0-2:26251:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217720] [slurm-slehpc15-james-hpc-pg0-2:26252:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217810] [slurm-slehpc15-james-hpc-pg0-2:26252:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217798] [slurm-slehpc15-james-hpc-pg0-2:26257:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217893] [slurm-slehpc15-james-hpc-pg0-2:26257:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217750] [slurm-slehpc15-james-hpc-pg0-2:26260:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217833] [slurm-slehpc15-james-hpc-pg0-2:26260:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217725] [slurm-slehpc15-james-hpc-pg0-2:26249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217819] [slurm-slehpc15-james-hpc-pg0-2:26249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26258:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217820] [slurm-slehpc15-james-hpc-pg0-2:26258:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217727] [slurm-slehpc15-james-hpc-pg0-2:26250:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217801] [slurm-slehpc15-james-hpc-pg0-2:26250:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26266:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26266:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26263:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217918] [slurm-slehpc15-james-hpc-pg0-2:26263:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217776] [slurm-slehpc15-james-hpc-pg0-2:26262:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217885] [slurm-slehpc15-james-hpc-pg0-2:26262:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26259:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217914] [slurm-slehpc15-james-hpc-pg0-2:26259:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.218026] [slurm-slehpc15-james-hpc-pg0-2:26270:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217936] [slurm-slehpc15-james-hpc-pg0-2:26241:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218021] [slurm-slehpc15-james-hpc-pg0-2:26241:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217908] [slurm-slehpc15-james-hpc-pg0-2:26264:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217962] [slurm-slehpc15-james-hpc-pg0-2:26268:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218045] [slurm-slehpc15-james-hpc-pg0-2:26268:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217926] [slurm-slehpc15-james-hpc-pg0-2:26272:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218004] [slurm-slehpc15-james-hpc-pg0-2:26272:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217931] [slurm-slehpc15-james-hpc-pg0-2:26267:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218036] [slurm-slehpc15-james-hpc-pg0-2:26267:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217910] [slurm-slehpc15-james-hpc-pg0-2:26254:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217992] [slurm-slehpc15-james-hpc-pg0-2:26254:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217952] [slurm-slehpc15-james-hpc-pg0-2:26269:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218036] [slurm-slehpc15-james-hpc-pg0-2:26269:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.218105] [slurm-slehpc15-james-hpc-pg0-2:26270:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[slurm-slehpc15-james-hpc-pg0-1:27328] *** An error occurred in MPI_Init
[slurm-slehpc15-james-hpc-pg0-1:27328] *** reported by process [1418985473,28]
[slurm-slehpc15-james-hpc-pg0-1:27328] *** on a NULL communicator
[slurm-slehpc15-james-hpc-pg0-1:27328] *** Unknown error
[slurm-slehpc15-james-hpc-pg0-1:27328] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm-slehpc15-james-hpc-pg0-1:27328] ***    and potentially your MPI job)
[slurm-slehpc15-james-hpc-pg0-1:27285] 87 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[slurm-slehpc15-james-hpc-pg0-1:27285] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[slurm-slehpc15-james-hpc-pg0-1:27285] 87 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[slurm-slehpc15-james-hpc-pg0-1:27285] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
[slurm-slehpc15-james-hpc-pg0-1:27285] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
jamesongithub commented 1 year ago

Preferably, rather than disabling shared memory, we would adjust the system configuration. Also, if we disable UCX completely, we get a successful run.

Are these limits reasonable?

ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
hoopoepg commented 1 year ago

Hi, I don't see any issues in the ipcs -l output. We are testing UCX on a similar configuration and it works fine. As far as I can see from the logs, UCX was able to start up, but some peers selected pml cm instead of pml ucx. Can you add -mca pml ucx to the command line to force the use of UCX?

thank you
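
The mismatch described above appears in the log as lines of the form "selected pml cm, but peer ... selected pml ucx". A minimal sketch for pulling those pairs out of a saved log, assuming the mpirun output was captured to a file (the function name pml_mismatches and the file name hpl.log are hypothetical, not part of Open MPI or UCX):

```shell
# Print each distinct "<local pml> <peer pml>" pair reported by Open MPI.
# Reads log text on stdin; the message format is copied from the log above.
pml_mismatches() {
    sed -n 's/.*selected pml \([^ ,]*\), but peer .* selected pml \([^ ,]*\).*/\1 \2/p' | sort -u
}

# Hypothetical usage, assuming the job output was saved to hpl.log:
#   pml_mismatches < hpl.log
```

If this prints anything, the ranks did not agree on a PML. Forcing a single PML for all ranks, either with -mca pml ucx on the mpirun command line or with the equivalent OMPI_MCA_pml=ucx environment variable, should remove the mismatch.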

jamesongithub commented 1 year ago

Hey, with -mca pml ucx I was able to get a successful run. Here is some of the output:

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            slurm-slehpc15-james-hpc-pg0-2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4120

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   slurm-slehpc15-james-hpc-pg0-2
  Local device: mlx5_0
--------------------------------------------------------------------------
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23280] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23280] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23280] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-2:23296] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-2:23296] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23296] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23233] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23233] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23233] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23234] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23234] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23234] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23222] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23222] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23222] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23240] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23240] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23240] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23226] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23226] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23226] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23226] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23226] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23226] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23226] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23226] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23226] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
...
[1665518071.720969] [slurm-slehpc15-james-hpc-pg0-2:23292:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23215] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23215] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23215] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23276] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23276] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23276] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23237] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23237] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23272] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23272] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23272] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23237] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23237] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23243] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23243] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23243] Checking distance from this process to device=mlx5_0
...
[1665518072.196751] [slurm-slehpc15-james-hpc-pg0-1:23230:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197006] [slurm-slehpc15-james-hpc-pg0-1:23221:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197333] [slurm-slehpc15-james-hpc-pg0-1:23228:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197729] [slurm-slehpc15-james-hpc-pg0-1:23244:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.199795] [slurm-slehpc15-james-hpc-pg0-1:23216:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199775] [slurm-slehpc15-james-hpc-pg0-1:23249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199779] [slurm-slehpc15-james-hpc-pg0-1:23253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199779] [slurm-slehpc15-james-hpc-pg0-1:23240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199859] [slurm-slehpc15-james-hpc-pg0-1:23255:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199851] [slurm-slehpc15-james-hpc-pg0-1:23217:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199851] [slurm-slehpc15-james-hpc-pg0-1:23212:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199878] [slurm-slehpc15-james-hpc-pg0-1:23229:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199854] [slurm-slehpc15-james-hpc-pg0-1:23236:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199875] [slurm-slehpc15-james-hpc-pg0-1:23225:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199865] [slurm-slehpc15-james-hpc-pg0-1:23249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665518072.199829] [slurm-slehpc15-james-hpc-pg0-1:23247:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199892] [slurm-slehpc15-james-hpc-pg0-1:23253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
...
[1665523321.395622] [slurm-slehpc15-james-hpc-pg0-2:23284:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.397063] [slurm-slehpc15-james-hpc-pg0-1:23252:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.398176] [slurm-slehpc15-james-hpc-pg0-1:23220:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.400071] [slurm-slehpc15-james-hpc-pg0-2:23254:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.493806] [slurm-slehpc15-james-hpc-pg0-1:23227:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494078] [slurm-slehpc15-james-hpc-pg0-1:23237:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494303] [slurm-slehpc15-james-hpc-pg0-1:23224:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494504] [slurm-slehpc15-james-hpc-pg0-1:23243:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494491] [slurm-slehpc15-james-hpc-pg0-1:23249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494582] [slurm-slehpc15-james-hpc-pg0-1:23238:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.495015] [slurm-slehpc15-james-hpc-pg0-1:23240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.495049] [slurm-slehpc15-james-hpc-pg0-1:23250:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.497471] [slurm-slehpc15-james-hpc-pg0-1:23225:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.497906] [slurm-slehpc15-james-hpc-pg0-1:23226:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.500558] [slurm-slehpc15-james-hpc-pg0-1:23221:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
...
[slurm-slehpc15-james-hpc-pg0-2:23266] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23266] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23225] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23225] mca: base: close: unloading component ofi
...
jamesongithub commented 1 year ago

Glad we were able to get a successful run, but I would like to know how to get it working with the default parameters.

Does this last result give us an idea of what should be changed to make it work with the defaults?