Open jamesongithub opened 1 year ago
@hoopoepg any idea?
Something seems to be wrong with the proc file system. Are containers in use? Try adding the variable UCX_POSIX_USE_PROC_LINK=n to the command line.
@hoopoepg No containers.
I tried adding UCX_POSIX_USE_PROC_LINK=n
but didn't see a difference.
Please see log: https://gist.github.com/jamesongithub/ca1c9618f0dd994f6bf8356147111543
OK, it seems the POSIX shm transport failed to access shared memory.
Could you try excluding posix from the transports? Add the variable UCX_TLS=^posix
to your command line.
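The suggestion above amounts to setting an environment variable for the launched processes. A minimal sketch of that in Python, assuming the job is started from a script: the helper name `run_with_ucx_tls` is hypothetical, and the child command here is only a stand-in that echoes the variable, since the actual launch line is not shown in this thread.

```python
import os
import subprocess
import sys


def run_with_ucx_tls(command, tls_value):
    """Run `command` with UCX_TLS set to `tls_value` in the child environment."""
    env = dict(os.environ)      # copy, so the parent environment is untouched
    env["UCX_TLS"] = tls_value  # e.g. "^posix" asks UCX to exclude the posix transport
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process that just prints the variable it received.
    # In practice `command` would be the MPI launch line (an assumption here).
    probe = [sys.executable, "-c", "import os; print(os.environ['UCX_TLS'])"]
    result = run_with_ucx_tls(probe, "^posix")
    print(result.stdout.strip())
```

Whether the variable also reaches ranks on other nodes depends on the launcher and its settings, so it is worth confirming from the UCX log output that each process actually saw it.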
@hoopoepg
The /proc errors are gone; now there are shmat errors:
[1665173472.252802] [slurm-slehpc15-james-hpc-pg0-12:44314:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655360) failed: Invalid argument
[1665173472.252818] [slurm-slehpc15-james-hpc-pg0-12:44314:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44311] pml_ucx.c:419 Error: ucp_ep_create(proc=502) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44313] pml_ucx.c:419 Error: ucp_ep_create(proc=502) failed: Shared memory error
[1665173472.252778] [slurm-slehpc15-james-hpc-pg0-12:44312:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655360) failed: Invalid argument
[1665173472.252792] [slurm-slehpc15-james-hpc-pg0-12:44312:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44312] pml_ucx.c:419 Error: ucp_ep_create(proc=502) failed: Shared memory error
[1665173472.254258] [slurm-slehpc15-james-hpc-pg0-3:44147:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655378) failed: Invalid argument
[1665173472.254273] [slurm-slehpc15-james-hpc-pg0-3:44147:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.252575] [slurm-slehpc15-james-hpc-pg0-12:44296:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655382) failed: Invalid argument
[1665173472.252585] [slurm-slehpc15-james-hpc-pg0-12:44296:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0016: Shared memory error
[slurm-slehpc15-james-hpc-pg0-3:44147] pml_ucx.c:419 Error: ucp_ep_create(proc=121) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44314] pml_ucx.c:419 Error: ucp_ep_create(proc=502) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-3:44190] pml_ucx.c:419 Error: ucp_ep_create(proc=121) failed: Shared memory error
[1665173472.252885] [slurm-slehpc15-james-hpc-pg0-12:44313:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655360) failed: Invalid argument
[1665173472.252902] [slurm-slehpc15-james-hpc-pg0-12:44313:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[1665173472.254221] [slurm-slehpc15-james-hpc-pg0-3:44148:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory dc_mlx5/mlx5_0:1);
[slurm-slehpc15-james-hpc-pg0-12:44294] *** An error occurred in MPI_Init
[slurm-slehpc15-james-hpc-pg0-12:44294] *** reported by process [2433024001,506]
[slurm-slehpc15-james-hpc-pg0-12:44294] *** on a NULL communicator
[slurm-slehpc15-james-hpc-pg0-12:44294] *** Unknown error
[slurm-slehpc15-james-hpc-pg0-12:44294] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm-slehpc15-james-hpc-pg0-12:44294] *** and potentially your MPI job)
[1665173472.254259] [slurm-slehpc15-james-hpc-pg0-3:44150:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory dc_mlx5/mlx5_0:1);
[1665173472.254621] [slurm-slehpc15-james-hpc-pg0-3:44186:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655378) failed: Invalid argument
[1665173472.254633] [slurm-slehpc15-james-hpc-pg0-3:44186:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254622] [slurm-slehpc15-james-hpc-pg0-3:44187:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655378) failed: Invalid argument
[1665173472.254636] [slurm-slehpc15-james-hpc-pg0-3:44187:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254650] [slurm-slehpc15-james-hpc-pg0-3:44185:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655378) failed: Invalid argument
[1665173472.254667] [slurm-slehpc15-james-hpc-pg0-3:44185:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254558] [slurm-slehpc15-james-hpc-pg0-3:44188:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655378) failed: Invalid argument
[1665173472.254576] [slurm-slehpc15-james-hpc-pg0-3:44188:0] mm_ep.c:159 UCX ERROR mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
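To see at a glance which hosts and segment IDs are affected, the shmat errors above can be tallied with a short script. This is only a sketch: the regular expression is written against the line format shown in this log, and the function name `tally_shmat_failures` is hypothetical. Note that each failing shmid matches the FIFO id on the following line (for example 655360 is 0xa0000).

```python
import re
from collections import Counter

# Matches lines such as:
# [1665173472.252802] [host:44314:0] mm_sysv.c:56 UCX ERROR shmat(shmid=655360) failed: Invalid argument
SHMAT_RE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+):\d+\]\s+\S+\s+UCX\s+ERROR\s+"
    r"shmat\(shmid=(?P<shmid>\d+)\) failed: (?P<reason>.+)$"
)


def tally_shmat_failures(lines):
    """Count shmat failures per (host, shmid) pair."""
    counts = Counter()
    for line in lines:
        match = SHMAT_RE.search(line.rstrip("\n"))
        if match:
            counts[(match.group("host"), int(match.group("shmid")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): python tally_shmat.py < job.log
    for (host, shmid), n in sorted(tally_shmat_failures(sys.stdin).items()):
        print(f"{host} shmid={shmid} (0x{shmid:x}): {n} failure(s)")
```

Grouping by host and shmid shows whether many processes on a node fail to attach to the same segment, as appears to be the case here.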
It seems there are restrictions on using shared memory on your system, so UCX can't use this transport at all.
To disable it, add the variable UCX_TLS=^sm
and that will allow your application to run.
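One way to confirm that a setting like this took effect is to check which processes report it. With UCX_LOG_LEVEL=info, the logs in this thread contain lines of the form `parser.c:1895 UCX INFO UCX_* env variables: ...`. The sketch below collects those per process; the function name `ucx_env_by_process` is hypothetical and the pattern assumes that exact line format.

```python
import re

# Matches lines such as:
# [1665438954.954469] [host:26258:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_LOG_LEVEL=info
ENV_RE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+):\d+\]\s+\S+\s+UCX\s+INFO\s+"
    r"UCX_\* env variables: (?P<vars>.+)$"
)


def ucx_env_by_process(lines):
    """Map (host, pid) to the UCX_* variables that process reported."""
    seen = {}
    for line in lines:
        match = ENV_RE.search(line.rstrip("\n"))
        if match:
            pairs = (
                item.split("=", 1)
                for item in match.group("vars").split()
                if "=" in item
            )
            seen[(match.group("host"), int(match.group("pid")))] = dict(pairs)
    return seen


if __name__ == "__main__":
    import sys

    # Usage (assumed): python ucx_env.py < job.log
    for (host, pid), env in sorted(ucx_env_by_process(sys.stdin).items()):
        print(host, pid, env.get("UCX_TLS", "<UCX_TLS not set>"))
```

Processes that never appear in the result did not log the line, which may indicate that UCX was not initialized in them or that the log level differed there.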
With UCX_TLS=^sm
I'm still having issues.
[1665438954.954469] [slurm-slehpc15-james-hpc-pg0-2:26258:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27333] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27333] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:27334] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27334] Process is bound: distance to device is 0.000000
[1665438954.955595] [slurm-slehpc15-james-hpc-pg0-2:26262:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.956864] [slurm-slehpc15-james-hpc-pg0-2:26266:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.958393] [slurm-slehpc15-james-hpc-pg0-2:26264:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:27320] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27320] Process is bound: distance to device is 0.000000
[1665438954.961159] [slurm-slehpc15-james-hpc-pg0-2:26250:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.962243] [slurm-slehpc15-james-hpc-pg0-2:26263:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27331] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27331] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27331] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component vader returned success
[1665438954.964097] [slurm-slehpc15-james-hpc-pg0-2:26265:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27330] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27330] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27330] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27342] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27342] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27342] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27337] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27337] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27337] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component vader returned success
[1665438954.970425] [slurm-slehpc15-james-hpc-pg0-2:26251:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27303] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27303] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27303] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component vader returned success
[1665438954.972518] [slurm-slehpc15-james-hpc-pg0-2:26257:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27326] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27326] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27326] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component vader returned success
[1665438954.974107] [slurm-slehpc15-james-hpc-pg0-2:26260:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.974839] [slurm-slehpc15-james-hpc-pg0-2:26259:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.975534] [slurm-slehpc15-james-hpc-pg0-2:26271:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.977762] [slurm-slehpc15-james-hpc-pg0-2:26253:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27335] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27335] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27335] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27333] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27333] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27333] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27334] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component vader returned success
[1665438954.979615] [slurm-slehpc15-james-hpc-pg0-2:26255:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.980498] [slurm-slehpc15-james-hpc-pg0-2:26254:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.981717] [slurm-slehpc15-james-hpc-pg0-2:26272:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.982201] [slurm-slehpc15-james-hpc-pg0-2:26267:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.982706] [slurm-slehpc15-james-hpc-pg0-2:26268:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.983096] [slurm-slehpc15-james-hpc-pg0-2:26269:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.984676] [slurm-slehpc15-james-hpc-pg0-2:26270:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27320] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component vader returned success
[1665438954.992201] [slurm-slehpc15-james-hpc-pg0-1:27304:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.009533] [slurm-slehpc15-james-hpc-pg0-1:27306:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.078640] [slurm-slehpc15-james-hpc-pg0-1:27308:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.106084] [slurm-slehpc15-james-hpc-pg0-1:27300:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.120960] [slurm-slehpc15-james-hpc-pg0-1:27317:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.133759] [slurm-slehpc15-james-hpc-pg0-1:27315:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.135367] [slurm-slehpc15-james-hpc-pg0-1:27314:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.138056] [slurm-slehpc15-james-hpc-pg0-1:27299:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.152223] [slurm-slehpc15-james-hpc-pg0-1:27309:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.154816] [slurm-slehpc15-james-hpc-pg0-1:27310:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.157326] [slurm-slehpc15-james-hpc-pg0-1:27318:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.166507] [slurm-slehpc15-james-hpc-pg0-1:27302:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.167333] [slurm-slehpc15-james-hpc-pg0-1:27313:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.169634] [slurm-slehpc15-james-hpc-pg0-1:27321:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.172057] [slurm-slehpc15-james-hpc-pg0-1:27327:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.172603] [slurm-slehpc15-james-hpc-pg0-1:27340:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.177517] [slurm-slehpc15-james-hpc-pg0-1:27319:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.178461] [slurm-slehpc15-james-hpc-pg0-1:27341:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.179619] [slurm-slehpc15-james-hpc-pg0-1:27338:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.181082] [slurm-slehpc15-james-hpc-pg0-1:27329:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.182425] [slurm-slehpc15-james-hpc-pg0-1:27322:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.184469] [slurm-slehpc15-james-hpc-pg0-1:27324:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.189443] [slurm-slehpc15-james-hpc-pg0-1:27328:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.190392] [slurm-slehpc15-james-hpc-pg0-1:27325:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.192031] [slurm-slehpc15-james-hpc-pg0-1:27332:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.192613] [slurm-slehpc15-james-hpc-pg0-1:27336:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.194418] [slurm-slehpc15-james-hpc-pg0-1:27323:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.195034] [slurm-slehpc15-james-hpc-pg0-1:27339:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.196726] [slurm-slehpc15-james-hpc-pg0-1:27331:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.200069] [slurm-slehpc15-james-hpc-pg0-1:27330:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.200985] [slurm-slehpc15-james-hpc-pg0-1:27342:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.202910] [slurm-slehpc15-james-hpc-pg0-1:27337:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.203561] [slurm-slehpc15-james-hpc-pg0-1:27333:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.204374] [slurm-slehpc15-james-hpc-pg0-1:27326:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.205513] [slurm-slehpc15-james-hpc-pg0-1:27303:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.205674] [slurm-slehpc15-james-hpc-pg0-1:27335:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.206344] [slurm-slehpc15-james-hpc-pg0-1:27334:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.206871] [slurm-slehpc15-james-hpc-pg0-1:27320:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.216721] [slurm-slehpc15-james-hpc-pg0-1:27314:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216712] [slurm-slehpc15-james-hpc-pg0-1:27316:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216732] [slurm-slehpc15-james-hpc-pg0-1:27301:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216715] [slurm-slehpc15-james-hpc-pg0-1:27304:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216717] [slurm-slehpc15-james-hpc-pg0-1:27306:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216713] [slurm-slehpc15-james-hpc-pg0-1:27311:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216716] [slurm-slehpc15-james-hpc-pg0-1:27312:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216764] [slurm-slehpc15-james-hpc-pg0-1:27308:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27314:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216805] [slurm-slehpc15-james-hpc-pg0-1:27316:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216815] [slurm-slehpc15-james-hpc-pg0-1:27301:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27304:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216782] [slurm-slehpc15-james-hpc-pg0-1:27299:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27306:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216771] [slurm-slehpc15-james-hpc-pg0-1:27305:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216807] [slurm-slehpc15-james-hpc-pg0-1:27311:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216774] [slurm-slehpc15-james-hpc-pg0-1:27300:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216799] [slurm-slehpc15-james-hpc-pg0-1:27317:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216812] [slurm-slehpc15-james-hpc-pg0-1:27310:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216829] [slurm-slehpc15-james-hpc-pg0-1:27318:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216863] [slurm-slehpc15-james-hpc-pg0-1:27308:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216865] [slurm-slehpc15-james-hpc-pg0-1:27299:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216833] [slurm-slehpc15-james-hpc-pg0-1:27309:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216821] [slurm-slehpc15-james-hpc-pg0-1:27302:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216855] [slurm-slehpc15-james-hpc-pg0-1:27305:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216819] [slurm-slehpc15-james-hpc-pg0-1:27315:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216856] [slurm-slehpc15-james-hpc-pg0-1:27300:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216892] [slurm-slehpc15-james-hpc-pg0-1:27310:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216910] [slurm-slehpc15-james-hpc-pg0-1:27318:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216916] [slurm-slehpc15-james-hpc-pg0-1:27309:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216903] [slurm-slehpc15-james-hpc-pg0-1:27302:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216905] [slurm-slehpc15-james-hpc-pg0-1:27315:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[1665438955.216881] [slurm-slehpc15-james-hpc-pg0-1:27317:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216878] [slurm-slehpc15-james-hpc-pg0-1:27313:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216880] [slurm-slehpc15-james-hpc-pg0-1:27321:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216969] [slurm-slehpc15-james-hpc-pg0-1:27340:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216976] [slurm-slehpc15-james-hpc-pg0-1:27329:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216974] [slurm-slehpc15-james-hpc-pg0-1:27322:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216950] [slurm-slehpc15-james-hpc-pg0-1:27336:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216969] [slurm-slehpc15-james-hpc-pg0-1:27324:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27338:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216958] [slurm-slehpc15-james-hpc-pg0-1:27313:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216988] [slurm-slehpc15-james-hpc-pg0-1:27325:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216975] [slurm-slehpc15-james-hpc-pg0-1:27339:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216956] [slurm-slehpc15-james-hpc-pg0-1:27332:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27341:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27341:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27319:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27319:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[slurm-slehpc15-james-hpc-pg0-1:27328] [[21652,1],28] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[slurm-slehpc15-james-hpc-pg0-2:26233] [[21652,1],48] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[1665438955.216943] [slurm-slehpc15-james-hpc-pg0-1:27327:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27327:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.216959] [slurm-slehpc15-james-hpc-pg0-1:27321:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217002] [slurm-slehpc15-james-hpc-pg0-1:27312:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217053] [slurm-slehpc15-james-hpc-pg0-1:27340:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217054] [slurm-slehpc15-james-hpc-pg0-1:27329:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217049] [slurm-slehpc15-james-hpc-pg0-1:27322:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217025] [slurm-slehpc15-james-hpc-pg0-1:27336:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217051] [slurm-slehpc15-james-hpc-pg0-1:27324:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217053] [slurm-slehpc15-james-hpc-pg0-1:27338:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217063] [slurm-slehpc15-james-hpc-pg0-1:27331:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217144] [slurm-slehpc15-james-hpc-pg0-1:27331:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217041] [slurm-slehpc15-james-hpc-pg0-1:27337:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217117] [slurm-slehpc15-james-hpc-pg0-1:27337:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217074] [slurm-slehpc15-james-hpc-pg0-1:27325:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217156] [slurm-slehpc15-james-hpc-pg0-1:27335:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217232] [slurm-slehpc15-james-hpc-pg0-1:27335:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217088] [slurm-slehpc15-james-hpc-pg0-1:27303:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217167] [slurm-slehpc15-james-hpc-pg0-1:27303:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217042] [slurm-slehpc15-james-hpc-pg0-1:27342:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217120] [slurm-slehpc15-james-hpc-pg0-1:27342:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217049] [slurm-slehpc15-james-hpc-pg0-1:27339:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217035] [slurm-slehpc15-james-hpc-pg0-1:27332:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217069] [slurm-slehpc15-james-hpc-pg0-1:27323:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217148] [slurm-slehpc15-james-hpc-pg0-1:27323:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217069] [slurm-slehpc15-james-hpc-pg0-1:27330:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217147] [slurm-slehpc15-james-hpc-pg0-1:27330:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217106] [slurm-slehpc15-james-hpc-pg0-1:27320:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217184] [slurm-slehpc15-james-hpc-pg0-1:27320:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217096] [slurm-slehpc15-james-hpc-pg0-1:27326:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217175] [slurm-slehpc15-james-hpc-pg0-1:27326:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217134] [slurm-slehpc15-james-hpc-pg0-1:27307:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217218] [slurm-slehpc15-james-hpc-pg0-1:27307:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217105] [slurm-slehpc15-james-hpc-pg0-1:27334:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217182] [slurm-slehpc15-james-hpc-pg0-1:27334:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217082] [slurm-slehpc15-james-hpc-pg0-1:27333:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217155] [slurm-slehpc15-james-hpc-pg0-1:27333:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26235:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217517] [slurm-slehpc15-james-hpc-pg0-2:26232:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26237:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217518] [slurm-slehpc15-james-hpc-pg0-2:26234:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26234:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217517] [slurm-slehpc15-james-hpc-pg0-2:26231:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26231:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217521] [slurm-slehpc15-james-hpc-pg0-2:26230:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26230:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26236:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26236:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217523] [slurm-slehpc15-james-hpc-pg0-2:26240:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26240:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26235:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26232:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217615] [slurm-slehpc15-james-hpc-pg0-2:26237:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217598] [slurm-slehpc15-james-hpc-pg0-2:26239:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217617] [slurm-slehpc15-james-hpc-pg0-2:26238:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217699] [slurm-slehpc15-james-hpc-pg0-2:26238:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217625] [slurm-slehpc15-james-hpc-pg0-2:26229:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217706] [slurm-slehpc15-james-hpc-pg0-2:26229:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217625] [slurm-slehpc15-james-hpc-pg0-2:26248:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217704] [slurm-slehpc15-james-hpc-pg0-2:26248:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217618] [slurm-slehpc15-james-hpc-pg0-2:26246:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217695] [slurm-slehpc15-james-hpc-pg0-2:26246:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217634] [slurm-slehpc15-james-hpc-pg0-2:26244:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217720] [slurm-slehpc15-james-hpc-pg0-2:26244:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217615] [slurm-slehpc15-james-hpc-pg0-2:26247:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217576] [slurm-slehpc15-james-hpc-pg0-2:26242:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217659] [slurm-slehpc15-james-hpc-pg0-2:26242:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217623] [slurm-slehpc15-james-hpc-pg0-2:26245:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217705] [slurm-slehpc15-james-hpc-pg0-2:26245:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217603] [slurm-slehpc15-james-hpc-pg0-2:26243:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217682] [slurm-slehpc15-james-hpc-pg0-2:26243:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217712] [slurm-slehpc15-james-hpc-pg0-2:26239:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217760] [slurm-slehpc15-james-hpc-pg0-2:26247:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217786] [slurm-slehpc15-james-hpc-pg0-2:26264:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217792] [slurm-slehpc15-james-hpc-pg0-2:26265:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217894] [slurm-slehpc15-james-hpc-pg0-2:26265:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26256:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26256:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217806] [slurm-slehpc15-james-hpc-pg0-2:26271:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217908] [slurm-slehpc15-james-hpc-pg0-2:26271:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217820] [slurm-slehpc15-james-hpc-pg0-2:26255:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217917] [slurm-slehpc15-james-hpc-pg0-2:26255:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217721] [slurm-slehpc15-james-hpc-pg0-2:26261:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26261:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26253:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217917] [slurm-slehpc15-james-hpc-pg0-2:26253:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217826] [slurm-slehpc15-james-hpc-pg0-2:26251:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217925] [slurm-slehpc15-james-hpc-pg0-2:26251:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217720] [slurm-slehpc15-james-hpc-pg0-2:26252:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217810] [slurm-slehpc15-james-hpc-pg0-2:26252:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217798] [slurm-slehpc15-james-hpc-pg0-2:26257:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217893] [slurm-slehpc15-james-hpc-pg0-2:26257:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217750] [slurm-slehpc15-james-hpc-pg0-2:26260:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217833] [slurm-slehpc15-james-hpc-pg0-2:26260:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217725] [slurm-slehpc15-james-hpc-pg0-2:26249:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217819] [slurm-slehpc15-james-hpc-pg0-2:26249:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26258:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217820] [slurm-slehpc15-james-hpc-pg0-2:26258:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217727] [slurm-slehpc15-james-hpc-pg0-2:26250:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217801] [slurm-slehpc15-james-hpc-pg0-2:26250:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26266:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26266:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26263:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217918] [slurm-slehpc15-james-hpc-pg0-2:26263:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217776] [slurm-slehpc15-james-hpc-pg0-2:26262:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217885] [slurm-slehpc15-james-hpc-pg0-2:26262:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26259:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217914] [slurm-slehpc15-james-hpc-pg0-2:26259:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.218026] [slurm-slehpc15-james-hpc-pg0-2:26270:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217936] [slurm-slehpc15-james-hpc-pg0-2:26241:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.218021] [slurm-slehpc15-james-hpc-pg0-2:26241:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217908] [slurm-slehpc15-james-hpc-pg0-2:26264:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217962] [slurm-slehpc15-james-hpc-pg0-2:26268:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.218045] [slurm-slehpc15-james-hpc-pg0-2:26268:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217926] [slurm-slehpc15-james-hpc-pg0-2:26272:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.218004] [slurm-slehpc15-james-hpc-pg0-2:26272:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217931] [slurm-slehpc15-james-hpc-pg0-2:26267:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.218036] [slurm-slehpc15-james-hpc-pg0-2:26267:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217910] [slurm-slehpc15-james-hpc-pg0-2:26254:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.217992] [slurm-slehpc15-james-hpc-pg0-2:26254:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.217952] [slurm-slehpc15-james-hpc-pg0-2:26269:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665438955.218036] [slurm-slehpc15-james-hpc-pg0-2:26269:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665438955.218105] [slurm-slehpc15-james-hpc-pg0-2:26270:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[slurm-slehpc15-james-hpc-pg0-1:27328] *** An error occurred in MPI_Init
[slurm-slehpc15-james-hpc-pg0-1:27328] *** reported by process [1418985473,28]
[slurm-slehpc15-james-hpc-pg0-1:27328] *** on a NULL communicator
[slurm-slehpc15-james-hpc-pg0-1:27328] *** Unknown error
[slurm-slehpc15-james-hpc-pg0-1:27328] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm-slehpc15-james-hpc-pg0-1:27328] *** and potentially your MPI job)
[slurm-slehpc15-james-hpc-pg0-1:27285] 87 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[slurm-slehpc15-james-hpc-pg0-1:27285] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[slurm-slehpc15-james-hpc-pg0-1:27285] 87 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[slurm-slehpc15-james-hpc-pg0-1:27285] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
[slurm-slehpc15-james-hpc-pg0-1:27285] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Preferably, instead of disabling shared memory, we would adjust the system settings. Also, if we disable UCX completely, we do get a successful run.
Are these limits reasonable?
ipcs -l
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
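To make the limits above easier to compare, here is a minimal sketch that parses `ipcs -l` output into a dictionary. The function name `parse_ipcs_limits` is hypothetical, and the sample is just the shared memory section pasted above. It only reads the numbers; it does not decide whether a limit is adequate for a given job.

```python
def parse_ipcs_limits(text):
    """Map (section, limit name) -> integer value from `ipcs -l` output."""
    limits = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("------"):
            # Section header, e.g. "------ Shared Memory Limits --------"
            section = line.strip("- ")
        elif "=" in line and section is not None:
            name, value = line.rsplit("=", 1)
            limits[(section, name.strip())] = int(value)
    return limits


sample = """\
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1
"""

limits = parse_ipcs_limits(sample)
shm = "Shared Memory Limits"
print("segments:", limits[(shm, "max number of segments")])
# The segment size limit equals 2**54 - 1 kbytes, i.e. effectively unlimited.
print("seg size is 2**54 - 1:", limits[(shm, "max seg size (kbytes)")] == 2**54 - 1)
```

Read this way, the per-segment and total size limits are effectively unbounded, and the only small finite value in the shared memory section is the segment count of 4096.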
Hi,
I don't see any issues in the ipcs -l output; we are testing UCX on a similar configuration and it works fine.
As far as I can see from the logs, UCX was able to start up, but some peers selected pml cm instead of ucx.
Could you add -mca pml ucx to the command line to force the use of UCX?
Thank you
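For anyone looking for the same symptom in a long log, here is a rough sketch that scans for the "selected pml ..., but peer ... selected pml ..." lines. The regular expression is an assumption based only on the two mismatch lines pasted in this thread, and `find_pml_mismatches` is a hypothetical helper name, not part of Open MPI or UCX.

```python
import re

# Pattern derived from lines such as:
# [host:27328] [[21652,1],28] selected pml cm, but peer [[21652,1],0] on host selected pml ucx
MISMATCH = re.compile(
    r"\[\[\d+,\d+\],(?P<rank>\d+)\] selected pml (?P<mine>\w+), "
    r"but peer \[\[\d+,\d+\],(?P<peer>\d+)\] on (?P<peer_host>\S+) "
    r"selected pml (?P<theirs>\w+)"
)


def find_pml_mismatches(log_text):
    """Return (rank, own_pml, peer_rank, peer_pml) for each mismatch line."""
    found = []
    for line in log_text.splitlines():
        m = MISMATCH.search(line)
        if m:
            found.append((int(m["rank"]), m["mine"], int(m["peer"]), m["theirs"]))
    return found


sample = """\
[slurm-slehpc15-james-hpc-pg0-1:27328] [[21652,1],28] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[slurm-slehpc15-james-hpc-pg0-2:26233] [[21652,1],48] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
"""

for rank, mine, peer, theirs in find_pml_mismatches(sample):
    print(f"rank {rank} chose pml {mine}, peer rank {peer} chose pml {theirs}")
```

If this reports any ranks, not every process picked the same PML, which is the situation where pinning the choice with -mca pml ucx is worth trying.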
Hey, with -mca pml ucx I was able to get a successful run. Here is some output:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: slurm-slehpc15-james-hpc-pg0-2
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4120
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: slurm-slehpc15-james-hpc-pg0-2
Local device: mlx5_0
--------------------------------------------------------------------------
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23280] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23280] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23280] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-2:23296] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-2:23296] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23296] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23233] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23233] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23233] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23234] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23234] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23234] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23222] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23222] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23222] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23240] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23240] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23240] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23226] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23226] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23226] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23226] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23226] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23226] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23226] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23226] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23226] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
...
[1665518071.720969] [slurm-slehpc15-james-hpc-pg0-2:23292:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23215] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23215] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23215] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23276] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23276] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23276] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23237] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23237] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23272] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23272] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23272] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23237] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23237] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23243] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23243] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23243] Checking distance from this process to device=mlx5_0
...
[1665518072.196751] [slurm-slehpc15-james-hpc-pg0-1:23230:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197006] [slurm-slehpc15-james-hpc-pg0-1:23221:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197333] [slurm-slehpc15-james-hpc-pg0-1:23228:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197729] [slurm-slehpc15-james-hpc-pg0-1:23244:0] parser.c:1895 UCX INFO UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.199795] [slurm-slehpc15-james-hpc-pg0-1:23216:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199775] [slurm-slehpc15-james-hpc-pg0-1:23249:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199779] [slurm-slehpc15-james-hpc-pg0-1:23253:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199779] [slurm-slehpc15-james-hpc-pg0-1:23240:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199859] [slurm-slehpc15-james-hpc-pg0-1:23255:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199851] [slurm-slehpc15-james-hpc-pg0-1:23217:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199851] [slurm-slehpc15-james-hpc-pg0-1:23212:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199878] [slurm-slehpc15-james-hpc-pg0-1:23229:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199854] [slurm-slehpc15-james-hpc-pg0-1:23236:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199875] [slurm-slehpc15-james-hpc-pg0-1:23225:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199865] [slurm-slehpc15-james-hpc-pg0-1:23249:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
[1665518072.199829] [slurm-slehpc15-james-hpc-pg0-1:23247:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
[1665518072.199892] [slurm-slehpc15-james-hpc-pg0-1:23253:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);
...
[1665523321.395622] [slurm-slehpc15-james-hpc-pg0-2:23284:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.397063] [slurm-slehpc15-james-hpc-pg0-1:23252:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.398176] [slurm-slehpc15-james-hpc-pg0-1:23220:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.400071] [slurm-slehpc15-james-hpc-pg0-2:23254:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.493806] [slurm-slehpc15-james-hpc-pg0-1:23227:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.494078] [slurm-slehpc15-james-hpc-pg0-1:23237:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.494303] [slurm-slehpc15-james-hpc-pg0-1:23224:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.494504] [slurm-slehpc15-james-hpc-pg0-1:23243:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.494491] [slurm-slehpc15-james-hpc-pg0-1:23249:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.494582] [slurm-slehpc15-james-hpc-pg0-1:23238:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.495015] [slurm-slehpc15-james-hpc-pg0-1:23240:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.495049] [slurm-slehpc15-james-hpc-pg0-1:23250:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.497471] [slurm-slehpc15-james-hpc-pg0-1:23225:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.497906] [slurm-slehpc15-james-hpc-pg0-1:23226:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
[1665523321.500558] [slurm-slehpc15-james-hpc-pg0-1:23221:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);
...
[slurm-slehpc15-james-hpc-pg0-2:23266] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23266] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23225] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23225] mca: base: close: unloading component ofi
...
Glad we were able to get a successful run, but I would like to know how to get it working with the default parameters.
Does this last result give us an idea of what should be changed to work with the defaults?
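For anyone comparing runs, here is a rough sketch of how the `ep_cfg` lines in a log like the one above could be tallied to see which transports UCX actually picked. The function name, the regexes, and the list of shared memory transport names (what I assume `sm` covers) are my own assumptions, not anything taken from UCX itself.

```python
import re
from collections import defaultdict

# Matches UCX INFO lines of the form seen in the log above, e.g.
# [ts] [host:pid:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
EP_CFG_RE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+):\d+\]\s+\S+\s+UCX\s+INFO\s+"
    r"ep_cfg\[(?P<idx>\d+)\]:\s+(?P<body>.*)"
)
# A lane is printed as transport/device, e.g. dc_mlx5/mlx5_0:1
LANE_RE = re.compile(r"(\w+)/([\w:]+)")

# Assumption: transport names treated as "shared memory" for this check.
SHM_TRANSPORTS = {"posix", "sysv", "cma", "knem", "xpmem"}


def transports_by_cfg(lines):
    """Map each ep_cfg index to the set of transport names seen in the log."""
    seen = defaultdict(set)
    for line in lines:
        match = EP_CFG_RE.search(line)
        if match is None:
            continue  # not an ep_cfg line (MCA output, "...", etc.)
        for transport, _device in LANE_RE.findall(match.group("body")):
            seen[int(match.group("idx"))].add(transport)
    return dict(seen)


# Three lines copied from the log above.
sample = [
    "[1665518072.199795] [slurm-slehpc15-james-hpc-pg0-1:23216:0] ucp_worker.c:1777 UCX INFO ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);",
    "[1665518072.199865] [slurm-slehpc15-james-hpc-pg0-1:23249:0] ucp_worker.c:1777 UCX INFO ep_cfg[1]: tag(dc_mlx5/mlx5_0:1);",
    "[1665523321.395622] [slurm-slehpc15-james-hpc-pg0-2:23284:0] ucp_worker.c:1777 UCX INFO ep_cfg[2]: tag(dc_mlx5/mlx5_0:1);",
]

result = transports_by_cfg(sample)
shm_in_use = set().union(*result.values()) & SHM_TRANSPORTS
print(result)
print("shared memory transports in use:", sorted(shm_in_use))
```

Running the same function over the log of a default-parameter run should show where the shared memory transports come back in, which might help narrow down what differs from the `UCX_TLS=^sm` run.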
Describe the bug
During an mpirun of the HPL benchmark, UCX errors were encountered that caused the job to fail.
The error message looks like the following:
Looks like a side issue that was reported in https://github.com/openucx/ucx/issues/4224 as well as https://github.com/easybuilders/easybuild/issues/756
Steps to Reproduce
UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
Any UCX environment variables used
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ibstat or ibv_devinfo -vv command
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
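In case it helps with gathering the setup details listed above on each node, here is a minimal sketch that runs a few of those commands and notes any that are not on PATH instead of failing. The command list is taken from the template; the helper names and the 30 second timeout are hypothetical choices of mine.

```python
import shutil
import subprocess

# Commands taken from the issue template; any of them may be absent on a node.
COMMANDS = [
    ["cat", "/etc/issue"],
    ["uname", "-a"],
    ["ucx_info", "-v"],
    ["ucx_info", "-d"],
    ["rpm", "-q", "rdma-core"],
    ["ibstat"],
    ["ibv_devinfo", "-vv"],
]


def run_if_available(cmd, timeout=30):
    """Run cmd and return its output, or a short note if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        return f"not found: {cmd[0]}"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed: {exc}"
    # Keep stderr as well, since some tools report problems only there.
    return (proc.stdout + proc.stderr).strip()


def collect(commands=COMMANDS):
    """Return a dict mapping each command line to its captured output."""
    return {" ".join(cmd): run_if_available(cmd) for cmd in commands}


if __name__ == "__main__":
    for name, output in collect().items():
        print(f"$ {name}\n{output}\n")
```

A "not found" entry for `ucx_info` would suggest the UCX bin directory is not on PATH for that shell, so the tool may need to be run from the UCX install prefix instead.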
Additional information (depending on the issue)
Report bugs to http://www.open-mpi.org/community/help/
ucx_info -d
If 'ucx_info' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf ucx_info