Closed by Lightup1 1 year ago
What happens if you set transpose_method = Transpositions.Alltoallv()? (See here for details.)
It hangs, with this output:
rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
data size:(5120, 32, 32)
Start data allocationg
[1656149516.868709] [gpu53:309102:0] cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 10485760, error message Bad address
[1656149516.868708] [gpu53:309100:0] cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 10485760, error message Bad address
[1656149516.894273] [gpu53:309103:0] cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 10485760, error message Bad address
[1656149516.894271] [gpu53:309101:0] cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 10485760, error message Bad address
Then the job was cancelled when it reached the time limit I set.
Have you tried the hints over at the MPI docs? In particular:
export JULIA_CUDA_MEMORY_POOL=none
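Before judging whether the hint helps, it can be worth confirming that the setting actually reaches the processes started from the job script. The following is only a minimal sketch, assuming a bash job script; the variable name `seen` is made up for illustration, and the check only shows that child processes inherit the value, not that CUDA.jl acts on it.

```shell
#!/bin/bash
# Export the variable before any Julia process is started.
export JULIA_CUDA_MEMORY_POOL=none

# Exported variables are inherited by child processes; ask a child shell
# what it receives ("unset" is printed if the variable did not arrive).
seen=$(sh -c 'printf "%s" "${JULIA_CUDA_MEMORY_POOL:-unset}"')
echo "child process sees JULIA_CUDA_MEMORY_POOL=$seen"
```

On a single node the ranks inherit the environment of the job script. For multi-node runs with Open MPI, passing `-x JULIA_CUDA_MEMORY_POOL` to mpiexec forwards the variable to remote ranks.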
That doesn't seem to work.
Without export JULIA_CUDA_MEMORY_POOL=none:
rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(5120, 32, 32)
Start data allocationg
[1656470804.862083] [gpu13:416431:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x60ac00400, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470804.862186] [gpu13:416431:0] ucp_mm.c:110 UCX ERROR failed to register address 0x60ac00400 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470804.862194] [gpu13:416431:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x60ac00400 len 10485760: Input/output error
[1656470804.925721] [gpu13:416430:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x60ac00400, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470804.925771] [gpu13:416430:0] ucp_mm.c:110 UCX ERROR failed to register address 0x60ac00400 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470804.925780] [gpu13:416430:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x60ac00400 len 10485760: Input/output error
[1656470804.930835] [gpu13:416429:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x60ac00400, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470804.930879] [gpu13:416429:0] ucp_mm.c:110 UCX ERROR failed to register address 0x60ac00400 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470804.930902] [gpu13:416429:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x60ac00400 len 10485760: Input/output error
[1656470805.223480] [gpu13:416428:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x60ac00400, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470805.223550] [gpu13:416428:0] ucp_mm.c:110 UCX ERROR failed to register address 0x60ac00400 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470805.223559] [gpu13:416428:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x60ac00400 len 10485760: Input/output error
With export JULIA_CUDA_MEMORY_POOL=none:
rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(5120, 32, 32)
Start data allocationg
[1656470932.716062] [gpu21:57697:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x2b647b400000, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470932.716119] [gpu21:57697:0] ucp_mm.c:110 UCX ERROR failed to register address 0x2b647b400000 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470932.716127] [gpu21:57697:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x2b647b400000 len 10485760: Input/output error
[1656470932.735698] [gpu21:57698:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x2b338b400000, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470932.735749] [gpu21:57698:0] ucp_mm.c:110 UCX ERROR failed to register address 0x2b338b400000 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470932.735758] [gpu21:57698:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x2b338b400000 len 10485760: Input/output error
[1656470932.746492] [gpu21:57700:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x2b4545400000, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470932.746538] [gpu21:57700:0] ucp_mm.c:110 UCX ERROR failed to register address 0x2b4545400000 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470932.746561] [gpu21:57700:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x2b4545400000 len 10485760: Input/output error
[1656470932.774987] [gpu21:57699:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x2ba6eb400000, length=10485760, access=0xf) failed: Cannot allocate memory
[1656470932.775034] [gpu21:57699:0] ucp_mm.c:110 UCX ERROR failed to register address 0x2ba6eb400000 length 10485760 on md[5]=ib/mlx5_0: Input/output error
[1656470932.775043] [gpu21:57699:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x80 address 0x2ba6eb400000 len 10485760: Input/output error
Batch file:
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH -J gpuN1p4TCP_CUDAPOOL_none # N nodes p process t threads
#SBATCH --cpus-per-task=7 # th2 qiming gpu 28 cpus per node (adjust as you need)
#SBATCH --time=00:03:00 # hours:minutes:seconds
#SBATCH -p gpu_v100
#SBATCH --output=slurm-%x-%j.out
#SBATCH --error=slurm-%x-%j.err
source /GPUFS/app/MPI/openmpi/3.1.4-gcc45-cuda11.0/env.sh
export LD_LIBRARY_PATH=$HOME/.julia-1.7.3/lib/julia:$LD_LIBRARY_PATH
export JULIA_CUDA_MEMORY_POOL=none
export JULIA_CUDA_USE_BINARYBUILDER=false
julia --project -e 'using Pkg; Pkg.instantiate()'
julia --project -e 'using Pkg; Pkg.precompile()'
srun hostname>hostlist
mpiexecjl --project --mca btl tcp,self,vader --mca btl_tcp_if_include ib0 --machinefile hostlist -np $SLURM_NTASKS julia -t7 gpubench.jl
After I installed Open MPI, UCX, and CUDA myself, the error disappeared. It seems that our cluster administrators did not build Open MPI with UCX. I'll close the issue.
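For anyone who runs into the same thing, a quick way to see how the MPI on your PATH was built is to query it directly. This is a rough sketch, not a definitive diagnostic: it assumes Open MPI's `ompi_info` and UCX's `ucx_info` are the relevant tools, and `check_tool` is a helper name made up for this example.

```shell
#!/bin/bash
# check_tool is a hypothetical helper: prints "found" or "missing"
# depending on whether the given command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

if [ "$(check_tool ompi_info)" = "found" ]; then
    # List any UCX components compiled into this Open MPI build.
    ompi_info | grep -i ucx || echo "ompi_info lists no UCX components"
    # Open MPI reports whether it was built with CUDA support via this parameter.
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value \
        || echo "no CUDA support parameter reported"
else
    echo "ompi_info not on PATH; load the MPI module or env.sh first"
fi

if [ "$(check_tool ucx_info)" = "found" ]; then
    # Print the UCX version and the flags it was configured with.
    ucx_info -v
else
    echo "ucx_info not on PATH"
fi
```

Running this once with the system MPI loaded and once with a self-built one should make the difference between the two installations visible.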
For a single node with 4 Tesla V100 GPUs, a system MPI (openmpi/3.1.4-gcc45-cuda11.0, built without gdrcopy) is used.
bench.sh:
gpuben.jl:
err:
out: