lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

issues with multi-node MG on Kepler release/1.0.x branch #895

Closed: kostrzewa closed this issue 4 years ago

kostrzewa commented 5 years ago

If you find some time, could I ask for help with diagnosing what's happening here?

MG level 0 (GPU): ERROR: cuStreamSynchronize returned error CUDA_ERROR_LAUNCH_FAILED (rank 2, host lnode02, /qbigwork2/bartek/code/quda_1.0.x/lib/quda_cuda_api.cpp:278 in qudaStreamSynchronize())
MG level 0 (GPU):        last kernel called was (name=N4quda6WilsonIsLi4ELi3ENS_9WilsonArgIsLi3EL21QudaReconstructType_s18EEEEE,volume=16x32x16x16,aux=policy_kernel=interior,comm=0011,commDim=0011,xpay,dagger)

I will attach the full log below. I've tried different combinations of GDR and P2P settings; when GDR is disabled, I still see the problems reported in #798, so the two issues might be related.
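
For reference, the GDR and P2P settings are switched purely via environment variables; schematically, the combinations were tried roughly like this (a sketch only, with exe and ARGS standing for the test binary and its options as in the job scripts further down, and the output file names just placeholders):

# sketch: toggling GPUDirect RDMA and peer-to-peer transfers between runs
for gdr in 0 1; do
  for p2p in 0 3; do
    QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} \
      srun ${exe} ${ARGS} 2>&1 | tee run_gdr${gdr}_p2p${p2p}.out
  done
done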

kostrzewa commented 5 years ago

production_cA211a.30.32_0000_gdr1_p2p1.out.txt

kostrzewa commented 5 years ago

And here is the same run without GDR (and without the synchronisation problem), but with the issues of #798: GCR aborts due to increases in the residual norm.

production_cA211a.30.32_0000_gdr0_p2p3.out.head39805.txt

kostrzewa commented 5 years ago

I think this failure with GDR might be related to the fact that, with the driver shipped with the latest CUDA 9.2 series, we are unable to load the nv_peer_mem module (even though recompiling it works fine).

kostrzewa commented 5 years ago

I think this failure with GDR might be related to the fact that, with the driver shipped with the latest CUDA 9.2 series, we are unable to load the nv_peer_mem module (even though recompiling it works fine).

I'm surprised, though, that CG works just fine with GDR enabled even though the nv_peer_mem module cannot be loaded.
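
For completeness, the checks we do on the nodes look roughly like this (a sketch only; the module is built from the usual nv_peer_memory sources and the exact steps depend on the cluster setup):

# sketch: check whether the nv_peer_mem kernel module is present and loadable
lsmod | grep nv_peer_mem      # is it currently loaded?
sudo modprobe nv_peer_mem     # try loading it against the running kernel
dmesg | tail                  # look for load failures / driver version mismatches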

maddyscientist commented 4 years ago

Any luck coming up with a run of the QUDA unit test that triggers this error as well?

What I suspect here is that there is some degeneracy in the autotuning string for two different parameter sets of the same kernel, which leads to this failed kernel launch. Can you also check whether the error persists when run with tuning disabled (QUDA_ENABLE_TUNING=0)? Have you ever seen this issue on Piz Daint?
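
Something along these lines should be enough to check (a sketch only; exe and ARGS stand for your test binary and its options):

# sketch: rerun the failing case with autotuning switched off
QUDA_ENABLE_TUNING=0 srun ${exe} ${ARGS}
# alternatively, remove the cached tuning parameters (tunecache.tsv in the resource path)
# so that everything is retuned from scratch
rm ${QUDA_RESOURCE_PATH}/tunecache.tsv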

kostrzewa commented 4 years ago

I've reproduced this with multigrid_invert_test (c7012904286f3de5e02a0d91b0dabe666bbb11ec) now:

with tuning

#!/bin/bash
#SBATCH --job-name=multigrid_invert_test_cB211.25.24
#SBATCH --mail-type=ALL
#SBATCH --mail-user=xxxxxxxxxxxxxxxxxxxxxxxxxx
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=62G
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:kepler:4
#SBATCH --reservation=quda_kepler_testing

gdr=0
p2p=3

quda_id=quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests
gpu_arch=kepler

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/qbigwork2/bartek/libs/bleeding_edge/${gpu_arch}/${quda_id}/lib

exe=/hadron/bartek/build/bleeding_edge/${gpu_arch}/${quda_id}/tests/multigrid_invert_test
export QUDA_RESOURCE_PATH=/qbigwork2/bartek/misc/quda_resources/${gpu_arch}_${quda_id}_gdr${gdr}_p2p${p2p}
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
  mkdir -p ${QUDA_RESOURCE_PATH}
fi

valgrind= #valgrind

ppn=4
tpt=2

ARGS="--recon 12 --recon-sloppy 8 --prec double --nsrc 16
 --dslash-type twisted-clover --compute-clover true --dim 24 24 12 12 --gridsize 1 1 2 4
 --load-gauge /hiskp4/gauges/nf211/cB211a.25.24/conf.0000 --kappa 0.1394267 --mu 0.00072
 --clover-coeff 0.235631123 --rank-order row --verbosity verbose --tol 1e-9"

MG_ARGS_COMMON="--prec-sloppy single --prec-precondition half --prec-null half
 --recon-precondition 8 --mg-levels 3 --mg-block-size 0 3 4 3 3 --mg-block-size 1 2 3 2 2
 --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg 
 --mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr --mg-verbosity 0 verbose
 --mg-verbosity 1 verbose --mg-verbosity 2 verbose --pipeline 8 --reliable-delta 7.5e-6 --ngcrkrylov 24"

MG_ARGS="--mg-mu-factor 2 70.0 --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr
 --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4
 --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 8 --mg-coarse-solver-maxiter 1 24
 --mg-coarse-solver-maxiter 2 24 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1
 --mg-nvec 2 24"

export ARGS="${ARGS} ${MG_ARGS_COMMON} ${MG_ARGS}"

OMP_PLACES=cores OMP_PROC_BIND=close \
  QUDA_RESOURCE_PATH=${QUDA_RESOURCE_PATH} OMP_NUM_THREADS=$tpt \
  QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} QUDA_ENABLE_TUNING=1 \
  QUDA_ENABLE_DEVICE_MEMORY_POOL=0 \
  time srun ${valgrind} ${exe} ${ARGS} 2>&1 | tee ${SLURM_JOB_NAME}_no_defl_mu0.00072.out

Here's the full log: multigrid_invert_test_cB211.25.24_no_defl_mu0.00072.txt

without tuning

#!/bin/bash
#SBATCH --job-name=multigrid_invert_test_cB211.25.24
#SBATCH --mail-type=ALL
#SBATCH --mail-user=xxxxxxxxxxxxxxxxxxxxxx
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=62G
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:kepler:4
#SBATCH --reservation=quda_kepler_testing

gdr=0
p2p=3

quda_id=quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests
gpu_arch=kepler

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/qbigwork2/bartek/libs/bleeding_edge/${gpu_arch}/${quda_id}/lib

exe=/hadron/bartek/build/bleeding_edge/${gpu_arch}/${quda_id}/tests/multigrid_invert_test
export QUDA_RESOURCE_PATH=/qbigwork2/bartek/misc/quda_resources/${gpu_arch}_${quda_id}_gdr${gdr}_p2p${p2p}
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
  mkdir -p ${QUDA_RESOURCE_PATH}
fi

valgrind= #valgrind

ppn=4
tpt=2

ARGS="--recon 12 --recon-sloppy 8 --prec double --nsrc 16
 --dslash-type twisted-clover --compute-clover true --dim 24 24 12 12 --gridsize 1 1 2 4
 --load-gauge /hiskp4/gauges/nf211/cB211a.25.24/conf.0000 --kappa 0.1394267 --mu 0.00072
 --clover-coeff 0.235631123 --rank-order row --verbosity verbose --tol 1e-9"

MG_ARGS_COMMON="--prec-sloppy single --prec-precondition half --prec-null half
 --recon-precondition 8 --mg-levels 3 --mg-block-size 0 3 4 3 3 --mg-block-size 1 2 3 2 2
 --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg 
 --mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr --mg-verbosity 0 verbose
 --mg-verbosity 1 verbose --mg-verbosity 2 verbose --pipeline 8 --reliable-delta 7.5e-6 --ngcrkrylov 24"

MG_ARGS="--mg-mu-factor 2 70.0 --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr
 --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4
 --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 8 --mg-coarse-solver-maxiter 1 24
 --mg-coarse-solver-maxiter 2 24 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1
 --mg-nvec 2 24"

export ARGS="${ARGS} ${MG_ARGS_COMMON} ${MG_ARGS}"
OMP_PLACES=cores OMP_PROC_BIND=close \
  OMP_NUM_THREADS=$tpt \
  QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} QUDA_ENABLE_TUNING=0 \
  QUDA_ENABLE_DEVICE_MEMORY_POOL=0 \
  time srun ${valgrind} ${exe} ${ARGS} 2>&1 | tee ${SLURM_JOB_NAME}_NO_TUNING_no_defl_mu0.00072.out

And this also has the same convergence problems: multigrid_invert_test_cB211.25.24_NO_TUNING_no_defl_mu0.00072.txt

kostrzewa commented 4 years ago

Have you ever seen this issue on Piz Daint?

Not that I recall, no. I've also done multi-node inversions using QUDA-MG on our P100 nodes (in that case using 3 x 4 P100s for a 48c96 lattice, before QUDA_TEX=OFF was introduced, so that was quite a while ago...)

kostrzewa commented 4 years ago

I can confirm that changing the rank ordering fixes the issue, both with tuning (multigrid_invert_test_cB211.25.24_RANK_ORDER_COL_no_defl_mu0.00072.txt) and without tuning (multigrid_invert_test_cB211.25.24_NO_TUNING_RANK_ORDER_COL_no_defl_mu0.00072.txt).

I'll probably need to figure out whether we can use this workaround from within the tmLQCD interface for the time being, until the underlying problem can be fixed.
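
For the record, the only change relative to the job scripts above is the rank-order flag passed to the test executable; schematically (a sketch, reusing the exe and ARGS variables from the scripts above):

# sketch: the working run only swaps the rank ordering in the argument string
ARGS="${ARGS/--rank-order row/--rank-order col}"
srun ${exe} ${ARGS} 2>&1 | tee ${SLURM_JOB_NAME}_RANK_ORDER_COL_no_defl_mu0.00072.out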

kostrzewa commented 4 years ago

Note that in the meantime we've also fixed the nv_peer_mem issue on the cluster.

kostrzewa commented 4 years ago

@maddyscientist if you'd like to test, I can keep the quda_kepler_testing reservation for the time being (two nodes)

maddyscientist commented 4 years ago

@kostrzewa thanks for the update on your testing of this. Can you confirm that this issue occurs with both QUDA_TEX=ON and QUDA_TEX=OFF?
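
(For the comparison I mean two builds that differ only in the texture-read option; roughly, and assuming the option is still spelled QUDA_TEX in your checkout, something like:)

# sketch: two builds differing only in whether texture reads are enabled
# /path/to/quda and the build directory names are placeholders
cmake -S /path/to/quda -B build_tex_on  -DQUDA_TEX=ON  && cmake --build build_tex_on  -j
cmake -S /path/to/quda -B build_tex_off -DQUDA_TEX=OFF && cmake --build build_tex_off -j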

kostrzewa commented 4 years ago

@maddyscientist I'm afraid QUDA_TEX makes no difference. Below are the logs from code compiled with QUDA_TEX=ON.

with tuning: multigrid_invert_test_cB211.25.24_quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests-with_quda_tex_no_defl_mu0.00072.txt

without tuning: multigrid_invert_test_cB211.25.24_quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests-with_quda_tex_NO_TUNING_no_defl_mu0.00072.txt

rank_order col with tuning: multigrid_invert_test_cB211.25.24_quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests-with_quda_tex_RANK_ORDER_COL_no_defl_mu0.00072.txt

rank_order col without tuning: multigrid_invert_test_cB211.25.24_quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests-with_quda_tex_NO_TUNING_RANK_ORDER_COL_no_defl_mu0.00072.txt

kostrzewa commented 4 years ago

After the tests described in the comments on #988, I would like to close this issue: for every problem size that I tried, I was able to find some combination of parameters (gauge compression, GDR, P2P) which allows the inversions to run through correctly. While there are still some strange behaviours, given that Kepler is being phased out, I'm not sure it would be a good use of time to track down these residual issues. Thanks to @weinbe2 for the fix!
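
For reference, the kind of scan I ran to find working combinations, schematically (a sketch only; ARGS_NO_RECON is a hypothetical variant of the ARGS string from the scripts above with the two recon flags removed):

# sketch: scan gauge compression (recon) / GDR / P2P combinations for a given problem size
# ARGS_NO_RECON: hypothetical ARGS string without --recon / --recon-sloppy
for recon in 18 12 8; do
  for gdr in 0 1; do
    for p2p in 0 1 2 3; do
      QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} \
        srun ${exe} ${ARGS_NO_RECON} --recon ${recon} --recon-sloppy ${recon} \
        2>&1 | tee sweep_recon${recon}_gdr${gdr}_p2p${p2p}.out
    done
  done
done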

maddyscientist commented 4 years ago

@kostrzewa thanks for the update. Apart from the remaining issue in #988 that you reported, did you find any other problems that you had to work around?

kostrzewa commented 4 years ago

@maddyscientist All the cases except for the 24c48 lattice work fine without requiring any workarounds, so I'm quite happy :) Thanks.