kostrzewa closed this issue 4 years ago.
And here is the same run without GDR (and without the synchronisation problem), but with the issue from #798 of GCR aborting due to residual norm increases.
I think this failure with GDR might be related to the fact that, with the driver shipped with the latest CUDA 9.2 series, we are unable to load the nv_peer_mem module (even after recompiling it, which works fine).
I'm surprised, though, that CG works just fine with GDR enabled, even though the nv_peer_mem module cannot be loaded.
Any luck coming up with a run of the QUDA unit test that triggers this error as well?
What I suspect here is that there is some degeneracy in the autotuning string for two different parameters of the same kernel, which leads to this failed kernel launch. Can you also check whether the error persists when running with tuning disabled (QUDA_ENABLE_TUNING=0)? Have you ever seen this issue on Piz Daint?
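For reference, a minimal way to rerun with tuning disabled would be something like the following (the resource path and executable here are placeholders, not the actual paths from your setup); pointing QUDA_RESOURCE_PATH at a fresh directory ensures no previously written tunecache is picked up:

```shell
#!/bin/bash
# Disable QUDA kernel autotuning for this run only and use a clean
# resource path so no stale tunecache can mask the failure.
export QUDA_ENABLE_TUNING=0
export QUDA_RESOURCE_PATH=/tmp/quda_resources_notuning  # placeholder path
mkdir -p "${QUDA_RESOURCE_PATH}"

# Placeholder invocation; substitute your actual executable and arguments.
srun ./multigrid_invert_test ${ARGS}
```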
I've now reproduced this with multigrid_invert_test (c7012904286f3de5e02a0d91b0dabe666bbb11ec):
#!/bin/bash
#SBATCH --job-name=multigrid_invert_test_cB211.25.24
#SBATCH --mail-type=ALL
#SBATCH --mail-user=xxxxxxxxxxxxxxxxxxxxxxxxxx
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=62G
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:kepler:4
#SBATCH --reservation=quda_kepler_testing
gdr=0
p2p=3
quda_id=quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests
gpu_arch=kepler
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/qbigwork2/bartek/libs/bleeding_edge/${gpu_arch}/${quda_id}/lib
exe=/hadron/bartek/build/bleeding_edge/${gpu_arch}/${quda_id}/tests/multigrid_invert_test
export QUDA_RESOURCE_PATH=/qbigwork2/bartek/misc/quda_resources/${gpu_arch}_${quda_id}_gdr${gdr}_p2p${p2p}
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
mkdir -p ${QUDA_RESOURCE_PATH}
fi
valgrind= #valgrind
ppn=4
tpt=2
ARGS="--recon 12 --recon-sloppy 8 --prec double --nsrc 16
--dslash-type twisted-clover --compute-clover true --dim 24 24 12 12 --gridsize 1 1 2 4
--load-gauge /hiskp4/gauges/nf211/cB211a.25.24/conf.0000 --kappa 0.1394267 --mu 0.00072
--clover-coeff 0.235631123 --rank-order row --verbosity verbose --tol 1e-9"
MG_ARGS_COMMON="--prec-sloppy single --prec-precondition half --prec-null half
--recon-precondition 8 --mg-levels 3 --mg-block-size 0 3 4 3 3 --mg-block-size 1 2 3 2 2
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg
--mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr --mg-verbosity 0 verbose
--mg-verbosity 1 verbose --mg-verbosity 2 verbose --pipeline 8 --reliable-delta 7.5e-6 --ngcrkrylov 24"
MG_ARGS="--mg-mu-factor 2 70.0 --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4
--mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 8 --mg-coarse-solver-maxiter 1 24
--mg-coarse-solver-maxiter 2 24 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1
--mg-nvec 2 24"
export ARGS="${ARGS} ${MG_ARGS_COMMON} ${MG_ARGS}"
OMP_PLACES=cores OMP_PROC_BIND=close \
QUDA_RESOURCE_PATH=${QUDA_RESOURCE_PATH} OMP_NUM_THREADS=$tpt \
QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} QUDA_ENABLE_TUNING=1 \
QUDA_ENABLE_DEVICE_MEMORY_POOL=0 \
time srun ${valgrind} ${exe} ${ARGS} 2>&1 | tee ${SLURM_JOB_NAME}_no_defl_mu0.00072.out
Here's the full log: multigrid_invert_test_cB211.25.24_no_defl_mu0.00072.txt
#!/bin/bash
#SBATCH --job-name=multigrid_invert_test_cB211.25.24
#SBATCH --mail-type=ALL
#SBATCH --mail-user=xxxxxxxxxxxxxxxxxxxxxx
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=62G
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:kepler:4
#SBATCH --reservation=quda_kepler_testing
gdr=0
p2p=3
quda_id=quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests
gpu_arch=kepler
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/qbigwork2/bartek/libs/bleeding_edge/${gpu_arch}/${quda_id}/lib
exe=/hadron/bartek/build/bleeding_edge/${gpu_arch}/${quda_id}/tests/multigrid_invert_test
export QUDA_RESOURCE_PATH=/qbigwork2/bartek/misc/quda_resources/${gpu_arch}_${quda_id}_gdr${gdr}_p2p${p2p}
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
mkdir -p ${QUDA_RESOURCE_PATH}
fi
valgrind= #valgrind
ppn=4
tpt=2
ARGS="--recon 12 --recon-sloppy 8 --prec double --nsrc 16
--dslash-type twisted-clover --compute-clover true --dim 24 24 12 12 --gridsize 1 1 2 4
--load-gauge /hiskp4/gauges/nf211/cB211a.25.24/conf.0000 --kappa 0.1394267 --mu 0.00072
--clover-coeff 0.235631123 --rank-order row --verbosity verbose --tol 1e-9"
MG_ARGS_COMMON="--prec-sloppy single --prec-precondition half --prec-null half
--recon-precondition 8 --mg-levels 3 --mg-block-size 0 3 4 3 3 --mg-block-size 1 2 3 2 2
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg
--mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr --mg-verbosity 0 verbose
--mg-verbosity 1 verbose --mg-verbosity 2 verbose --pipeline 8 --reliable-delta 7.5e-6 --ngcrkrylov 24"
MG_ARGS="--mg-mu-factor 2 70.0 --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4
--mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 8 --mg-coarse-solver-maxiter 1 24
--mg-coarse-solver-maxiter 2 24 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1
--mg-nvec 2 24"
export ARGS="${ARGS} ${MG_ARGS_COMMON} ${MG_ARGS}"
OMP_PLACES=cores OMP_PROC_BIND=close \
OMP_NUM_THREADS=$tpt \
QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} QUDA_ENABLE_TUNING=0 \
QUDA_ENABLE_DEVICE_MEMORY_POOL=0 \
time srun ${valgrind} ${exe} ${ARGS} 2>&1 | tee ${SLURM_JOB_NAME}_NO_TUNING_no_defl_mu0.00072.out
And this also has the same convergence problems: multigrid_invert_test_cB211.25.24_NO_TUNING_no_defl_mu0.00072.txt
Ever seen the issue on Piz Daint?
Not that I recall, no. I've also done multi-node inversions using QUDA-MG on our P100 nodes (in that case using 3 x 4 P100s for a 48c96 lattice, before QUDA_TEX=OFF was introduced, so this was quite a while ago...).
I can confirm that changing the rank ordering fixes the issue, both with tuning (multigrid_invert_test_cB211.25.24_RANK_ORDER_COL_no_defl_mu0.00072.txt) and without tuning (multigrid_invert_test_cB211.25.24_NO_TUNING_RANK_ORDER_COL_no_defl_mu0.00072.txt).
I'll probably need to figure out whether we can use this workaround from within the tmLQCD interface for the time being, until the underlying problem can be fixed.
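With the test executable, the workaround is just a matter of switching the --rank-order flag; a sketch relative to the job script above (the output filename here is a placeholder):

```shell
#!/bin/bash
# Same invocation as in the job script above, but requesting
# column-major rank ordering, which works around the failure.
ARGS="${ARGS} --rank-order col"

time srun ${valgrind} ${exe} ${ARGS} 2>&1 | \
  tee ${SLURM_JOB_NAME}_RANK_ORDER_COL.out  # placeholder log name
```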
Note that in the meantime we've also fixed the nv_peer_mem issue on the cluster.
@maddyscientist if you'd like to test, I can keep the quda_kepler_testing reservation (two nodes) for the time being.
@kostrzewa thanks for the update on your testing of this. Can you confirm that this issue occurs with both QUDA_TEX=ON and QUDA_TEX=OFF?
@maddyscientist I'm afraid QUDA_TEX makes no difference. Below are logs from code compiled with QUDA_TEX=ON.
rank_order col with tuning: multigrid_invert_test_cB211.25.24_quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests-with_quda_tex_RANK_ORDER_COL_no_defl_mu0.00072.txt
rank_order col without tuning: multigrid_invert_test_cB211.25.24_quda_develop-dynamic_clover-c7012904286f3de5e02a0d91b0dabe666bbb11ec-with_tests-with_quda_tex_NO_TUNING_RANK_ORDER_COL_no_defl_mu0.00072.txt
After the tests described in the comments to #988, I would like to close this issue: for every problem size that I tried, I was able to find some combination of parameters (gauge compression, GDR, P2P) that allows inversions to run through correctly. While there are still some strange behaviours, given that Kepler is being phased out, I'm not sure it would be a good use of time to track down these residual issues. Thanks to @weinbe2 for the fix!
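For anyone trying to reproduce this kind of parameter search, a hypothetical sweep over the combinations mentioned above (GDR, P2P, gauge compression) could look like this; all paths are placeholders, and a separate resource path per combination keeps the tunecaches independent:

```shell
#!/bin/bash
# Hypothetical sweep over GDR, P2P, and gauge reconstruction settings.
# ${exe} and ${ARGS} are assumed to be set as in the job scripts above.
for gdr in 0 1; do
  for p2p in 0 1 2 3; do
    for recon in 18 12 8; do
      # Separate tunecache per combination (placeholder base path).
      export QUDA_RESOURCE_PATH=/tmp/quda_resources/gdr${gdr}_p2p${p2p}_recon${recon}
      mkdir -p "${QUDA_RESOURCE_PATH}"

      QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} \
        srun ${exe} ${ARGS} --recon ${recon} 2>&1 | \
        tee sweep_gdr${gdr}_p2p${p2p}_recon${recon}.out
    done
  done
done
```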
@kostrzewa thanks for the update. Apart from the remaining issue you reported in #988, were there any other issues you had to work around?
@maddyscientist All the cases except for the 24c48 lattice work fine without requiring any workarounds, so I'm quite happy :) Thanks.
If you find some time, could I ask for help with diagnosing what's happening here?
I will attach the full log below. I've tried different combinations of GDR and P2P settings; even with GDR disabled, I still see the problems reported in #798, so the two issues might be related.