lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Staggered heavy-quark residual fails to regulate HISQ CG convergence properly with an odd checkerboard source #1376

Open detar opened 1 year ago

detar commented 1 year ago

When the inversion source has support on only even sites, convergence seems to be OK with a tiny heavy-quark residual. But when a source has support on only odd sites, the residual decreases at a glacial rate while the inverter runs out of restarts, leaving a heavy-quark residual of order one or two hundred. The propagator solution at moderate distance from the source seems to be very sensitive to the input stopping conditions, so I suspect it is not properly converged.
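For context on the quantity in question: the heavy-quark residual is a worst-case sitewise measure rather than a global L2 norm, which is why it can sit at order one hundred while the ordinary residual looks fine. A minimal numpy sketch of the usual definition (an illustration only, not QUDA's actual implementation; the function name is made up):

```python
import numpy as np

def heavy_quark_residual(r, x):
    """Worst-case sitewise relative residual: sqrt(max_i |r_i|^2 / |x_i|^2).

    r, x: arrays of shape (nsites, ncolor); each row is one lattice site.
    Sketch of the standard definition, not QUDA's exact code.
    """
    r2 = np.sum(np.abs(r) ** 2, axis=1)   # |r(site)|^2 per site
    x2 = np.sum(np.abs(x) ** 2, axis=1)   # |solution(site)|^2 per site
    mask = x2 > 0                         # skip sites where the solution vanishes
    return np.sqrt(np.max(r2[mask] / x2[mask]))

# A solution that is tiny at a single site inflates the heavy-quark residual
# even when the global (L2) residual looks converged.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3)) + 1j * rng.normal(size=(16, 3))
r = 1e-8 * (rng.normal(size=(16, 3)) + 1j * rng.normal(size=(16, 3)))
x[5] *= 1e-9                              # one nearly-null site dominates the max
print(heavy_quark_residual(r, x))
```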

weinbe2 commented 1 year ago

Thanks for the info, Carleton. Do you have a reference MILC input file I can use to reproduce this? Also, what ensemble(s) have you been seeing this on?

detar commented 1 year ago

Thank you for helping with this, Evan. I should say first that I am building QUDA with Jim Simone's branch, "not_a_feature_rather_a_hope/staggered_correlator_gk", but he has been merging develop into it. In retrospect, I found the same problem in output logs from Perlmutter with QUDA/develop, so the problem should also be reproducible with the develop branch. It will take a little time to create the reproducer. In the meantime, it would be worth looking at the code to see if there is some obvious difference in the heavy-quark residual treatment between an exclusively even-site and an exclusively odd-site source.


weinbe2 commented 1 year ago

Thanks Carleton, the reproducer may be necessary so I can understand the full workflow. The CG code "doesn't know" about even/odd, it's just handed an operator. The stencil code knows the bare minimum necessary, most relevantly in the prepare/reconstruct bits. You can double check here, though I just did a skim and all looks good (aka "symmetric" between even and odd), check the code for the DiracImprovedStaggeredPC operator: https://github.com/lattice/quda/blob/develop/lib/dirac_improved_staggered.cpp

Remind me, does MILC use the odd Schur op directly, or does it prepare an even source, use the even op, and reconstruct?

detar commented 1 year ago

Thanks, Evan.  Do you have access to Frontier?  Or should I port the reproducer to Summit?

MILC calls QUDA specifying the "parity" of the solve.  For a strictly odd parity source (rhs), its even-odd block decomposition has the form

[  0  ]
[ b_o ]

and, when QUDA is called, specifying odd parity, the solution should have the form

[   0   ]
[ B b_o ]

where B = 1/(D^2 + 4m^2).

The MILC code "reconstructs" by multiplying by M^\dagger:

[ -D B b_o ]
[ 2m B b_o ]

For a strictly even-parity source with a call to QUDA specifying even parity, we just interchange even and odd here.  There is nothing asymmetric in the MILC treatment outside QUDA.
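The parity bookkeeping described above can be checked numerically. The numpy sketch below (with a small random anti-Hermitian D standing in for the staggered stencil, so sizes and names are illustrative) solves the normal equations M M^dagger y = b for an odd-only source and reconstructs x = M^dagger y: the intermediate y has support only on the odd sites, matching the B b_o of the discussion, and the reconstruction agrees with a direct solve of M x = b.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # sites per parity (toy size)
m = 0.1                                 # quark mass

# Staggered-like operator: M = 2m + D with D anti-Hermitian and
# purely even-odd off-diagonal (even sites first, then odd).
D_eo = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Z = np.zeros((n, n), dtype=complex)
D = np.block([[Z, D_eo], [-D_eo.conj().T, Z]])
M = 2 * m * np.eye(2 * n) + D

# Source with support only on odd sites: b = [0, b_o].
b_o = rng.normal(size=n) + 1j * rng.normal(size=n)
b = np.concatenate([np.zeros(n, dtype=complex), b_o])

# Normal-equations solve y = (M M^dag)^{-1} b, then reconstruct x = M^dag y.
# Because D is anti-Hermitian, M M^dag = 4m^2 - D^2 is block diagonal in parity.
y = np.linalg.solve(M @ M.conj().T, b)
x = M.conj().T @ y

# y has support only on the odd sites (the "B b_o" of the discussion) ...
assert np.allclose(y[:n], 0)
# ... and the M^dag reconstruction agrees with a direct solve of M x = b.
assert np.allclose(x, np.linalg.solve(M, b))
```

Interchanging the even and odd blocks leaves the algebra unchanged, which is why nothing outside the solver should distinguish the two parities.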


weinbe2 commented 1 year ago

Thanks Carleton. I'm in the middle of moving and I'm not quite sure where my keyfob is right now---can you send me your submit script and input file via Slack or e-mail? I have configurations of various sizes (64^3, 96^3, 144^3, 192^3) on our internal cluster; hopefully one of those will suffice to reproduce the issue.

Also, thank you for describing the measurement. Between your description and my code investigations, I can't spot any inherent issue/asymmetry, but extra investigations will clearly be in order.

weinbe2 commented 1 year ago

One question---is the host source in MILC single parity or the length of the full volume? It looks like qudaInvert is assuming it is a full volume source (contiguous even and odd), and as such there is an offset into just the odd part. This has been the assumption in the code for a very, very long time---I checked, and it's been this way for 10 years!!

If in reality the source in MILC is single parity (just odd), this could give garbage---and if, on the off chance, the memory it points to is zero, it's reasonable that it would be a very slow (as well as meaningless) solve.
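The layout assumption can be illustrated with a toy parity-ordered field (all names here are illustrative, not QUDA's or MILC's actual interfaces): a full-volume field stores the even sites first, so the odd part starts at an offset of half the volume, and applying that same offset to a field that is in fact single parity leaves nothing valid to read.

```python
import numpy as np

volume, ncolor = 8, 3                    # toy local volume (sites) and colors
half = volume // 2

# Full-volume field in contiguous even-odd order: even sites, then odd sites.
full = np.arange(volume * ncolor, dtype=float).reshape(volume, ncolor)

def odd_part_assuming_full_volume(field):
    """What an interface does when it assumes a full-volume host field:
    index to the second half of the array to reach the odd-parity sites."""
    return field[half:]

# Correct when the host really passes a full-volume field:
assert odd_part_assuming_full_volume(full).shape == (half, ncolor)

# But if the host field were single parity (odd sites only, half the length),
# the same offset would point past the intended data entirely:
single_parity = full[half:].copy()       # only the odd sites
tail = odd_part_assuming_full_volume(single_parity)
print(tail.shape)                        # empty: nothing left past the offset
```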

For reference:

detar commented 1 year ago

All of the MILC color vector fields are full volume.


detar commented 1 year ago

Hi Evan,

I sent a bunch of files in my last message, but I just discovered that our MIMEdefang filter probably deleted the shell script attachments. So here they are with modified extensions.

Best,

Carleton


#!/bin/bash

source env.sh

QUDA_INSTALL=${INSTALLROOT}/quda

LIBQUDA="-Wl,-rpath ${QUDA_INSTALL}/lib -L${QUDA_INSTALL}/lib -lquda -D__gfx90a --amdgpu-target=gfx90a -Wl,-rpath=${ROCM_PATH}/hiprand/lib -L${ROCM_PATH}/hiprand/lib -Wl,-rpath=${ROCM_PATH}/rocfft/lib -L${ROCM_PATH}/rocfft/lib -lhiprand -lrocfft -Wl,-rpath=${ROCM_PATH}/hipblas/lib -L${ROCM_PATH}/hipblas/lib -lhipblas -Wl,-rpath=${ROCM_PATH}/rocblas/lib -L${ROCM_PATH}/rocblas/lib -lrocblas -Wl,-rpath=${ROCM_PATH}/hip/lib"

############ Make ks_spectrum_hisq ##################
cd milc_qcd/ks_spectrum
cp ../Makefile .
make clean

MY_CC=hipcc \
MY_CXX=hipcc \
ARCH="" \
COMPILER="gnu" \
OFFLOAD="HIP" \
OPT="-O3 -Ofast -g" \
PATH_TO_NVHPCSDK="" \
CUDA_HOME="" \
QUDA_HOME=${QUDA_INSTALL} \
QUDA_VERBOSITY=VERBOSE \
WANTQUDA=true \
WANT_FN_CG_GPU=true \
WANT_FL_GPU=true \
WANT_GF_GPU=true \
WANT_FF_GPU=true \
WANT_KS_CONT_GPU=true \
WANT_SHIFT_GPU=true \
WANT_SPIN_TASTE_GPU=true \
WANT_GAUGEFIX_OVR_GPU=true \
WANT_MIXED_PRECISION_GPU=1 \
PRECISION=2 \
MPP=true \
OMP=true \
WANTQIO=true \
WANTQMP=true \
QIOPAR=/ccs/home/detar/frontier/quda/install/quda \
QMPPAR=/ccs/home/detar/frontier/quda/install/quda \
LIBQUDA=${LIBQUDA} \
CGEOM="-DFIX_NODE_GEOM -DFIX_IONODE_GEOM" \
KSCGMULTI="-DKS_MULTICG=HYBRID -DMULTISOURCE -DMULTIGRID" \
CTIME="-DNERSC_TIME -DCGTIME -DFFTIME -DFLTIME -DGFTIME -DREMAP -DPRTIME -DIOTIME" \
make -j 1 ks_spectrum_hisq
cd ..

############ Make su3_rhmd_hisq ##################
cd ks_imp_rhmc
cp ../Makefile .
make clean

MY_CC=hipcc \
MY_CXX=hipcc \
ARCH="" \
COMPILER="gnu" \
OFFLOAD="HIP" \
OPT="-O3 -Ofast" \
PATH_TO_NVHPCSDK="" \
CUDA_HOME="" \
QUDA_HOME=${QUDA_INSTALL} \
WANTQUDA=true \
WANT_FN_CG_GPU=true \
WANT_FL_GPU=true \
WANT_GF_GPU=true \
WANT_FF_GPU=true \
WANT_GAUGEFIX_OVR_GPU=true \
WANT_MIXED_PRECISION_GPU=2 \
PRECISION=1 \
MPP=true \
OMP=true \
WANTQIO=true \
WANTQMP=true \
QIOPAR=/ccs/home/detar/frontier/quda/install/quda \
QMPPAR=/ccs/home/detar/frontier/quda/install/quda \
LIBQUDA=${LIBQUDA} \
CGEOM="-DFIX_NODE_GEOM -DFIX_IONODE_GEOM" \
KSCGMULTI="-DKS_MULTICG=HYBRID -DMULTISOURCE -DMULTIGRID" \
CTIME="-DNERSC_TIME -DCGTIME -DFFTIME -DFLTIME -DGFTIME -DREMAP -DPRTIME -DIOTIME" \
make -j 1 su3_rhmd_hisq
cd ../..

#!/bin/bash

BRANCH=not_a_feature_rather_a_hope/staggered_correlator_gk

BRANCH=develop

source env.sh

pushd quda

QUDA_HOME=$(pwd)

if [ -d quda ]
then
  cd quda
  git pull
  git checkout develop
  git checkout ${BRANCH}
else
  git clone --branch ${BRANCH} https://github.com/lattice/quda
  cd quda
  git checkout ${BRANCH}
fi
cd ..

mkdir -p build && cd build

cmake \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_CXX_COMPILER=CC \
  -DCMAKE_CXX_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_C_COMPILER=cc \
  -DCMAKE_C_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_C_STANDARD=99 \
  -DCMAKE_EXE_LINKER_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_SHARED_LINKER_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_HIP_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_INSTALL_PREFIX=${INSTALLROOT}/quda \
  -DQUDA_BUILD_SHAREDLIB=ON \
  -DQUDA_CONTRACT=ON \
  -DQUDA_COVDEV=ON \
  -DQUDA_DIRAC_DEFAULT_OFF=ON \
  -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_DOWNLOAD_USQCD=ON \
  -DQUDA_GPU_ARCH=gfx90a \
  -DQUDA_QIO=ON \
  -DQUDA_QMP=ON \
  -DQUDA_TARGET_TYPE=HIP \
  -DROCM_PATH=${ROCM_PATH} \
  ${QUDA_HOME}/quda

make -j16 install

cd ..

#!/bin/bash

#SBATCH -t 10:00
#SBATCH -N 96
#SBATCH -n 768
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-node=8
####### -J (command line in spawnjob.py)
#SBATCH -A phy157-ecphisq
#SBATCH -V
####### -C nvme
#SBATCH -S 2

# Submission command must define the environment variable RUNCMDFILE:
#   sbatch -N ${NODES} -t ${walltime} -J ${jobname} ${slurm_script}

nodes=96
umask 0022

source env.sh

# Run production jobs out of $SCRATCH
SCRATCH_HOME=/gpfs/alpine/proj-shared/phy157/phy157hisq/detar/allHISQ/frontier
MYSCRATCH=${SCRATCH_HOME}/l144288f211b700m000569m01555m1827
mkdir -p ${MYSCRATCH}
cd ${MYSCRATCH}

# QUDA flags
export QUDA_ENABLE_GDR=1
export QUDA_ENABLE_P2P=1
export QUDA_MILC_HISQ_RECONSTRUCT=13
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9
export QUDA_RESOURCE_PATH=$(pwd)  # location of QUDA tunecache file

# MPICH flags
export MPICH_GPU_SUPPORT_ENABLED=1  # Allow GDR
export MPICH_COLL_SYNC=MPI_Bcast    # Sometimes broadcast is not synchronizing....
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_OFI_NIC_POLICY=BLOCK
export MPICH_SMP_SINGLE_COPY_MODE=CMA

# Balint 6-CPU masks for Frontier
export OMP_NUM_THREADS=6
export OMP_PROC_BIND=spread
MASK_0="0x003f000000000000"
MASK_1="0x3f00000000000000"
MASK_2="0x00000000003f0000"
MASK_3="0x000000003f000000"
MASK_4="0x000000000000007e"
MASK_5="0x0000000000007e00"
MASK_6="0x0000003f00000000"
MASK_7="0x00003f0000000000"
MEMBIND="--mem-bind=map_mem:3,3,1,1,0,0,2,2"
CPU_MASK="--cpu-bind=mask_cpu:${MASK_0},${MASK_1},${MASK_2},${MASK_3},${MASK_4},${MASK_5},${MASK_6},${MASK_7}"

srun -n 768 -N 96 ks_spectrum_hisq --distribution=*:block ${CPU_MASK} -qmp-geom 2 4 4 24 -qmp-alloc-map 3 2 1 0 -qmp-logic-map 3 2 1 0 intest outtest

#!/bin/bash

#SBATCH -t 10:00
#SBATCH -n 768
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-node=8
####### -J (command line in spawnjob.py)
#SBATCH -A phy157-ecphisq
#SBATCH -V
####### -C nvme
#SBATCH -S 2

# Submission command must define the environment variable RUNCMDFILE:
#   sbatch -N ${NODES} -t ${walltime} -J ${jobname} ${slurm_script}

nodes=96
umask 0022

source env.sh

# Run production jobs out of $SCRATCH
SCRATCH_HOME=/gpfs/alpine/proj-shared/phy157/phy157hisq/detar/allHISQ/frontier
MYSCRATCH=${SCRATCH_HOME}/l144288f211b700m000569m01555m1827
mkdir -p ${MYSCRATCH}
cd ${MYSCRATCH}

# QUDA flags
export QUDA_ENABLE_GDR=1
export QUDA_ENABLE_P2P=1
export QUDA_MILC_HISQ_RECONSTRUCT=13
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9
export QUDA_RESOURCE_PATH=$(pwd)  # location of QUDA tunecache file

# MPICH flags
export MPICH_GPU_SUPPORT_ENABLED=1  # Allow GDR
export MPICH_COLL_SYNC=MPI_Bcast    # Sometimes broadcast is not synchronizing....
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_OFI_NIC_POLICY=BLOCK
export MPICH_SMP_SINGLE_COPY_MODE=CMA

# Balint 6-CPU masks for Frontier
export OMP_NUM_THREADS=6
export OMP_PROC_BIND=spread
MASK_0="0x003f000000000000"
MASK_1="0x3f00000000000000"
MASK_2="0x00000000003f0000"
MASK_3="0x000000003f000000"
MASK_4="0x000000000000007e"
MASK_5="0x0000000000007e00"
MASK_6="0x0000003f00000000"
MASK_7="0x00003f0000000000"
MEMBIND="--mem-bind=map_mem:3,3,1,1,0,0,2,2"
CPU_MASK="--cpu-bind=mask_cpu:${MASK_0},${MASK_1},${MASK_2},${MASK_3},${MASK_4},${MASK_5},${MASK_6},${MASK_7}"

srun -n 768 -N 96 ks_spectrum_hisq --distribution=*:block ${CPU_MASK} -qmp-geom 2 4 4 24 -qmp-alloc-map 3 2 1 0 -qmp-logic-map 3 2 1 0 intest outtest

== Geometry ==

prompt 0
nx 144
ny 144
nz 144
nt 288
node_geometry 2 4 4 24
ionode_geometry 2 4 4 24
iseed 402129
job_id 1311221

== Gauge ==

reload_parallel /lustre/orion/proj-shared/phy157/phy157_hisq/detar/allHISQ/l144288f211b700m000569m01555m1827/lat/v5/l144288f211b700m000569m01555m1827a.402
u0 1
no_gauge_fix
forget
staple_weight 0.05
ape_iter 20
coordinate_origin 0 0 0 0
time_bc antiperiodic

== Eigen ==

max_number_of_eigenpairs 0

== PBP Masses ==

number_of_pbp_masses 0

== Base Sources ==

number_of_base_sources 1

== source 0: RandomColorWallSource ==

random_color_wall
field_type KS
subset corner
t0 129
ncolor 1
momentum 0 0 0
source_label RW
forget_source

== Modified Sources ==

number_of_modified_sources 0

== KSsolveSets ==

number_of_sets 1

== KSsolveSet ==

set_type single
inv_type CGZ
max_cg_iterations 4000
max_cg_restarts 10
check yes
momentum_twist 0 0 0
precision 2
source 0
number_of_propagators 2

== propagator 0: KSsolveElement ==

mass 0.000569
naik_term_epsilon 0.
error_for_propagator 1e-7
rel_error_for_propagator 0.0
fresh_ksprop
forget_ksprop

== propagator 1: KSsolveElement ==

mass 0.843
naik_term_epsilon -0.3578
error_for_propagator 0
rel_error_for_propagator 2e-4
fresh_ksprop
forget_ksprop

== Quarks ==

number_of_quarks 2

== quark 0: QuarkIdentitySink ==

propagator 0
identity
op_label d
forget_ksprop

== quark 1: QuarkIdentitySink ==

propagator 1
identity
op_label d
forget_ksprop

number_of_mesons 1

== MesonSpectrum ==

pair 1 0
spectrum_request meson
forget_corr
r_offset 0 0 0 129
number_of_correlators 1
correlator P5-P5 p000-fine 1 / 124416.0 G5-G5 0 0 0 EO EO EO

== Baryons ==

number_of_baryons 0

From Peter for Grid

module swap PrgEnv-cray PrgEnv-amd
module load craype-accel-amd-gfx90a

module load cray-mpich/8.1.23

module load cmake

module load amd/5.3.0

module load cray-hdf5
module load cray-fftw
module load gmp
module load emacs
module unload cray-libsci
module list

# These must be set before running

export TOPDIR_HIP=~/frontier/quda
export SRCROOT=${TOPDIR_HIP}
export BUILDROOT=${TOPDIR_HIP}
export INSTALLROOT=${TOPDIR_HIP}/install
export TARGET_GPU=gfx90a

GTL_ROOT=$PE_MPICH_GTL_DIR_amd_gfx90a

GTL_ROOT=/opt/cray/pe/mpich/8.1.25/gtl/lib

MPI_CFLAGS="-I${MPICH_DIR}/include -g"
MPI_LDFLAGS="-g -Wl,-rpath=${MPICH_DIR}/lib -L${MPICH_DIR}/lib -lmpi -L${GTL_ROOT} -Wl,-rpath=${GTL_ROOT} -lmpi_gtl_hsa"

export PK_BUILD_TYPE="Release"

export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${PATH}

QIOLIB=${INSTALLROOT}/qio/lib
QMPLIB=${INSTALLROOT}/qmp/lib
export LD_LIBRARY_PATH=${INSTALLROOT}/quda/lib:${QMPLIB}:${QIOLIB}:${ROCM_PATH}/llvm/lib64:${ROCM_PATH}/llvm/lib:${MPICH_DIR}/lib:${GTL_ROOT}:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=/opt/cray/pe/gcc/mpfr/3.1.4/lib:${LD_LIBRARY_PATH}

weinbe2 commented 1 year ago

Thank you, Carleton. I'm sorry that I haven't had a chance to test this yet, but I'll be able to on Monday; the requisite scripts are essentially ready to go.

weinbe2 commented 1 year ago

Just an update: as a quick test I checked whether I could reproduce the behavior on a smaller lattice (64^3x96) and was unsuccessful, so now I'm going to try a configuration from an ensemble with the same global volume, beta, quark masses, etc. I'll keep you updated.

weinbe2 commented 1 year ago

I may have found the issue, will post back soon.

weinbe2 commented 1 year ago

I've reproduced the behavior on a 144^3 configuration, for both an odd and even source. It seems like the logic for heavy quark residual reliable updates is breaking down in ways that it wasn't on smaller configurations (i.e., 64^3, 96^3). We have a call on Wednesday and we'll figure out a solution.
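For readers unfamiliar with the mechanism: a reliable update replaces the cheap, recursively updated CG residual with a freshly computed true residual r = b - A x once the iterated residual has dropped by some factor, since the two drift apart in finite precision (more so on larger volumes). A minimal sketch of the idea on a small SPD system (the function and its delta parameter are illustrative simplifications, not QUDA's actual restart logic):

```python
import numpy as np

def cg_reliable(A, b, tol=1e-10, delta=0.1, maxiter=500):
    """CG with reliable updates: recompute the true residual r = b - A @ x
    whenever the iterated residual norm falls below delta times the norm
    at the last reliable update. Sketch only; QUDA's logic has much more
    bookkeeping (mixed precision, heavy-quark checks, restart counting).
    """
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rnorm_last = np.linalg.norm(r)        # residual norm at last reliable update
    for _ in range(maxiter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap            # cheap, recursively updated residual
        if np.linalg.norm(r_new) < delta * rnorm_last:
            r_new = b - A @ x             # reliable update: true residual
            rnorm_last = np.linalg.norm(r_new)
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

rng = np.random.default_rng(2)
G = rng.normal(size=(50, 50))
A = G @ G.T + 50 * np.eye(50)             # well-conditioned SPD test matrix
b = rng.normal(size=50)
x = cg_reliable(A, b)
assert np.linalg.norm(b - A @ x) < 1e-8 * np.linalg.norm(b)
```

If the condition that triggers the recompute (or what happens to the search direction afterwards) misbehaves, the iterated residual can keep shrinking while the true residual stalls, which is consistent with the glacial convergence reported above.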

detar commented 1 year ago

Hi Evan,

Any progress?

Thanks,

Carleton


weinbe2 commented 1 year ago

I've put together a fix for the heavy quark convergence issue that at least works in the cases where I've been able to trigger the issue. Can you please test it for your case, @detar? The code is in the branch hotfix/heavy-quark-restart. Once you've confirmed it works I'll get the ball rolling on a formal PR into develop.