paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

Dependence of HMC result on MPI division #296

Closed i-kanamori closed 3 years ago

i-kanamori commented 4 years ago

Hello,

I found that HMC results change with MPI division, which I believe should not happen. In tests/hmc, the following script

TEST=./Test_hmc_WilsonGauge
mpirun -np 2 $TEST --mpi 1.1.1.2 --StartingType HotStart --Thermalizations 5 --Trajectories 1
mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1

gives (after grepping with "dH", added * by hand)

Grid : Message : 1.252577 s : Total H after trajectory  = 93323.2155773858  dH = 4.92999696306651
Grid : Message : 1.731567 s : Total H after trajectory  = 103474.778089801  dH = 2.47973175933294
Grid : Message : 2.208259 s : Total H after trajectory  = 110234.382335759  dH = 2.20764440232597
Grid : Message : 2.684962 s : Total H after trajectory  = 114866.81585045  dH = 1.3867998868518
Grid : Message : 3.178242 s : Total H after trajectory  = 118805.794947752  dH = 1.3885267591686
Grid : Message : 4.893573 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366772203 *
Grid : Message : 1.203669 s : Total H after trajectory  = 121576.487891702  dH = 0.579078736453084 *
Grid : Message : 1.216451 s : Total H after trajectory  = 121576.487891702  dH = 0.57907873649674  *
Grid : Message : 1.252941 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366772203 *

The last 4 lines started from the same configuration and the same random numbers, but the dH values differ. I tried several Test_[r]hmc_* programs, and they all showed the same behavior as above. Some of them already differ at the initial pseudofermion action; others show no difference in the initial action.

environment I tried:

run_Test_hmc_WilsonGauge.tar.gz grid.configure.summary.txt

paboyle commented 4 years ago

These should differ by rounding, but the deviations look too big for double precision.
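As an illustration of why some rounding-level dependence on the MPI division is expected, here is a minimal sketch (not Grid code) in which the same list of numbers is summed as one chunk or as several per-rank partial sums. The helper `partitioned_sum` and the random test data are assumptions made for the example.

```python
import random


def partitioned_sum(values, nranks):
    """Sum each rank's chunk first, then combine the partial sums."""
    chunk = len(values) // nranks
    partials = [sum(values[r * chunk:(r + 1) * chunk]) for r in range(nranks)]
    return sum(partials)


random.seed(1)
# Stand-in for per-site contributions to a global reduction.
values = [random.uniform(-1.0, 1.0) for _ in range(1 << 16)]

total_1 = partitioned_sum(values, 1)   # like --mpi 1.1.1.1
total_2 = partitioned_sum(values, 2)   # like two ranks
total_4 = partitioned_sum(values, 4)   # like four ranks
scale = sum(abs(v) for v in values)

print(total_1, total_2, total_4)
# Any difference is tiny relative to the size of the terms summed.
print(abs(total_1 - total_2) / scale, abs(total_1 - total_4) / scale)
```

Floating-point addition is not associative, so the totals may disagree in their last few digits, but never by an amount comparable to the terms themselves.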

Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 1 ./Test_hmc_WilsonGauge --mpi 1.1.1.1 > 1.1.1.1.log
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.2.1 > 1.1.2.1.log
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.1.2 > 1.1.1.2.log
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 2.1.1.1> 2.1.1.1.log
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.1.log | head -n 5
Grid : Message : 0.864237 s : Total H after trajectory = 93375.2800421747 dH = 4.92854652258393
Grid : Message : 1.708007 s : Total H after trajectory = 102822.095974671 dH = 2.54167550388956
Grid : Message : 2.563212 s : Total H after trajectory = 110324.450679314 dH = 2.06249609014776
Grid : Message : 3.446958 s : Total H after trajectory = 114845.872704475 dH = 1.4033088391152
Grid : Message : 4.341870 s : Total H after trajectory = 118280.461146461 dH = 1.27991357271094
Peters-Laptop:hmc peterboyle$ grep dH 2.1.1.1.log | head -n 5
Grid : Message : 0.704084 s : Total H after trajectory = 93375.280042175 dH = 4.92854652303504
Grid : Message : 1.408333 s : Total H after trajectory = 102822.095974671 dH = 2.5416755034239
Grid : Message : 2.225654 s : Total H after trajectory = 110324.450679314 dH = 2.06249608984217
Grid : Message : 3.188162 s : Total H after trajectory = 114845.872704475 dH = 1.40330883936258
Grid : Message : 4.109030 s : Total H after trajectory = 118280.46114646 dH = 1.27991357256542
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.2.log | head -n 5
Grid : Message : 0.890846 s : Total H after trajectory = 93375.280042175 dH = 4.92854652303504
Grid : Message : 1.709296 s : Total H after trajectory = 102822.095974671 dH = 2.54167550349666
Grid : Message : 2.638189 s : Total H after trajectory = 110324.450679314 dH = 2.06249608990038
Grid : Message : 3.449204 s : Total H after trajectory = 114845.872704475 dH = 1.40330883934803
Grid : Message : 4.210936 s : Total H after trajectory = 118280.46114646 dH = 1.27991357253632
Peters-Laptop:hmc peterboyle$ grep dH 1.1.2.1.log | head -n 5
Grid : Message : 0.696674 s : Total H after trajectory = 93375.280042175 dH = 4.92854652300593
Grid : Message : 1.347612 s : Total H after trajectory = 102822.095974671 dH = 2.54167550349666
Grid : Message : 1.983796 s : Total H after trajectory = 110324.450679314 dH = 2.06249608976941
Grid : Message : 2.650094 s : Total H after trajectory = 114845.872704475 dH = 1.40330883936258
Grid : Message : 3.269707 s : Total H after trajectory = 118280.46114646 dH = 1.27991357257997

paboyle commented 4 years ago

now looking into the resume from checkpoint

paboyle commented 4 years ago

Hmm... I patched develop today with a single-precision reduction improvement, but I don't think this affects the HMC, which runs purely in double precision.

#!/bin/sh
TEST=./Test_hmc_WilsonGauge
mpirun-openmpi-mp -np 1 ./Test_hmc_WilsonGauge --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1 > 1.1.1.1.log
mpirun-openmpi-mp -np 1 ./Test_hmc_WilsonGauge --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.2.1.1.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.2.1.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.2.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.2.log
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.1.log 
Grid : Message : 0.802586 s : Total H after trajectory  = 93375.2800421747  dH = 4.92854652258393
Grid : Message : 1.593790 s : Total H after trajectory  = 102822.095974671  dH = 2.54167550388956
Grid : Message : 2.512573 s : Total H after trajectory  = 110324.450679314  dH = 2.06249609014776
Grid : Message : 3.321864 s : Total H after trajectory  = 114845.872704475  dH = 1.4033088391152
Grid : Message : 4.989870 s : Total H after trajectory  = 118280.461146461  dH = 1.27991357271094
Grid : Message : 6.919528 s : Total H after trajectory  = 121654.717626298  dH = 0.96397155719751
Peters-Laptop:hmc peterboyle$ grep dH resume.* | grep -v Random
resume.1.1.1.1.log:Grid : Message : 0.956635 s : Total H after trajectory  = 121654.717626298  dH = 0.96397155719751
resume.1.1.1.2.log:Grid : Message : 0.823563 s : Total H after trajectory  = 121654.717626298  dH = 0.963971557532204
resume.1.1.2.1.log:Grid : Message : 0.671931 s : Total H after trajectory  = 121654.717626298  dH = 0.9639715575031
resume.1.2.1.1.log:Grid : Message : 0.686496 s : Total H after trajectory  = 121654.717626298  dH = 0.963971557532204
resume.2.1.1.1.log:Grid : Message : 0.689530 s : Total H after trajectory  = 121654.717626298  dH = 0.9639715575031

Differences at this level are normal double-precision rounding, and the 1.1.1.1 resume reproduces the original run identically. Can you try develop?
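One rough way to separate "rounding" from "too big for double precision" is to express the dH discrepancy in units of machine epsilon relative to the total H, since dH is a difference of two quantities of size H. The sketch below uses values quoted in this thread. Scaling by H is a heuristic assumed for this example, not a criterion that Grid applies.

```python
import sys

EPS = sys.float_info.epsilon  # about 2.2e-16 for IEEE double precision


def discrepancy_in_eps(dh_a, dh_b, total_h):
    """dH discrepancy relative to the total H, in units of machine epsilon."""
    return abs(dh_a - dh_b) / abs(total_h) / EPS


# Resume on 1.1.1.1 versus 1.1.1.2 in the AVX2 build quoted above.
rounding_case = discrepancy_in_eps(
    0.96397155719751, 0.963971557532204, 121654.717626298)

# Original report: resume on 1.1.1.2 versus 1.1.1.1.
suspicious_case = discrepancy_in_eps(
    0.441579366772203, 0.579078736453084, 121576.350392332)

print(rounding_case, suspicious_case)
```

The first ratio is on the order of ten machine epsilons, while the second is in the billions, which is why the original report does not look like a rounding effect.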

paboyle commented 4 years ago

I'm using:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Summary of configuration for Grid v0.7.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----- GIT VERSION -------------------------------------
commit: 22cfbdbb
branch: develop
date  : 2020-06-24
----- PLATFORM ----------------------------------------
architecture (build)        : x86_64
os (build)                  : darwin18.7.0
architecture (target)       : x86_64
os (target)                 : darwin18.7.0
compiler vendor             : clang
compiler version            : 4.2.1
----- BUILD OPTIONS -----------------------------------
SIMD                        : AVX2
Threading                   : yes
Acceleration                : none
Unified virtual memory      : no
Communications type         : mpi3
Shared memory allocator     : shmopen
Shared memory mmap path     : /var/lib/hugetlbfs/global/pagesize-2MB/
Default precision           : double
Software FP16 conversion    : yes
RNG choice                  : sitmo
GMP                         : yes
LAPACK                      : no
FFTW                        : yes
LIME (ILDG support)         : yes
HDF5                        : no
build DOXYGEN documentation : no
----- BUILD FLAGS -------------------------------------
CXXFLAGS:
    -I/Users/peterboyle/QCD/SYCL/Grid
    -mavx2
    -mfma
    -mf16c
    -I/Users/peterboyle/QCD/SciDAC/install//include
    -O3
    -I/opt/local/include/
    -std=c++11
    -fno-strict-aliasing
LDFLAGS:
    -L/Users/peterboyle/QCD/SYCL/Grid/build/Grid
    -L/Users/peterboyle/QCD/SciDAC/install//lib
    -L/opt/local/lib/
LIBS:
    -lz
    -lcrypto
    -llime
    -lfftw3f
    -lfftw3
    -lmpfr
    -lgmp
    -lstdc++
    -lm
    -lz
-------------------------------------------------------

paboyle commented 4 years ago

CXXFLAGS=-I/opt/local/include/ LDFLAGS=-L/opt/local/lib/ CXX=mpicxx-openmpi-mp  ../configure --enable-simd=AVX2 --enable-precision=double --enable-comms=mpi --with-lime=/Users/peterboyle/QCD/SciDAC/install/ --enable-openmp --enable-unified=no

paboyle commented 4 years ago

hmm...

TEST=./Test_rhmc_WilsonRatio
mpirun-openmpi-mp -np 1 ./$TEST --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1 > 1.1.1.1.log.wil
mpirun-openmpi-mp -np 1 ./$TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.2.1.1.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.2.1.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.2.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.2.log.wil
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.1.log.wil 
Grid : Message : 155.136577 s : Total H after trajectory  = 191225.662272466  dH = 0.154558169306256
Grid : Message : 298.859290 s : Total H after trajectory  = 200901.028036973  dH = 0.0963503597886302
Grid : Message : 448.198120 s : Total H after trajectory  = 206700.202254245  dH = 0.0631527512450702
Grid : Message : 608.196569 s : Total H after trajectory  = 212083.375124674  dH = 0.0513949406449683
Grid : Message : 788.169846 s : Total H after trajectory  = 215757.991007906  dH = 0.0334666226117406
Grid : Message : 982.398185 s : Total H after trajectory  = 217627.720257285  dH = 0.0172267645248212
Peters-Laptop:hmc peterboyle$ grep dH resume.1.1.1.1.log.wil
Grid : Message : 212.929184 s : Total H after trajectory  = 217627.720257285  dH = 0.0172267645248212
Peters-Laptop:hmc peterboyle$ grep dH resume.2.1.1.1.log.wil
Grid : Message : 155.386942 s : Total H after trajectory  = 217627.720268508  dH = 0.0172380442090798
Peters-Laptop:hmc peterboyle$ grep dH resume.1.2.1.1.log.wil
Grid : Message : 176.400248 s : Total H after trajectory  = 217627.720264105  dH = 0.0172336291288957
Peters-Laptop:hmc peterboyle$ grep dH resume.1.1.2.1.log.wil
Grid : Message : 172.912406 s : Total H after trajectory  = 217627.720255398  dH = 0.0172248718445189
Peters-Laptop:hmc peterboyle$ grep dH resume.1.1.1.2.log.wil
Grid : Message : 176.957012 s : Total H after trajectory  = 217627.720223433  dH = 0.0171929864445701

Now testing threading options rather than MPI. There is a possibility that running two MPI ranks on a single node behaves differently from running on two nodes, because of the use of shared memory, though that should not matter for the pure gauge HMC.

paboyle commented 4 years ago

And with threading and MPI

resume.1.1.1.2.log.wil:Grid : Message : 149.382506 s : Total H after trajectory  = 217627.720220713  dH = 0.0171900809800718
resume.1.1.2.1.log.wil:Grid : Message : 133.875077 s : Total H after trajectory  = 217627.720222004  dH = 0.0171915034006815
resume.1.2.1.1.log.wil:Grid : Message : 136.256407 s : Total H after trajectory  = 217627.720222616  dH = 0.0171920914726797
resume.2.1.1.1.log.wil:Grid : Message : 139.131256 s : Total H after trajectory  = 217627.720235348  dH = 0.0172048872336745

and a second time showing the threading is reproducible.

resume.1.1.1.2.log.wil:Grid : Message : 151.615792 s : Total H after trajectory  = 217627.720220713  dH = 0.0171900809800718
resume.1.1.2.1.log.wil:Grid : Message : 144.173763 s : Total H after trajectory  = 217627.720222004  dH = 0.0171915034006815
resume.1.2.1.1.log.wil:Grid : Message : 143.736783 s : Total H after trajectory  = 217627.720222616  dH = 0.0171920914726797

paboyle commented 4 years ago

I haven't tried AVX512 yet, I guess that's next step.

i-kanamori commented 4 years ago

with AVX2, I confirmed there is no problem:

cat run4.sh
TEST=../../Test_hmc_WilsonGauge

#mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1
mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1

grep "dH " run4.log 
Grid : Message : 1.281548 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366859514
Grid : Message : 1.270968 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366684891
Grid : Message : 1.272255 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366699443
Grid : Message : 1.267248 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366743099
Grid : Message : 1.290701 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366713995

paboyle commented 4 years ago

Thanks Issaku, that's worrying. I'll take a day or two to get to AVX512 testing.

i-kanamori commented 3 years ago

Hi,

It turned out that the proper reorderings of the random number generators before and after IO were missing. The difference is caused by a shuffling of the random number generators among lattice sites, so the physics results are fine. The fix can break backward compatibility, but it improves reproducibility, since one does not need to know the MPI division used during the HMC run.

I will send a pull request for the fix.
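The kind of shuffling described above can be illustrated with a toy model. This is a sketch under assumed conventions, not Grid's actual IO code: a 2D lattice, row-major global site indices, and hypothetical helpers `rank_major_order`, `write_naive` and `read_naive`. If per-site RNG states are written rank by rank without reordering, a reader using a different decomposition assigns them to different sites.

```python
from itertools import product


def global_index(coord, dims):
    """Row-major (lexicographic) index of a global site coordinate."""
    idx = 0
    for c, d in zip(coord, dims):
        idx = idx * d + c
    return idx


def rank_major_order(dims, procs):
    """Global site indices listed rank by rank, local sites in local order."""
    local = [d // p for d, p in zip(dims, procs)]
    order = []
    for rank in product(*(range(p) for p in procs)):
        for x in product(*(range(n) for n in local)):
            coord = [r * n + xi for r, n, xi in zip(rank, local, x)]
            order.append(global_index(coord, dims))
    return order


def write_naive(states, dims, procs):
    """Dump states in rank-major order, with no reordering before IO."""
    return [states[g] for g in rank_major_order(dims, procs)]


def read_naive(stream, dims, procs):
    """Assign the stream to sites in rank-major order, with no reordering."""
    states = [None] * len(stream)
    for s, g in zip(stream, rank_major_order(dims, procs)):
        states[g] = s
    return states


dims = (4, 4)
states = ["rng%02d" % g for g in range(16)]   # one RNG state per site
stream = write_naive(states, dims, (1, 2))    # checkpoint written on 2 ranks

same_layout = read_naive(stream, dims, (1, 2))    # same division: recovered
other_layout = read_naive(stream, dims, (1, 1))   # other division: shuffled
print(same_layout == states, other_layout == states)
```

In this toy model the shuffled result is still a permutation of the same set of states. Writing the stream in global site order instead would make the file independent of the division.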

TEST=../../Test_hmc_WilsonGauge

mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1 >& run_hotstart_1x1x1x1.log
mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x1x1x1.log
mpirun -np 2 $TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_2x1x1x1.log
mpirun -np 2 $TEST --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x2x1x1.log
mpirun -np 2 $TEST --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x1x2x1.log
mpirun -np 2 $TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x1x1x2.log

grep "dH " run_*log
run_chkpointstart_1x1x1x1.log:Grid : Message : 1.209982 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366743099
run_chkpointstart_1x1x1x2.log:Grid : Message : 1.268790 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366772203
run_chkpointstart_1x1x2x1.log:Grid : Message : 1.283531 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366757651
run_chkpointstart_1x2x1x1.log:Grid : Message : 1.233814 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366757651
run_chkpointstart_2x1x1x1.log:Grid : Message : 1.228802 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366757651
run_hotstart_1x1x1x1.log:Grid : Message : 1.202772 s : Total H after trajectory  = 93323.2155773858  dH = 4.92999696303741
run_hotstart_1x1x1x1.log:Grid : Message : 2.979500 s : Total H after trajectory  = 103474.778089801  dH = 2.47973175930383
run_hotstart_1x1x1x1.log:Grid : Message : 2.814675 s : Total H after trajectory  = 110234.38233576  dH = 2.20764440238418
run_hotstart_1x1x1x1.log:Grid : Message : 3.615042 s : Total H after trajectory  = 114866.81585045  dH = 1.3867998868227
run_hotstart_1x1x1x1.log:Grid : Message : 4.420590 s : Total H after trajectory  = 118805.794947752  dH = 1.38852675918315
run_hotstart_1x1x1x1.log:Grid : Message : 7.374452 s : Total H after trajectory  = 121576.350392332  dH = 0.441579366743099
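
The grep-based comparison used throughout this thread could also be scripted. The sketch below parses the "Total H after trajectory" lines and reports the spread of dH across runs. The regular expression and the helpers `parse_dh` and `dh_spread` are assumptions made for this example, and the sample lines are copied from the logs above.

```python
import re

# Matches e.g. "... Total H after trajectory  = 121576.35  dH = 0.44 *"
DH_LINE = re.compile(r"Total H after trajectory\s*=\s*(\S+)\s+dH\s*=\s*(\S+)")


def parse_dh(lines):
    """Return a (total_H, dH) pair for every matching log line."""
    pairs = []
    for line in lines:
        match = DH_LINE.search(line)
        if match:
            pairs.append((float(match.group(1)), float(match.group(2))))
    return pairs


def dh_spread(lines):
    """Largest minus smallest dH among the matching lines."""
    dhs = [dh for _, dh in parse_dh(lines)]
    return max(dhs) - min(dhs)


before_fix = [
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366772203 *",
    "Total H after trajectory  = 121576.487891702  dH = 0.579078736453084 *",
    "Total H after trajectory  = 121576.487891702  dH = 0.57907873649674  *",
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366772203 *",
]
after_fix = [
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366743099",
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366772203",
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366757651",
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366757651",
    "Total H after trajectory  = 121576.350392332  dH = 0.441579366757651",
]

print(dh_spread(before_fix), dh_spread(after_fix))
```

A spread far above rounding level, as in the first list, flags a dependence on the MPI division that is worth investigating.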