i-kanamori closed 3 years ago
These should differ by rounding, but the deviations look too big for double precision.
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 1 ./Test_hmc_WilsonGauge --mpi 1.1.1.1 > 1.1.1.1.log
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.2.1 > 1.1.2.1.log
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.1.2 > 1.1.1.2.log
Peters-Laptop:hmc peterboyle$ mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 2.1.1.1> 2.1.1.1.log
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.1.log | head -n 5
Grid : Message : 0.864237 s : Total H after trajectory = 93375.2800421747 dH = 4.92854652258393
Grid : Message : 1.708007 s : Total H after trajectory = 102822.095974671 dH = 2.54167550388956
Grid : Message : 2.563212 s : Total H after trajectory = 110324.450679314 dH = 2.06249609014776
Grid : Message : 3.446958 s : Total H after trajectory = 114845.872704475 dH = 1.4033088391152
Grid : Message : 4.341870 s : Total H after trajectory = 118280.461146461 dH = 1.27991357271094
Peters-Laptop:hmc peterboyle$ grep dH 2.1.1.1.log | head -n 5
Grid : Message : 0.704084 s : Total H after trajectory = 93375.280042175 dH = 4.92854652303504
Grid : Message : 1.408333 s : Total H after trajectory = 102822.095974671 dH = 2.5416755034239
Grid : Message : 2.225654 s : Total H after trajectory = 110324.450679314 dH = 2.06249608984217
Grid : Message : 3.188162 s : Total H after trajectory = 114845.872704475 dH = 1.40330883936258
Grid : Message : 4.109030 s : Total H after trajectory = 118280.46114646 dH = 1.27991357256542
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.2.log | head -n 5
Grid : Message : 0.890846 s : Total H after trajectory = 93375.280042175 dH = 4.92854652303504
Grid : Message : 1.709296 s : Total H after trajectory = 102822.095974671 dH = 2.54167550349666
Grid : Message : 2.638189 s : Total H after trajectory = 110324.450679314 dH = 2.06249608990038
Grid : Message : 3.449204 s : Total H after trajectory = 114845.872704475 dH = 1.40330883934803
Grid : Message : 4.210936 s : Total H after trajectory = 118280.46114646 dH = 1.27991357253632
Peters-Laptop:hmc peterboyle$ grep dH 1.1.2.1.log | head -n 5
Grid : Message : 0.696674 s : Total H after trajectory = 93375.280042175 dH = 4.92854652300593
Grid : Message : 1.347612 s : Total H after trajectory = 102822.095974671 dH = 2.54167550349666
Grid : Message : 1.983796 s : Total H after trajectory = 110324.450679314 dH = 2.06249608976941
Grid : Message : 2.650094 s : Total H after trajectory = 114845.872704475 dH = 1.40330883936258
Grid : Message : 3.269707 s : Total H after trajectory = 118280.46114646 dH = 1.27991357257997
Now looking into the resume from checkpoint.
Hmm... I patched develop today with a single-precision reduction improvement, but I don't think this affects the HMC, as it is pure double.
#!/bin/sh
TEST=./Test_hmc_WilsonGauge
mpirun-openmpi-mp -np 1 ./Test_hmc_WilsonGauge --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1 > 1.1.1.1.log
mpirun-openmpi-mp -np 1 ./Test_hmc_WilsonGauge --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.2.1.1.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.2.1.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.2.1.log
mpirun-openmpi-mp -np 2 ./Test_hmc_WilsonGauge --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.2.log
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.1.log
Grid : Message : 0.802586 s : Total H after trajectory = 93375.2800421747 dH = 4.92854652258393
Grid : Message : 1.593790 s : Total H after trajectory = 102822.095974671 dH = 2.54167550388956
Grid : Message : 2.512573 s : Total H after trajectory = 110324.450679314 dH = 2.06249609014776
Grid : Message : 3.321864 s : Total H after trajectory = 114845.872704475 dH = 1.4033088391152
Grid : Message : 4.989870 s : Total H after trajectory = 118280.461146461 dH = 1.27991357271094
Grid : Message : 6.919528 s : Total H after trajectory = 121654.717626298 dH = 0.96397155719751
Peters-Laptop:hmc peterboyle$ grep dH resume.* | grep -v Random
resume.1.1.1.1.log:Grid : Message : 0.956635 s : Total H after trajectory = 121654.717626298 dH = 0.96397155719751
resume.1.1.1.2.log:Grid : Message : 0.823563 s : Total H after trajectory = 121654.717626298 dH = 0.963971557532204
resume.1.1.2.1.log:Grid : Message : 0.671931 s : Total H after trajectory = 121654.717626298 dH = 0.9639715575031
resume.1.2.1.1.log:Grid : Message : 0.686496 s : Total H after trajectory = 121654.717626298 dH = 0.963971557532204
resume.2.1.1.1.log:Grid : Message : 0.689530 s : Total H after trajectory = 121654.717626298 dH = 0.9639715575031
This level of difference is normal DP rounding, with identical reproduction on 1.1.1.1. Can you try develop?
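To put a number on "normal DP rounding": dH is a difference of two values of H of order 1e5, so it can only move in steps of the double-precision spacing (ulp) at that magnitude. The following is a minimal sketch, not part of Grid, that expresses the dH deviations from the resume logs above in units of ulp(H). It assumes Python 3.9+ for `math.ulp`; the variable and function names are illustrative.

```python
import math

# Values copied from the resume logs above (pure gauge HMC, trajectory 6).
H = 121654.717626298
dH_ref = 0.96397155719751  # 1.1.1.1: hot start and resume agree exactly
dH_other = {
    "1.1.1.2": 0.963971557532204,
    "1.1.2.1": 0.9639715575031,
    "1.2.1.1": 0.963971557532204,
    "2.1.1.1": 0.9639715575031,
}

ulp_H = math.ulp(H)  # spacing between adjacent doubles near H


def shift_in_ulps(dh):
    """Deviation of dH from the 1.1.1.1 reference, in units of ulp(H)."""
    return (dh - dH_ref) / ulp_H


for layout, dh in dH_other.items():
    print(f"{layout}: {shift_in_ulps(dh):.1f} ulp(H)")
```

The deviations come out at a few tens of ulp(H), which is the scale one would expect from summing the action in a different order on a different MPI layout.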
I'm using:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Summary of configuration for Grid v0.7.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----- GIT VERSION -------------------------------------
commit: 22cfbdbb
branch: develop
date : 2020-06-24
----- PLATFORM ----------------------------------------
architecture (build) : x86_64
os (build) : darwin18.7.0
architecture (target) : x86_64
os (target) : darwin18.7.0
compiler vendor : clang
compiler version : 4.2.1
----- BUILD OPTIONS -----------------------------------
SIMD : AVX2
Threading : yes
Acceleration : none
Unified virtual memory : no
Communications type : mpi3
Shared memory allocator : shmopen
Shared memory mmap path : /var/lib/hugetlbfs/global/pagesize-2MB/
Default precision : double
Software FP16 conversion : yes
RNG choice : sitmo
GMP : yes
LAPACK : no
FFTW : yes
LIME (ILDG support) : yes
HDF5 : no
build DOXYGEN documentation : no
----- BUILD FLAGS -------------------------------------
CXXFLAGS:
-I/Users/peterboyle/QCD/SYCL/Grid
-mavx2
-mfma
-mf16c
-I/Users/peterboyle/QCD/SciDAC/install//include
-O3
-I/opt/local/include/
-std=c++11
-fno-strict-aliasing
LDFLAGS:
-L/Users/peterboyle/QCD/SYCL/Grid/build/Grid
-L/Users/peterboyle/QCD/SciDAC/install//lib
-L/opt/local/lib/
LIBS:
-lz
-lcrypto
-llime
-lfftw3f
-lfftw3
-lmpfr
-lgmp
-lstdc++
-lm
-lz
-------------------------------------------------------
CXXFLAGS=-I/opt/local/include/ LDFLAGS=-L/opt/local/lib/ CXX=mpicxx-openmpi-mp ../configure --enable-simd=AVX2 --enable-precision=double --enable-comms=mpi --with-lime=/Users/peterboyle/QCD/SciDAC/install/ --enable-openmp --enable-unified=no
hmm...
TEST=./Test_rhmc_WilsonRatio
mpirun-openmpi-mp -np 1 ./$TEST --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1 > 1.1.1.1.log.wil
mpirun-openmpi-mp -np 1 ./$TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.2.1.1.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.2.1.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.2.1.log.wil
mpirun-openmpi-mp -np 2 ./$TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 > resume.1.1.1.2.log.wil
Peters-Laptop:hmc peterboyle$ grep dH 1.1.1.1.log.wil
Grid : Message : 155.136577 s : Total H after trajectory = 191225.662272466 dH = 0.154558169306256
Grid : Message : 298.859290 s : Total H after trajectory = 200901.028036973 dH = 0.0963503597886302
Grid : Message : 448.198120 s : Total H after trajectory = 206700.202254245 dH = 0.0631527512450702
Grid : Message : 608.196569 s : Total H after trajectory = 212083.375124674 dH = 0.0513949406449683
Grid : Message : 788.169846 s : Total H after trajectory = 215757.991007906 dH = 0.0334666226117406
Grid : Message : 982.398185 s : Total H after trajectory = 217627.720257285 dH = 0.0172267645248212
Peters-Laptop:hmc peterboyle$ grep dH resume.1.1.1.1.log.wil
Grid : Message : 212.929184 s : Total H after trajectory = 217627.720257285 dH = 0.0172267645248212
Peters-Laptop:hmc peterboyle$ grep dH resume.2.1.1.1.log.wil
Grid : Message : 155.386942 s : Total H after trajectory = 217627.720268508 dH = 0.0172380442090798
Peters-Laptop:hmc peterboyle$ grep dH resume.1.2.1.1.log.wil
Grid : Message : 176.400248 s : Total H after trajectory = 217627.720264105 dH = 0.0172336291288957
Peters-Laptop:hmc peterboyle$ grep dH resume.1.1.2.1.log.wil
Grid : Message : 172.912406 s : Total H after trajectory = 217627.720255398 dH = 0.0172248718445189
Peters-Laptop:hmc peterboyle$ grep dH resume.1.1.1.2.log.wil
Grid : Message : 176.957012 s : Total H after trajectory = 217627.720223433 dH = 0.0171929864445701
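For comparison with the pure gauge case, here is a small sketch that parses the Test_rhmc_WilsonRatio resume lines above and prints each layout's deviation from the 1.1.1.1 run. The regular expression and helper names are mine and assume the log format shown in this thread.

```python
import re

# Log lines copied from the Test_rhmc_WilsonRatio resume runs above.
LOG = {
    "1.1.1.1": "Grid : Message : 212.929184 s : Total H after trajectory = 217627.720257285 dH = 0.0172267645248212",
    "2.1.1.1": "Grid : Message : 155.386942 s : Total H after trajectory = 217627.720268508 dH = 0.0172380442090798",
    "1.2.1.1": "Grid : Message : 176.400248 s : Total H after trajectory = 217627.720264105 dH = 0.0172336291288957",
    "1.1.2.1": "Grid : Message : 172.912406 s : Total H after trajectory = 217627.720255398 dH = 0.0172248718445189",
    "1.1.1.2": "Grid : Message : 176.957012 s : Total H after trajectory = 217627.720223433 dH = 0.0171929864445701",
}

PATTERN = re.compile(r"Total H after trajectory = (\S+)\s+dH = (\S+)")


def parse(line):
    """Extract (H, dH) as floats from one 'Total H after trajectory' line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not an HMC summary line: {line!r}")
    return float(match.group(1)), float(match.group(2))


H_ref, dH_ref = parse(LOG["1.1.1.1"])

deviations = {}
for layout, line in LOG.items():
    H, dH = parse(line)
    deviations[layout] = (H - H_ref, dH - dH_ref)
    print(f"{layout}: H - H_ref = {H - H_ref:+.3e}   dH - dH_ref = {dH - dH_ref:+.3e}")
```

Here the dH deviations are of order 1e-5, whereas the pure gauge runs above differed only at the 1e-10 level.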
Now testing threading options rather than MPI. There is a possibility that running two MPI ranks on a single node behaves differently from running on two nodes, because of the use of shared memory, but that should not matter for the pure gauge HMC.
And with threading and MPI:
resume.1.1.1.2.log.wil:Grid : Message : 149.382506 s : Total H after trajectory = 217627.720220713 dH = 0.0171900809800718
resume.1.1.2.1.log.wil:Grid : Message : 133.875077 s : Total H after trajectory = 217627.720222004 dH = 0.0171915034006815
resume.1.2.1.1.log.wil:Grid : Message : 136.256407 s : Total H after trajectory = 217627.720222616 dH = 0.0171920914726797
resume.2.1.1.1.log.wil:Grid : Message : 139.131256 s : Total H after trajectory = 217627.720235348 dH = 0.0172048872336745
And a second time, showing that the threading is reproducible:
resume.1.1.1.2.log.wil:Grid : Message : 151.615792 s : Total H after trajectory = 217627.720220713 dH = 0.0171900809800718
resume.1.1.2.1.log.wil:Grid : Message : 144.173763 s : Total H after trajectory = 217627.720222004 dH = 0.0171915034006815
resume.1.2.1.1.log.wil:Grid : Message : 143.736783 s : Total H after trajectory = 217627.720222616 dH = 0.0171920914726797
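The run-to-run comparison can be automated by dropping the wall-clock field and comparing what is left. This is a sketch with illustrative helper names. It only checks agreement to the digits printed in the log, not bitwise identity of the underlying doubles.

```python
import re

# The three layouts present in both threaded runs above.
RUN_A = """
resume.1.1.1.2.log.wil:Grid : Message : 149.382506 s : Total H after trajectory = 217627.720220713 dH = 0.0171900809800718
resume.1.1.2.1.log.wil:Grid : Message : 133.875077 s : Total H after trajectory = 217627.720222004 dH = 0.0171915034006815
resume.1.2.1.1.log.wil:Grid : Message : 136.256407 s : Total H after trajectory = 217627.720222616 dH = 0.0171920914726797
"""

RUN_B = """
resume.1.1.1.2.log.wil:Grid : Message : 151.615792 s : Total H after trajectory = 217627.720220713 dH = 0.0171900809800718
resume.1.1.2.1.log.wil:Grid : Message : 144.173763 s : Total H after trajectory = 217627.720222004 dH = 0.0171915034006815
resume.1.2.1.1.log.wil:Grid : Message : 143.736783 s : Total H after trajectory = 217627.720222616 dH = 0.0171920914726797
"""


def strip_timing(line):
    """Remove the elapsed-seconds field, which varies from run to run."""
    return re.sub(r"Message : \d+\.\d+ s :", "Message :", line.strip())


def same_results(run_a, run_b):
    """True if both runs print identical H and dH for every log file."""
    lines_a = [strip_timing(l) for l in run_a.strip().splitlines()]
    lines_b = [strip_timing(l) for l in run_b.strip().splitlines()]
    return lines_a == lines_b


print(same_results(RUN_A, RUN_B))
```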
I haven't tried AVX512 yet; I guess that's the next step.
With AVX2, I confirmed there is no problem:
cat run4.sh
TEST=../../Test_hmc_WilsonGauge
#mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1
mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
mpirun -np 2 $TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1
grep "dH " run4.log
Grid : Message : 1.281548 s : Total H after trajectory = 121576.350392332 dH = 0.441579366859514
Grid : Message : 1.270968 s : Total H after trajectory = 121576.350392332 dH = 0.441579366684891
Grid : Message : 1.272255 s : Total H after trajectory = 121576.350392332 dH = 0.441579366699443
Grid : Message : 1.267248 s : Total H after trajectory = 121576.350392332 dH = 0.441579366743099
Grid : Message : 1.290701 s : Total H after trajectory = 121576.350392332 dH = 0.441579366713995
Thanks Issaku, that's worrying. I'll take a day or two to get to AVX512 testing.
Hi,
It turned out that the proper reordering of the random number generators before and after IO is missing. The difference is caused by the random number generators being shuffled among lattice sites, so the physics results are fine. The fix can break backward compatibility, but it improves reproducibility, as one does not have to know the MPI division used during the HMC run.
I will send a pull request for the fix.
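As a toy illustration of why a missing reordering only shuffles generators among sites, consider the sketch below. It is not Grid's actual data layout or checkpoint format: the 4x2 lattice, the seed list standing in for saved RNG states, and all function names are hypothetical.

```python
import random

LX, LY = 4, 2  # toy 2D lattice; global site index g = x + LX * y


def rank_major_order(px):
    """Global site indices listed rank by rank, for px ranks splitting x."""
    lx = LX // px
    return [x + LX * y
            for r in range(px)
            for y in range(LY)
            for x in range(r * lx, (r + 1) * lx)]


# Hypothetical checkpoint: one RNG seed per site, stored in global order.
saved = [1000 + g for g in range(LX * LY)]


def restore(px, reorder):
    """Return the seed each global site ends up with after reading `saved`."""
    seeds = [None] * (LX * LY)
    for k, g in enumerate(rank_major_order(px)):
        # With reordering, site g reads its own entry; without it, the site
        # takes whatever sits at its rank-major position k in the file.
        seeds[g] = saved[g] if reorder else saved[k]
    return seeds


def draw(seeds):
    """One random number per site, from that site's own generator."""
    return [random.Random(s).random() for s in seeds]


one_rank = draw(restore(px=1, reorder=False))
two_ranks_raw = draw(restore(px=2, reorder=False))
two_ranks_fixed = draw(restore(px=2, reorder=True))

print(two_ranks_raw == one_rank)                  # does the layout change the field?
print(sorted(two_ranks_raw) == sorted(one_rank))  # same numbers, different sites?
print(two_ranks_fixed == one_rank)                # does reordering remove the dependence?
```

In this toy model the unreordered two-rank read gives a field that differs site by site but contains exactly the same set of random numbers, while the reordered read reproduces the one-rank field.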
TEST=../../Test_hmc_WilsonGauge
mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType HotStart --Thermalizations 5 --Trajectories 1 >& run_hotstart_1x1x1x1.log
mpirun -np 1 $TEST --mpi 1.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x1x1x1.log
mpirun -np 2 $TEST --mpi 2.1.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_2x1x1x1.log
mpirun -np 2 $TEST --mpi 1.2.1.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x2x1x1.log
mpirun -np 2 $TEST --mpi 1.1.2.1 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x1x2x1.log
mpirun -np 2 $TEST --mpi 1.1.1.2 --StartingType CheckpointStart --StartingTrajectory 5 --Thermalizations 0 --Trajectories 1 >& run_chkpointstart_1x1x1x2.log
grep "dH " run_*log
run_chkpointstart_1x1x1x1.log:Grid : Message : 1.209982 s : Total H after trajectory = 121576.350392332 dH = 0.441579366743099
run_chkpointstart_1x1x1x2.log:Grid : Message : 1.268790 s : Total H after trajectory = 121576.350392332 dH = 0.441579366772203
run_chkpointstart_1x1x2x1.log:Grid : Message : 1.283531 s : Total H after trajectory = 121576.350392332 dH = 0.441579366757651
run_chkpointstart_1x2x1x1.log:Grid : Message : 1.233814 s : Total H after trajectory = 121576.350392332 dH = 0.441579366757651
run_chkpointstart_2x1x1x1.log:Grid : Message : 1.228802 s : Total H after trajectory = 121576.350392332 dH = 0.441579366757651
run_hotstart_1x1x1x1.log:Grid : Message : 1.202772 s : Total H after trajectory = 93323.2155773858 dH = 4.92999696303741
run_hotstart_1x1x1x1.log:Grid : Message : 2.979500 s : Total H after trajectory = 103474.778089801 dH = 2.47973175930383
run_hotstart_1x1x1x1.log:Grid : Message : 2.814675 s : Total H after trajectory = 110234.38233576 dH = 2.20764440238418
run_hotstart_1x1x1x1.log:Grid : Message : 3.615042 s : Total H after trajectory = 114866.81585045 dH = 1.3867998868227
run_hotstart_1x1x1x1.log:Grid : Message : 4.420590 s : Total H after trajectory = 118805.794947752 dH = 1.38852675918315
run_hotstart_1x1x1x1.log:Grid : Message : 7.374452 s : Total H after trajectory = 121576.350392332 dH = 0.441579366743099
Hello,
I found that HMC results change with MPI division, which I believe should not happen. In tests/hmc, the following script
gives (after grepping with "dH", added * by hand)
The last 4 lines started from the same configuration and the same random numbers, but the dH values differ. I tried several Test[r]hmc programs, and they all showed the same behavior as above. Some of them started to differ at the initial pseudofermion action; others showed no difference in the initial action.
environment I tried:
run_Test_hmc_WilsonGauge.tar.gz grid.configure.summary.txt