Closed: fionnoh closed this issue 3 years ago.
Thanks Fionn, taking a look to see what I can see on Juelich.
Rats... an initial look (clean build, verbose on) running
tests/core/Test_where.cc
passed for me.
module purge; ml GCC/9.3.0 OpenMPI/4.1.0rc1 mpi-settings/CUDA
export OMPI_MCA_btl=^uct,openib
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_SCHEME=put_zcopy
export OMP_NUM_THREADS=12
../configure \
  --disable-unified \
  --enable-accelerator=cuda \
  --enable-alloc-align=4k \
  --enable-accelerator-cshift \
  --enable-shm=nvlink \
  --enable-comms=mpi-auto \
  --disable-comms-threads \
  --enable-gen-simd-width=64 \
  --disable-gparity \
  --disable-fermion-reps \
  --enable-simd=GPU \
  MPICXX=mpicxx \
  CXX=nvcc \
  CXXFLAGS="-ccbin g++ -gencode arch=compute_80,code=sm_80 -std=c++14 --cudart shared -lineinfo" \
  LDFLAGS=" --cudart shared" \
  LIBS="-lrt -lmpi "
You are missing the --cudart shared flag.
This is CRITICAL to correct operation on Juelich. I don't think it is the cause of this issue, but please use my configure line and see what happens with your Test_where test?
Thanks !
Ok, I'll try this out. Thanks Peter!
Unfortunately this additional flag didn't solve the issue on my end. Running tests/core/Test_where also passes for me, but that test exercises the where function on a LatticeComplexD, which is different from what SeqConservedCurrent uses.
I've hacked Test_where a bit, extending it to look at a LatticeFermion (for sanity) and a LatticePropagator (as in SeqConservedCurrent):
unsigned int tmin = 3;
int Ls = 2;
GridCartesian * UGrid = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
GridCartesian * FGrid = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
std::vector<int> seeds4({1,2,3,4});
GridParallelRNG RNG4(UGrid); RNG4.SeedFixedIntegers(seeds4);
LatticeInteger lcoor(UGrid); LatticeCoordinate(lcoor,Nd-1);
std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
std::cout<<GridLogMessage<<"== LatticeFermion =="<<std::endl;
std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
LatticeFermion q_outF(FGrid); q_outF=0.0;
LatticeFermion tmpF(UGrid); random(RNG4,tmpF);
LatticeFermion tmp2F(UGrid);
LatticeFermion ZZF (UGrid); ZZF=0.0;
RealD nA=0.0;
RealD nB=0.0;
for(int s=0;s<Ls;s++){
  nB = nB + norm2(tmpF);
  tmp2F = where((lcoor>=tmin),tmpF,ZZF);
  nA = nA + norm2(tmp2F);
  InsertSlice(tmp2F, q_outF, s, 0);
}
RealD nQO=norm2(q_outF);
std::cout <<GridLogMessage << "norm_before_where: " << nB << std::endl;
std::cout <<GridLogMessage << "norm_after_where: " << nA << std::endl;
std::cout <<GridLogMessage << "norm_q_out: " << nQO << std::endl;
std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
std::cout<<GridLogMessage<<"== LatticePropagator =="<<std::endl;
std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
LatticePropagator q_outP(FGrid); q_outP=0.0;
LatticePropagator tmpP(UGrid); random(RNG4,tmpP);
LatticePropagator tmp2P(UGrid);
LatticePropagator ZZP (UGrid); ZZP=0.0;
nA=0.0;
nB=0.0;
for(int s=0;s<Ls;s++){
  nB = nB + norm2(tmpP);
  tmp2P = where((lcoor>=tmin),tmpP,ZZP);
  nA = nA + norm2(tmp2P);
  InsertSlice(tmp2P, q_outP, s, 0);
}
nQO=norm2(q_outP);
std::cout <<GridLogMessage << "norm_before_where: " << nB << std::endl;
std::cout <<GridLogMessage << "norm_after_where: " << nA << std::endl;
std::cout <<GridLogMessage << "norm_q_out: " << nQO << std::endl;
The output from Tesseract is:
Grid : Message : 0.331286 s : ==============================================================
Grid : Message : 0.331299 s : == LatticeFermion ==
Grid : Message : 0.331303 s : ==============================================================
Grid : Message : 0.340348 s : norm_before_where: 65623.5
Grid : Message : 0.340371 s : norm_after_where: 41026.5
Grid : Message : 0.340387 s : norm_q_out: 41026.5
Grid : Message : 0.340400 s : ==============================================================
Grid : Message : 0.340404 s : == LatticePropagator ==
Grid : Message : 0.340408 s : ==============================================================
Grid : Message : 0.418297 s : norm_before_where: 786447
Grid : Message : 0.418334 s : norm_after_where: 491296
Grid : Message : 0.418340 s : norm_q_out: 491296
The output from Juelich is:
Grid : Message : 0.637264 s : ==============================================================
Grid : Message : 0.637268 s : == LatticeFermion ==
Grid : Message : 0.637271 s : ==============================================================
Grid : Message : 0.649588 s : norm_before_where: 65623.5
Grid : Message : 0.649598 s : norm_after_where: 41026.5
Grid : Message : 0.649605 s : norm_q_out: 41026.5
Grid : Message : 0.649610 s : ==============================================================
Grid : Message : 0.649614 s : == LatticePropagator ==
Grid : Message : 0.649617 s : ==============================================================
Grid : Message : 0.697429 s : norm_before_where: 786447
Grid : Message : 0.697460 s : norm_after_where: 0
Grid : Message : 0.697464 s : norm_q_out: 0
Can you either submit your Test_where.cc as a pull request or attach it or just email it to me please?
Sure, I've sent it as an email there.
Oh joy !
I merely added some printf statements to the predicatedWhere internal function, and the error disappeared. Seems like a classic compiler-bug-flavoured Heisenbug: observing it brought the cat back to life.
Cuda 10, V100 works.
Cuda 11, A100 breaks.
Cuda 11, A100 with printf works.
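For illustration only (this is not Grid's predicatedWhere, just a hypothetical standalone kernel with the same select-by-predicate shape), this is the kind of code where adding a device-side printf can perturb code generation enough to mask a miscompile:
// Hypothetical repro shape, not Grid code: a predicated select in a CUDA kernel.
__global__ void predicated_select(const int *pred, const double *a, const double *b,
                                  double *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // printf("site %d pred %d\n", i, pred[i]);   // uncommenting this hid the bug
    out[i] = pred[i] ? a[i] : b[i];
  }
}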
Shit.
Whoops - I'm in public and the entire internet can see my bad language. Should have chosen a MUCH stronger swear word. :)
Cuda 11.2, A100 works at Juelich.
module purge
export OMPI_MCA_btl=^uct,openib
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_SCHEME=put_zcopy
export OMP_NUM_THREADS=12
nvhome=/p/software/juwelsbooster/stages/2020/software/NVHPC/21.1-GCC-9.3.0/
target=Linux_x86_64
version=21.1
nvcudadir=$nvhome/$target/$version/cuda/11.2
nvcompdir=$nvhome/$target/$version/compilers
nvmathdir=$nvhome/$target/$version/math_libs
nvcommdir=$nvhome/$target/$version/comm_libs
export NVHPC=$nvhome
export CPP=cpp
#export PATH=$nvcompdir/bin:$PATH
export PATH=$nvcommdir/mpi/bin:$PATH
export PATH=$nvcudadir/bin:$PATH
echo PATH: $PATH
export LD_LIBRARY_PATH=$nvcudadir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcompdir/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvmathdir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/mpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nccl/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nvshmem/lib:$LD_LIBRARY_PATH
echo LD_LIBRARY_PATH: $LD_LIBRARY_PATH
export MANPATH=$nvcompdir/man:$MANPATH
module load GCC/9.3.0 OpenMPI/4.1.0rc1 mpi-settings/CUDA
Life on the bleeding edge... huh.
I updated "Test_where_extended.cc" to make it more like the original: 3d volume, summing slices one at a time and cross-referencing against the total norm. Slicing x, then y, then z, and doing it for Complex, Fermion and Propagator.
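A minimal sketch of that cross-check, reconstructed from the log below rather than taken from the actual Test_where_extended.cc (grid and RNG names follow the earlier snippet):
// Reconstruction of the slice-norm cross-check, not the actual test source.
LatticePropagator q(UGrid);      random(RNG4, q);
LatticePropagator sliced(UGrid);
LatticePropagator ZZ(UGrid);     ZZ = 0.0;
RealD nn = norm2(q);                                   // total norm
for (int mu = 0; mu < 3; mu++) {                       // slice along x, then y, then z
  LatticeInteger coor(UGrid); LatticeCoordinate(coor, mu);
  RealD ns = 0.0;
  for (int t = 0; t < UGrid->GlobalDimensions()[mu]; t++) {
    sliced = where((coor == Integer(t)), q, ZZ);       // keep one slice, zero the rest
    RealD n = norm2(sliced);
    std::cout << GridLogMessage << "slice " << t << " " << n << std::endl;
    ns = ns + n;
  }
  std::cout << GridLogMessage << "sliceNorm" << mu << " " << nn << " " << ns
            << " err " << nn - ns << std::endl;
  assert(abs(nn - ns) < 1.0e-10);                      // slice norms must sum to the total
}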
It still fails on Cuda 11.0:
Grid : Message : 0.860909 s : ==============================================================
Grid : Message : 0.860910 s : == LatticePropagator ==
Grid : Message : 0.860911 s : ==============================================================
Grid : Message : 0.871288 s : slice 0 0
Grid : Message : 0.872262 s : slice 1 0
Grid : Message : 0.873247 s : slice 2 0
Grid : Message : 0.874216 s : slice 3 0
Grid : Message : 0.875193 s : slice 4 0
Grid : Message : 0.876151 s : slice 5 0
Grid : Message : 0.877129 s : slice 6 0
Grid : Message : 0.878093 s : slice 7 0
Grid : Message : 0.879065 s : slice 8 0
Grid : Message : 0.880019 s : slice 9 0
Grid : Message : 0.881004 s : slice 10 0
Grid : Message : 0.881968 s : slice 11 0
Grid : Message : 0.882924 s : slice 12 0
Grid : Message : 0.883864 s : slice 13 0
Grid : Message : 0.884826 s : slice 14 0
Grid : Message : 0.885780 s : slice 15 0
Grid : Message : 0.886511 s : sliceNorm0 73462.4 0 err 73462.4
Test_where_extended: ../../../tests/core/Test_where_extended.cc:138: int main(int, char**): Assertion `abs(nn-ns) < 1.0e-10' failed.
[jwb0861:25427] *** Process received signal ***
[jwb0861:25427] Signal: Aborted (6)
[jwb0861:25427] Signal code: (-6)
And passes under Cuda 11.2:
Grid : Message : 0.670630 s : ==============================================================
Grid : Message : 0.670631 s : == LatticePropagator ==
Grid : Message : 0.670632 s : ==============================================================
Grid : Message : 0.737622 s : slice 0 4494.2
Grid : Message : 0.749601 s : slice 1 4627.66
Grid : Message : 0.760626 s : slice 2 4605.53
Grid : Message : 0.771662 s : slice 3 4548.52
Grid : Message : 0.782125 s : slice 4 4591.31
Grid : Message : 0.793046 s : slice 5 4681.65
Grid : Message : 0.804003 s : slice 6 4616.34
Grid : Message : 0.814936 s : slice 7 4464.44
Grid : Message : 0.825359 s : slice 8 4723.06
Grid : Message : 0.836257 s : slice 9 4636.43
Grid : Message : 0.847341 s : slice 10 4626.75
Grid : Message : 0.857684 s : slice 11 4651
Grid : Message : 0.868136 s : slice 12 4403.01
Grid : Message : 0.879062 s : slice 13 4610.81
Grid : Message : 0.890084 s : slice 14 4758.93
Grid : Message : 0.901015 s : slice 15 4422.76
Grid : Message : 0.910689 s : sliceNorm0 73462.4 73462.4 err 0
Grid : Message : 0.911143 s : slice 0 18618.9
Grid : Message : 0.922088 s : slice 1 18262
Grid : Message : 0.932616 s : slice 2 18318.4
Grid : Message : 0.943661 s : slice 3 18263.2
Grid : Message : 0.953739 s : sliceNorm1 73462.4 73462.4 err 0
Grid : Message : 0.954187 s : slice 0 18380.5
Grid : Message : 0.965150 s : slice 1 18639.6
Grid : Message : 0.975635 s : slice 2 18217.1
Grid : Message : 0.986626 s : slice 3 18225.2
Grid : Message : 0.996678 s : sliceNorm2 73462.4 73462.4 err 0
Updated setup script.
module purge
module load GCC/9.3.0
export OMPI_MCA_btl=^uct,openib
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_SCHEME=put_zcopy
export OMP_NUM_THREADS=12
nvhome=/p/software/juwelsbooster/stages/2020/software/NVHPC/21.1-GCC-9.3.0/
target=Linux_x86_64
version=21.1
nvcudadir=$nvhome/$target/$version/cuda/11.2
nvcompdir=$nvhome/$target/$version/compilers
nvmathdir=$nvhome/$target/$version/math_libs
nvcommdir=$nvhome/$target/$version/comm_libs
export NVHPC=$nvhome
export CPP=cpp
export LD_LIBRARY_PATH=$nvcudadir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcompdir/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvmathdir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/mpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nccl/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nvshmem/lib:$LD_LIBRARY_PATH
echo LD_LIBRARY_PATH: $LD_LIBRARY_PATH
module load OpenMPI/4.1.0rc1 mpi-settings/CUDA
export PATH=$nvcommdir/mpi/bin:$PATH
export PATH=$nvcudadir/bin:$PATH
echo PATH: $PATH
export MANPATH=$nvcompdir/man:$MANPATH
Cuda 11.1 is broken too.
So Cuda 11.2 is the minimum version for Grid on the A100.
I added an #error macro in CompilerCompatible.h to prevent compilation with Cuda 11.0 and 11.1. We can't trust these compiler versions.
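A guard of roughly this shape is enough (a sketch using nvcc's __CUDACC_VER_MAJOR__ / __CUDACC_VER_MINOR__ macros; the exact text in CompilerCompatible.h may differ):
// Sketch of the compile-time guard; the wording in CompilerCompatible.h may differ.
#if defined(__CUDACC__) && (__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ < 2)
#error "Cuda 11.0 and 11.1 miscompile Grid's where() on A100; please use Cuda 11.2 or later"
#endif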
Closing: this is a compiler error, and it is fixed in later compiler versions.
The where calls in the SeqConservedCurrent function of the Cayley 5D implementation zero the sources when running on Juelich.
https://github.com/paboyle/Grid/blob/develop/Grid/qcd/action/fermion/implementation/CayleyFermion5DImplementation.h#L911
The function works as expected on Tesseract, using Cuda 10.1 and GCC 7.3 with the standard environment setup.
Grid was compiled with Cuda 11.0 and GCC 9.3 on Juelich, with configure options:
../configure --enable-comms=mpi --enable-simd=GPU --enable-accelerator=cuda \
  --prefix=/p/home/jusers/ohogain1/jureca/dev/lattice --enable-gparity=no \
  CXX=nvcc LDFLAGS=-L/p/home/jusers/ohogain1/jureca/dev/lattice/lib/ \
  CXXFLAGS="-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -I/p/home/jusers/ohogain1/jureca/dev/lattice/include/ -std=c++14"
See config.log. I haven't been able to run with Cuda 11.0 on Tesseract; as Michael suggested, it does seem like there are driver issues that prevent us from doing so.
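For context, the pattern at the linked line is essentially a time-window mask built with where; the following is a paraphrase of the idea, not the exact Grid source at that line:
// Paraphrase of the masking idea behind SeqConservedCurrent, not the code at line 911.
LatticePropagator src(UGrid);   random(RNG4, src);
LatticePropagator zz(UGrid);    zz = 0.0;
LatticeInteger tcoor(UGrid);    LatticeCoordinate(tcoor, Nd-1);   // time coordinate
unsigned int tmin = 3;
// keep the source only on time slices t >= tmin, zero it elsewhere;
// on Cuda 11.0/11.1 with an A100 the result came back zero everywhere
src = where((tcoor >= tmin), src, zz);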