paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

SeqConservedCurrent giving a zero source on Juelich #346

Closed fionnoh closed 3 years ago

fionnoh commented 3 years ago

The where functions in the SeqConservedCurrent function for the Cayley 5D implementation zero the sources when running on Juelich.

https://github.com/paboyle/Grid/blob/develop/Grid/qcd/action/fermion/implementation/CayleyFermion5DImplementation.h#L911
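
For reference, the pattern around the linked line is roughly the following (a simplified sketch with assumed names; see the linked source for the actual code):

  // Sketch only: the source is restricted to the time window [tmin, tmax] with where().
  // The names here (UGrid, Tdir, tmin, tmax, tmp, zz) are illustrative, not the exact
  // ones in CayleyFermion5DImplementation.h.
  LatticeInteger lcoor(UGrid);
  LatticeCoordinate(lcoor, Tdir);              // time coordinate at each 4d site
  LatticePropagator zz(UGrid); zz = 0.0;       // zero field used as the "else" branch
  tmp = where((lcoor >= tmin), tmp, zz);       // zero everything before the window
  tmp = where((lcoor <= tmax), tmp, zz);       // zero everything after the window
  // If where() misbehaves, both branches effectively return zz and the source vanishes.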

The function works as expected on Tesseract, using Cuda 10.1 and GCC 7.3, with the standard environment setup.

Grid was compiled with Cuda 11.0, GCC 9.3 on Juelich, with configure options: ../configure --enable-comms=mpi --enable-simd=GPU --enable-accelerator=cuda --prefix=/p/home/jusers/ohogain1/jureca/dev/lattice --enable-gparity=no CXX=nvcc LDFLAGS=-L/p/home/jusers/ohogain1/jureca/dev/lattice/lib/ CXXFLAGS=-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -I/p/home/jusers/ohogain1/jureca/dev/lattice/include/ -std=c++14
See config.log

I haven't been able to run with Cuda 11.0 on Tesseract; as Michael suggested, it does seem like there are driver issues that prevent us from doing so.

paboyle commented 3 years ago

Thanks Fionn, taking a look to see what I can see on Juelich.

paboyle commented 3 years ago

Rats... an initial look, running (with verbose turned on)

tests/core/Test_where.cc

Passed for me.

module purge; ml GCC/9.3.0 OpenMPI/4.1.0rc1 mpi-settings/CUDA
export OMPI_MCA_btl=^uct,openib
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_SCHEME=put_zcopy
export OMP_NUM_THREADS=12

../configure \
  --disable-unified \
  --enable-accelerator=cuda \
  --enable-alloc-align=4k \
  --enable-accelerator-cshift \
  --enable-shm=nvlink \
  --enable-comms=mpi-auto \
  --disable-comms-threads \
  --enable-gen-simd-width=64 \
  --disable-gparity \
  --disable-fermion-reps \
  --enable-simd=GPU \
  MPICXX=mpicxx \
  CXX=nvcc \
  CXXFLAGS="-ccbin g++ -gencode arch=compute_80,code=sm_80 -std=c++14 --cudart shared -lineinfo" \
  LDFLAGS=" --cudart shared" \
  LIBS="-lrt -lmpi "

You are missing the --cudart shared

This is CRITICAL for correct operation on Juelich. I don't think that is the cause of this particular issue, but please use my configure line and see what happens with your Test_where test?
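
For example, relative to the configure line quoted above, only the two quoted variables change (paths as in the original):

CXXFLAGS="-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -I/p/home/jusers/ohogain1/jureca/dev/lattice/include/ -std=c++14 --cudart shared" \
LDFLAGS="-L/p/home/jusers/ohogain1/jureca/dev/lattice/lib/ --cudart shared"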

Thanks !

fionnoh commented 3 years ago

Ok, I'll try this out. Thanks Peter!

fionnoh commented 3 years ago

Unfortunately this additional flag didn't solve the issue on my end. Running tests/core/Test_where also passes for me, but that is testing the where function on a LatticeComplexD, which is different from the SeqConservedCurrent case.

I've hacked Test_where a bit, extending it to look at a LatticeFermion (as a sanity check) and a LatticePropagator (as in SeqConservedCurrent):

  unsigned int tmin = 3;

  int Ls = 2;
  GridCartesian * UGrid   = SpaceTimeGrid::makeFourDimGrid(GridDefaultLatt(), GridDefaultSimd(Nd,vComplex::Nsimd()),GridDefaultMpi());
  GridCartesian * FGrid   = SpaceTimeGrid::makeFiveDimGrid(Ls,UGrid);
  std::vector<int> seeds4({1,2,3,4});
  GridParallelRNG  RNG4(UGrid);  RNG4.SeedFixedIntegers(seeds4);
  LatticeInteger lcoor(UGrid); LatticeCoordinate(lcoor,Nd-1);

  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
  std::cout<<GridLogMessage<<"== LatticeFermion =="<<std::endl;
  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;

  LatticeFermion  q_outF(FGrid); q_outF=0.0;
  LatticeFermion  tmpF(UGrid); random(RNG4,tmpF);
  LatticeFermion  tmp2F(UGrid);
  LatticeFermion  ZZF (UGrid);   ZZF=0.0;

  RealD nA=0.0;
  RealD nB=0.0;
  for(int s=0;s<Ls;s++){
    nB = nB + norm2(tmpF); 
    tmp2F   = where((lcoor>=tmin),tmpF,ZZF);
    nA = nA + norm2(tmp2F); 
    InsertSlice(tmp2F, q_outF, s , 0);
  }

  RealD nQO=norm2(q_outF);
  std::cout <<GridLogMessage << "norm_before_where: " << nB << std::endl;
  std::cout <<GridLogMessage << "norm_after_where: " << nA << std::endl;
  std::cout <<GridLogMessage << "norm_q_out: " << nQO << std::endl;

  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;
  std::cout<<GridLogMessage<<"== LatticePropagator =="<<std::endl;
  std::cout<<GridLogMessage<<"=============================================================="<<std::endl;

  LatticePropagator  q_outP(FGrid); q_outP=0.0;
  LatticePropagator  tmpP(UGrid); random(RNG4,tmpP);
  LatticePropagator  tmp2P(UGrid);
  LatticePropagator  ZZP (UGrid);   ZZP=0.0;

  nA=0.0;
  nB=0.0;
  for(int s=0;s<Ls;s++){
    nB = nB + norm2(tmpP); 
    tmp2P   = where((lcoor>=tmin),tmpP,ZZP);
    nA = nA + norm2(tmp2P); 
    InsertSlice(tmp2P, q_outP, s , 0);
  }

  nQO=norm2(q_outP);
  std::cout <<GridLogMessage << "norm_before_where: " << nB << std::endl;
  std::cout <<GridLogMessage << "norm_after_where: " << nA << std::endl;
  std::cout <<GridLogMessage << "norm_q_out: " << nQO << std::endl;

The output from Tesseract is:

Grid : Message : 0.331286 s : ==============================================================
Grid : Message : 0.331299 s : == LatticeFermion ==
Grid : Message : 0.331303 s : ==============================================================
Grid : Message : 0.340348 s : norm_before_where: 65623.5
Grid : Message : 0.340371 s : norm_after_where: 41026.5
Grid : Message : 0.340387 s : norm_q_out: 41026.5
Grid : Message : 0.340400 s : ==============================================================
Grid : Message : 0.340404 s : == LatticePropagator ==
Grid : Message : 0.340408 s : ==============================================================
Grid : Message : 0.418297 s : norm_before_where: 786447
Grid : Message : 0.418334 s : norm_after_where: 491296
Grid : Message : 0.418340 s : norm_q_out: 491296

The output from Juelich is:

Grid : Message : 0.637264 s : ==============================================================
Grid : Message : 0.637268 s : == LatticeFermion ==
Grid : Message : 0.637271 s : ==============================================================
Grid : Message : 0.649588 s : norm_before_where: 65623.5
Grid : Message : 0.649598 s : norm_after_where: 41026.5
Grid : Message : 0.649605 s : norm_q_out: 41026.5
Grid : Message : 0.649610 s : ==============================================================
Grid : Message : 0.649614 s : == LatticePropagator ==
Grid : Message : 0.649617 s : ==============================================================
Grid : Message : 0.697429 s : norm_before_where: 786447
Grid : Message : 0.697460 s : norm_after_where: 0
Grid : Message : 0.697464 s : norm_q_out: 0
paboyle commented 3 years ago

Can you either submit your Test_where.cc as a pull request or attach it or just email it to me please?

fionnoh commented 3 years ago

Sure, I've sent it as an email there.

paboyle commented 3 years ago

Oh joy !

I merely added some printf statements to the "predicatedWhere" internal function, and the error disappeared. Seems like a classic compiler-bug-type Heisenbug: observing it brought the cat back to life.

paboyle commented 3 years ago

Cuda 10, V100: works.
Cuda 11, A100: breaks.
Cuda 11, A100 with printf: works.

Shit.

paboyle commented 3 years ago

Whoops - I'm in public and the entire internet can see my bad language. Should have chosen a MUCH stronger swear word. :)

paboyle commented 3 years ago

Cuda 11.2, A100 works at Juelich.

module purge
export OMPI_MCA_btl=^uct,openib 
export UCX_MEMTYPE_CACHE=n 
export UCX_RNDV_SCHEME=put_zcopy
export OMP_NUM_THREADS=12
nvhome=/p/software/juwelsbooster/stages/2020/software/NVHPC/21.1-GCC-9.3.0/
target=Linux_x86_64
version=21.1

nvcudadir=$nvhome/$target/$version/cuda/11.2
nvcompdir=$nvhome/$target/$version/compilers
nvmathdir=$nvhome/$target/$version/math_libs
nvcommdir=$nvhome/$target/$version/comm_libs

export NVHPC=$nvhome
export CPP=cpp

#export PATH=$nvcompdir/bin:$PATH
export PATH=$nvcommdir/mpi/bin:$PATH
export PATH=$nvcudadir/bin:$PATH
echo PATH: $PATH

export LD_LIBRARY_PATH=$nvcudadir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcompdir/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvmathdir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/mpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nccl/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nvshmem/lib:$LD_LIBRARY_PATH
echo LD_LIBRARY_PATH: $LD_LIBRARY_PATH

export MANPATH=$nvcompdir/man:$MANPATH

module load GCC/9.3.0 OpenMPI/4.1.0rc1 mpi-settings/CUDA
paboyle commented 3 years ago

Life on the bleeding edge... huh.

paboyle commented 3 years ago

I updated the "Test_where_extended.cc" to make it more like the original: a 3d volume, summing slices one at a time and cross-referencing against the total norm. Slicing x, then y, then z.

Doing it for Complex, Fermion and Propagator.
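
The core of the check is roughly the following (a sketch with illustrative names, not the actual Test_where_extended.cc):

  // For each of the x, y, z directions, keep one slice at a time with where(),
  // sum the per-slice norms, and compare against the norm of the whole field.
  LatticePropagator qP(UGrid);    random(RNG4, qP);
  LatticePropagator sliceP(UGrid);
  LatticePropagator zeroP(UGrid); zeroP = 0.0;
  RealD total = norm2(qP);                           // reference norm of the full field
  for(int dir=0; dir<3; dir++){                      // slice x, then y, then z
    LatticeInteger coor(UGrid); LatticeCoordinate(coor, dir);
    RealD sum = 0.0;
    for(int s=0; s<GridDefaultLatt()[dir]; s++){
      sliceP = where(coor == Integer(s), qP, zeroP); // keep slice s, zero elsewhere
      RealD ns = norm2(sliceP);
      std::cout << GridLogMessage << " slice " << s << " " << ns << std::endl;
      sum = sum + ns;
    }
    std::cout << GridLogMessage << " sliceNorm" << dir << " " << total << " " << sum
              << " err " << fabs(total-sum) << std::endl;
    assert(fabs(total-sum) < 1.0e-10);               // slices must reproduce the total
  }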

It still fails on Cuda 11.0:

Grid : Message : 0.860909 s : ==============================================================
Grid : Message : 0.860910 s : == LatticePropagator ==
Grid : Message : 0.860911 s : ==============================================================
Grid : Message : 0.871288 s :  slice 0 0
Grid : Message : 0.872262 s :  slice 1 0
Grid : Message : 0.873247 s :  slice 2 0
Grid : Message : 0.874216 s :  slice 3 0
Grid : Message : 0.875193 s :  slice 4 0
Grid : Message : 0.876151 s :  slice 5 0
Grid : Message : 0.877129 s :  slice 6 0
Grid : Message : 0.878093 s :  slice 7 0
Grid : Message : 0.879065 s :  slice 8 0
Grid : Message : 0.880019 s :  slice 9 0
Grid : Message : 0.881004 s :  slice 10 0
Grid : Message : 0.881968 s :  slice 11 0
Grid : Message : 0.882924 s :  slice 12 0
Grid : Message : 0.883864 s :  slice 13 0
Grid : Message : 0.884826 s :  slice 14 0
Grid : Message : 0.885780 s :  slice 15 0
Grid : Message : 0.886511 s :  sliceNorm0 73462.4 0 err 73462.4
Test_where_extended: ../../../tests/core/Test_where_extended.cc:138: int main(int, char**): Assertion `abs(nn-ns) < 1.0e-10' failed.
[jwb0861:25427] *** Process received signal ***
[jwb0861:25427] Signal: Aborted (6)
[jwb0861:25427] Signal code:  (-6)

And passes under Cuda 11.2

Grid : Message : 0.670630 s : ==============================================================
Grid : Message : 0.670631 s : == LatticePropagator ==
Grid : Message : 0.670632 s : ==============================================================
Grid : Message : 0.737622 s :  slice 0 4494.2
Grid : Message : 0.749601 s :  slice 1 4627.66
Grid : Message : 0.760626 s :  slice 2 4605.53
Grid : Message : 0.771662 s :  slice 3 4548.52
Grid : Message : 0.782125 s :  slice 4 4591.31
Grid : Message : 0.793046 s :  slice 5 4681.65
Grid : Message : 0.804003 s :  slice 6 4616.34
Grid : Message : 0.814936 s :  slice 7 4464.44
Grid : Message : 0.825359 s :  slice 8 4723.06
Grid : Message : 0.836257 s :  slice 9 4636.43
Grid : Message : 0.847341 s :  slice 10 4626.75
Grid : Message : 0.857684 s :  slice 11 4651
Grid : Message : 0.868136 s :  slice 12 4403.01
Grid : Message : 0.879062 s :  slice 13 4610.81
Grid : Message : 0.890084 s :  slice 14 4758.93
Grid : Message : 0.901015 s :  slice 15 4422.76
Grid : Message : 0.910689 s :  sliceNorm0 73462.4 73462.4 err 0
Grid : Message : 0.911143 s :  slice 0 18618.9
Grid : Message : 0.922088 s :  slice 1 18262
Grid : Message : 0.932616 s :  slice 2 18318.4
Grid : Message : 0.943661 s :  slice 3 18263.2
Grid : Message : 0.953739 s :  sliceNorm1 73462.4 73462.4 err 0
Grid : Message : 0.954187 s :  slice 0 18380.5
Grid : Message : 0.965150 s :  slice 1 18639.6
Grid : Message : 0.975635 s :  slice 2 18217.1
Grid : Message : 0.986626 s :  slice 3 18225.2
Grid : Message : 0.996678 s :  sliceNorm2 73462.4 73462.4 err 0
paboyle commented 3 years ago

Updated setup script.

module purge
module load GCC/9.3.0
export OMPI_MCA_btl=^uct,openib 
export UCX_MEMTYPE_CACHE=n 
export UCX_RNDV_SCHEME=put_zcopy
export OMP_NUM_THREADS=12
nvhome=/p/software/juwelsbooster/stages/2020/software/NVHPC/21.1-GCC-9.3.0/
target=Linux_x86_64
version=21.1

nvcudadir=$nvhome/$target/$version/cuda/11.2
nvcompdir=$nvhome/$target/$version/compilers
nvmathdir=$nvhome/$target/$version/math_libs
nvcommdir=$nvhome/$target/$version/comm_libs

export NVHPC=$nvhome
export CPP=cpp

export LD_LIBRARY_PATH=$nvcudadir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcompdir/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvmathdir/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/mpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nccl/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$nvcommdir/nvshmem/lib:$LD_LIBRARY_PATH
echo LD_LIBRARY_PATH: $LD_LIBRARY_PATH

module load OpenMPI/4.1.0rc1 mpi-settings/CUDA

export PATH=$nvcommdir/mpi/bin:$PATH
export PATH=$nvcudadir/bin:$PATH
echo PATH: $PATH

export MANPATH=$nvcompdir/man:$MANPATH
paboyle commented 3 years ago

Cuda 11.1 is broken too.

So Cuda 11.2 is a constraint for A100 and Grid.

paboyle commented 3 years ago

I added an #error macro in CompilerCompatible.h to prevent compilation with Cuda 11.0 and 11.1. We can't trust these compiler versions.
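
A minimal sketch of that kind of guard (the actual check in CompilerCompatible.h may differ):

// Refuse to build with the nvcc versions known to miscompile the where() kernels.
#if defined(__CUDACC_VER_MAJOR__) && defined(__CUDACC_VER_MINOR__)
#if (__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ < 2)
#error "Cuda 11.0 and 11.1 are not supported; please use Cuda 11.2 or later"
#endif
#endif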

paboyle commented 3 years ago

Closing: this is a compiler error, and it is fixed in later compiler versions.