mrc-ide / covid-sim

This is the COVID-19 CovidSim microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London.
GNU General Public License v3.0

Checksum fails on Cori. (Haswell/KNL DOE supercomputer) #30

Open kngott opened 4 years ago

kngott commented 4 years ago

When doing our initial testing, we successfully compiled with Intel and GCC compilers. However, the test case failed on Cori, but was successful on our workstations. We tracked the problem down to the AVX2 and AVX512 instructions in the Cray compiler wrappers:

-march=core-avx2 & -march=core-avx512

While tracking this down, we noticed that the first divergence appears in P.SpatialBoundingBox[3]. It seems to come from roundoff error in SetupModel.cpp, around line 128:

P.nch = 4 * ((int)ceil(P.height / P.cheight / 4));

P.height / P.cheight is very close to 7, and P.nch ends up being 7 with avx2 and 8 without.
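To illustrate how sensitive this kind of expression is, here is a minimal sketch that mirrors the shape of the quoted line. The function names and the input values are hypothetical and are not taken from the model. It only shows that when the quotient sits exactly on a rounding boundary, a change of one unit in the last place is enough to move ceil() to the next integer.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper with the same shape as the quoted expression:
// the number of cells, rounded up to a multiple of 4.
int cells_rounded_up_to_4(double extent, double cell_size) {
    return 4 * static_cast<int>(std::ceil(extent / cell_size / 4));
}

// The same computation after nudging the extent up by one ulp,
// standing in for a tiny roundoff difference between two builds.
int cells_after_one_ulp_nudge(double extent, double cell_size) {
    const double nudged =
        std::nextafter(extent, std::numeric_limits<double>::infinity());
    return cells_rounded_up_to_4(nudged, cell_size);
}
```

With the illustrative inputs extent = 8.0 and cell_size = 1.0, the first function returns 8 and the second returns 12, even though the two extents differ only in the last bit.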

Note that if one prints the numbers (or inserts a std::atomic_memory_fence there), the numbers agree: both are 8. However, the code then diverges again elsewhere, and the checksum still fails.

A general note: the checksum regression test will likely not work across different compilers and hardware. Even a*x+b could give different answers depending on what the compiler chooses to do: a fused multiply-add (FMA), or a multiplication followed by an addition.
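The a*x+b point can be reproduced in isolation. The sketch below is an illustration only, with hypothetical function names and hand-picked inputs, and assumes IEEE 754 doubles with the default round-to-nearest mode. It computes the same expression two ways: once with the product rounded to double before the addition, and once with std::fma, which rounds only once at the end.

```cpp
#include <cmath>

// a*x + b with the product rounded to double before the addition.
// The volatile store stops the compiler from contracting the two
// operations into a single fused multiply-add.
double multiply_then_add(double a, double x, double b) {
    volatile double product = a * x;
    return product + b;
}

// a*x + b computed as one fused operation with a single final rounding.
double fused_multiply_add(double a, double x, double b) {
    return std::fma(a, x, b);
}
```

For a = 1 + 2^-27, x = 1 - 2^-27 and b = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. The unfused version therefore returns 0, while the fused version returns -2^-54. A compiler that is free to pick either form can change results in the last bits, and those differences can then be amplified by later branches or rounding steps.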

matt-gretton-dann commented 4 years ago

So I agree the checksum regression test is not a perfect solution. As another example of the issue you highlight, we deliberately run it single-threaded so that differences in thread scheduling don't produce different results between runs. Fused multiply-add and vectorization will just add to the issues.

This isn't a problem when running the model in full, as it is stochastic anyway.

However, we haven't had the time to work out a scalable and maintainable way of running the regression test in a way that allows a small amount of variation, but doesn't let the figures drift over time.
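One possible shape for such a test, offered only as a sketch and not as anything the project has adopted: compare each output value against a fixed, checked-in reference within a tolerance. Comparing against a pinned reference rather than against the previous run is what would keep the figures from drifting over time. The function name, the tolerance parameters, and the idea that the outputs are available as a flat list of doubles are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: every value must match the pinned reference to within
// max(abs_tol, rel_tol * |reference|). The reference is never updated from a
// passing run, so small deviations cannot accumulate.
bool within_tolerance(const std::vector<double>& reference,
                      const std::vector<double>& actual,
                      double rel_tol, double abs_tol) {
    if (reference.size() != actual.size()) {
        return false;  // a different number of outputs is always a failure
    }
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double diff = std::fabs(actual[i] - reference[i]);
        const double limit =
            std::max(abs_tol, rel_tol * std::fabs(reference[i]));
        if (!(diff <= limit)) {
            return false;  // written this way so that NaN also fails
        }
    }
    return true;
}
```

Whether any fixed tolerance is workable here is an open question: once a rounding difference changes a branch or a cell count, as in the case reported above, the outputs may differ by far more than a few bits.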

kngott commented 4 years ago

c=1 in the regression testing did catch my eye. :)

Until a more general solution to the regression testing is worked out, I guess the best thing to do is just be aware the AVX2 and AVX512 instruction sets are known to cause variations.