Closed peytondmurray closed 3 years ago
@peytondmurray Have you run the unit tests (test/run.bash)? Have you tried our benchmark script (bench/bench.mx3)?
The first time I ran test/run.bash, the tests failed at minimizer.mx3, but the expected value was within 0.00002 of the test value, very close to the tolerance. Interestingly, the second time I ran test/run.bash, all tests passed. I wonder what causes this variability?
Here is the benchmark result: benchmark.txt
@peytondmurray minimizer is not a core mumax3 module but a contributed one. We never felt it was deterministic enough to replace relax, because of numerical stability issues like the ones you see. I believe the issue is partly due to the non-deterministic execution order of operations on NVIDIA GPUs. The IEEE 754 standard does not demand associativity of floating-point operations, so any reshuffling of calculations by the driver or hardware schedulers to maximize GPU occupancy will generally produce different numerical noise and might lead minimizer to different states.
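The non-associativity mentioned above can be reproduced on a CPU. The following is a minimal Go sketch, not mumax3 code: the helper names `groupLeft` and `groupRight` are made up for illustration, and the values are chosen so that 1 is smaller than the spacing between neighbouring float32 values near 1e8.

```go
package main

import "fmt"

// groupLeft computes (a + b) + c in float32.
// The explicit conversions force rounding to float32 after each addition.
func groupLeft(a, b, c float32) float32 {
	return float32(float32(a+b) + c)
}

// groupRight computes a + (b + c) in float32, again rounding after each addition.
func groupRight(a, b, c float32) float32 {
	return float32(a + float32(b+c))
}

func main() {
	// Illustrative values only: near 1e8 adjacent float32 values are 8 apart,
	// so adding 1 to -1e8 rounds back to -1e8.
	var a, b, c float32 = 1e8, -1e8, 1

	fmt.Println("(a + b) + c =", groupLeft(a, b, c))
	fmt.Println("a + (b + c) =", groupRight(a, b, c))
}
```

The two groupings give 1 and 0 respectively for the same three inputs. A GPU reduction that sums in a different order from run to run can show the same kind of difference, only at a much smaller relative scale.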
The bottom line: don't bother with the minimizer unit test unless it gives a terribly wrong result.
What I would like to ask you to do is run bench with and without your patches, so we can quantify any performance benefit.
@godsic The benchmark uses the Heun solver, which is unaffected by this change. I instead ran the benchmark for each of the four different solvers above on both the master and feature/madd branches, but I didn't see much of a difference at first. I thought this might be because the benchmark only runs 100 steps of the solver, so I set the benchmark script to run 1000 steps instead, but the results still don't make much sense to me. While my initial tests above showed some modest improvement in performance, the story now is much less clear:
@peytondmurray Would it be possible for you to run the benchmark again? We are about to release 3.10, and I am happy to merge this one if it provides performance benefits.
@godsic Sorry to be slow - I'm preparing the benchmarks to run overnight, and will post results tomorrow.
@godsic I rebased feature/madd onto develop and reran a benchmark on both branches. The benchmark was modified from standard problem #4:
.ovf
..ovf
, the same B_ext from SP4 was applied, and the time the solver took to run for 1000 steps was recorded. This procedure was repeated 10 times for each solver. Here's the script I used to test the solvers:
nx := 128
ny := 32
nz := 1

// Timestamps taken before and after each solver's run.
t0 := now()
t1 := now()
t2 := now()
t3 := now()
t4 := now()
t5 := now()
t6 := now()
t7 := now()

nsteps := 1000

// Material parameters; the second Msat assignment overrides the first.
Msat = 1600e3
Aex = 13e-12
Msat = 800e3
alpha = 0.02
B_ext = vector(-24.6E-3, 4.3E-3, 0)
setcellsize(1e-9, 1e-9, 3e-9)

// Outer loop: six grid sizes, doubling nx and ny after each one.
for i := 0; i < 6; i += 1 {
    t_start := now()
    setgridsize(nx, ny, nz)

    // Inner loop: 10 repetitions per grid size.
    for j := 0; j < 10; j += 1 {
        mfile := sprint("/home/pdmurray/go/src/github.com/mumax/3/test/make_relaxed_configs.out/m_", nx, "-", ny, "-", nz, ".ovf")

        // RK23
        SetSolver(3)
        m.LoadFile(mfile)
        t0 = now()
        steps(nsteps)
        t1 = now()

        // RK4
        SetSolver(4)
        m.LoadFile(mfile)
        t2 = now()
        steps(nsteps)
        t3 = now()

        // RK45
        SetSolver(5)
        m.LoadFile(mfile)
        t4 = now()
        steps(nsteps)
        t5 = now()

        // RK56
        SetSolver(6)
        m.LoadFile(mfile)
        t6 = now()
        steps(nsteps)
        t7 = now()

        // One CSV row per repetition: grid size, then seconds per solver.
        print(sprintf("%d, %d, %d, %6.6E, %6.6E, %6.6E, %6.6E", nx, ny, nz, t1.sub(t0).Seconds(), t3.sub(t2).Seconds(), t5.sub(t4).Seconds(), t7.sub(t6).Seconds()))
    }

    // Total wall time for this grid size.
    t_end := now()
    print(t_end.sub(t_start).Seconds())

    nx = 2 * nx
    ny = 2 * ny
}
The benchmarks were run on a GTX 1080 Ti. After running the simulations, I plotted the execution times with matplotlib: the first row is a scatter plot of all the raw execution times as a function of the number of cells N; the second row is the mean execution time as a function of N; and the final row is the ratio of the mean execution time on the feature/madd branch to the mean execution time on the master branch. In all cases, black points correspond to the master branch and red points correspond to feature/madd. Here are the results:
And here are the raw execution times I recorded, formatted into csv files: csv_data.zip
Overall it looks like feature/madd is maybe a bit faster, depending on the solver.
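For anyone who wants to recompute the ratio shown in the final row of the plots from the CSV data, here is a minimal Go sketch. The function names `mean` and `speedRatio` are made up for this example, and the timings in `main` are placeholders, not values from the benchmark.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// speedRatio returns mean(feature) / mean(baseline).
// A value below 1 means the feature branch took less time on average.
func speedRatio(feature, baseline []float64) float64 {
	return mean(feature) / mean(baseline)
}

func main() {
	// Placeholder timings in seconds, NOT measurements from this thread.
	master := []float64{2.0, 2.1, 1.9, 2.0}
	madd := []float64{1.9, 2.0, 1.8, 1.9}

	fmt.Printf("mean(feature/madd) / mean(master) = %.3f\n", speedRatio(madd, master))
}
```

With only 10 repetitions per solver and grid size, the scatter of the raw times matters as much as the ratio of the means, so a ratio close to 1 should be read with that spread in mind.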
The RK23, RK4, RK45, and RK56 solvers make use of the Madd2, Madd3, Madd4, Madd5, Madd6, and Madd7 functions, but only Madd2 and Madd3 are implemented as CUDA kernels. The rest of these functions essentially just call nested combinations of Madd2 and Madd3. At each timestep, the solvers therefore launch more CUDA kernels than needed whenever Madd4, Madd5, Madd6, or Madd7 is called. As I understand it, the overhead of launching CUDA kernels can be large, and to a lesser extent there is also overhead in calling Go functions.
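To make the difference concrete, here is a CPU-only Go sketch of the idea. It is an analogy, not the mumax3 implementation: the lowercase names and signatures are made up for this example, plain slices stand in for GPU buffers, and each full sweep over the data stands in for one kernel launch. The nesting shown (a three-term sum followed by a two-term sum) is one plausible composition, assumed for illustration.

```go
package main

import "fmt"

// madd2 writes dst[i] = fa*a[i] + fb*b[i] and returns the number of sweeps (1).
func madd2(dst, a []float32, fa float32, b []float32, fb float32) int {
	for i := range dst {
		dst[i] = fa*a[i] + fb*b[i]
	}
	return 1
}

// madd3 writes dst[i] = fa*a[i] + fb*b[i] + fc*c[i] in one sweep.
func madd3(dst, a []float32, fa float32, b []float32, fb float32, c []float32, fc float32) int {
	for i := range dst {
		dst[i] = fa*a[i] + fb*b[i] + fc*c[i]
	}
	return 1
}

// madd4Nested builds the four-term sum from madd3 followed by madd2,
// so it sweeps over the data twice.
func madd4Nested(dst, a []float32, fa float32, b []float32, fb float32,
	c []float32, fc float32, d []float32, fd float32) int {
	n := madd3(dst, a, fa, b, fb, c, fc)
	n += madd2(dst, dst, 1, d, fd)
	return n
}

// madd4Fused computes the same four-term sum in a single sweep.
func madd4Fused(dst, a []float32, fa float32, b []float32, fb float32,
	c []float32, fc float32, d []float32, fd float32) int {
	for i := range dst {
		dst[i] = fa*a[i] + fb*b[i] + fc*c[i] + fd*d[i]
	}
	return 1
}

func main() {
	a, b := []float32{1, 2}, []float32{3, 4}
	c, d := []float32{5, 6}, []float32{7, 8}
	nested, fused := make([]float32, 2), make([]float32, 2)

	nNested := madd4Nested(nested, a, 0.5, b, 2, c, 1, d, 0.25)
	nFused := madd4Fused(fused, a, 0.5, b, 2, c, 1, d, 0.25)

	fmt.Println("nested:", nested, "sweeps:", nNested)
	fmt.Println("fused: ", fused, "sweeps:", nFused)
}
```

Both variants produce the same values here, but the nested one touches the data twice. On a GPU each sweep would additionally pay a kernel launch, and the gap grows for Madd5 through Madd7, which need more nested calls.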
I implemented CUDA versions of Madd4, Madd5, Madd6, and Madd7, and modified the solvers to use these functions. The simple benchmark included in the test folder (sp4_madd_bench.mx3) shows basically no improvement for RK23 and RK4, but on my machine run() finishes roughly 5% faster for RK45 and RK56.