mumax/3

GPU-accelerated micromagnetic simulator

Feature/madd #233

Closed: peytondmurray closed this pull request 3 years ago

peytondmurray commented 5 years ago

The RK23, RK4, RK45, and RK56 solvers make use of the Madd2, Madd3, Madd4, Madd5, Madd6, and Madd7 functions, but only Madd2 and Madd3 are implemented as CUDA kernels; the rest are essentially just nested combinations of Madd2 and Madd3. At each timestep the solvers therefore launch more CUDA kernels than necessary whenever Madd4, Madd5, Madd6, or Madd7 is called. As I understand it, the overhead associated with launching CUDA kernels can be large, and to a lesser extent there is also overhead in calling Go functions. Per step, the counts are (a sketch of where the extra launches come from follows the table):

Solver   Kernels launched per step   Kernel launches needed per step
RK23     6                           4
RK4      5                           4
RK45     15                          7
RK56     21                          9
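
To make the extra launches concrete, here is a minimal CUDA sketch of the composition pattern. The names, signatures, and exact decomposition are illustrative only; mumax3's real kernels operate on its own data layout, and its nesting of Madd2 and Madd3 may differ:

#include <cstdio>

// madd2: dst = f1*s1 + f2*s2 (one kernel launch per call)
__global__ void madd2(float *dst, const float *s1, float f1,
                      const float *s2, float f2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = f1 * s1[i] + f2 * s2[i];
}

// madd3: dst = f1*s1 + f2*s2 + f3*s3 (one kernel launch per call)
__global__ void madd3(float *dst, const float *s1, float f1,
                      const float *s2, float f2,
                      const float *s3, float f3, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = f1 * s1[i] + f2 * s2[i] + f3 * s3[i];
}

// A 4-term multiply-add composed from the two kernels above costs two
// launches instead of one; 5-, 6-, and 7-term versions cost still more.
void madd4_nested(float *dst, const float *s1, float f1,
                  const float *s2, float f2, const float *s3, float f3,
                  const float *s4, float f4, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    madd3<<<blocks, threads>>>(dst, s1, f1, s2, f2, s3, f3, n); // launch 1
    madd2<<<blocks, threads>>>(dst, dst, 1.0f, s4, f4, n);      // launch 2
}

int main() {
    const int n = 1 << 20;
    float *dst, *s1, *s2, *s3, *s4;
    cudaMallocManaged(&dst, n * sizeof(float));
    cudaMallocManaged(&s1, n * sizeof(float));
    cudaMallocManaged(&s2, n * sizeof(float));
    cudaMallocManaged(&s3, n * sizeof(float));
    cudaMallocManaged(&s4, n * sizeof(float));
    for (int i = 0; i < n; i++) { s1[i] = 1; s2[i] = 2; s3[i] = 3; s4[i] = 4; }
    madd4_nested(dst, s1, 1.0f, s2, 1.0f, s3, 1.0f, s4, 1.0f, n);
    cudaDeviceSynchronize();
    printf("dst[0] = %g\n", dst[0]); // 1 + 2 + 3 + 4 = 10
    return 0;
}

Each call to a __global__ function is a separate kernel launch, so every extra level of nesting adds fixed launch overhead on top of the arithmetic itself.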

I implemented CUDA versions of Madd4, Madd5, Madd6, and Madd7, and modified the solvers to use them. The simple benchmark included in the test folder (sp4_madd_bench.mx3) shows basically no improvement for RK23 and RK4, but run() finishes a few percent (~5%) faster on my machine for RK45 and RK56.
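
For comparison, a fused four-term kernel does the same work in a single launch. Again, this is a sketch rather than the kernel from this PR:

// Fused madd4: dst = f1*s1 + f2*s2 + f3*s3 + f4*s4 in one kernel launch,
// replacing the two-launch madd3 + madd2 composition sketched above.
__global__ void madd4(float *dst, const float *s1, float f1,
                      const float *s2, float f2, const float *s3, float f3,
                      const float *s4, float f4, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = f1 * s1[i] + f2 * s2[i] + f3 * s3[i] + f4 * s4[i];
}

Besides saving launch overhead, the fused form writes dst once and never reads the intermediate result back, so it also trims some global-memory traffic.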

godsic commented 5 years ago

@peytondmurray Have you run the unit tests (test/run.bash)? Have you tried our benchmark script (bench/bench.mx3)?

peytondmurray commented 5 years ago

The first time I ran test/run.bash, the tests failed at minimizer.mx3, but the expected value was within 0.00002 of the test value, very close to the tolerance. Interestingly, the second time I ran test/run.bash, all tests passed. I wonder what causes this variability?

Here is the benchmark result: benchmark.txt

godsic commented 5 years ago

@peytondmurray minimizer is not a core module but a contributed mumax3 module. We never felt it was deterministic enough to replace relax, due to numerical stability issues like the one you see. I believe the issue is partly due to non-deterministic execution of commands on NVIDIA GPUs. The IEEE 754 standard does not demand associativity of floating-point operations, so any reshuffling of calculations by the driver or hardware schedulers to maximize GPU occupancy will generally produce different numerical noise and might lead the minimizer to different states.
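
A minimal host-side demonstration of that non-associativity in float32 (an illustration of the general point, not mumax3 code; it compiles with nvcc or any C++ compiler):

#include <cstdio>

int main() {
    // float32 has a 24-bit significand, so near 1e8 adjacent representable
    // values are 8 apart: 1.0e8f + 1.0f rounds back to 1.0e8f.
    volatile float big = 1.0e8f;
    volatile float one = 1.0f;
    printf("(-big + big) + one = %g\n", (-big + big) + one); // prints 1
    printf("-big + (big + one) = %g\n", -big + (big + one)); // prints 0
    return 0;
}

When scheduling changes the order in which such sums are accumulated, run-to-run differences at this level are expected, and a minimizer can amplify them into different final states.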

The bottom line: don't bother with the minimizer unit test unless it gives a terribly wrong result.

What I would like to ask you to do is run bench without your patches, to figure out whether there are any performance benefits.

peytondmurray commented 5 years ago

@godsic The benchmark uses the Heun solver, which is unaffected by this change. Instead, I ran the benchmark with each of the four solvers above on both the master and feature/madd branches, but at first I didn't see much of a difference. The benchmark only runs 100 steps of the solver, so I thought increasing that number might make a better test, and set the benchmark script to run 1000 steps instead. The results still don't make much sense to me: while my initial tests above showed some modest improvement in performance, the story now is much less clear:

[image: benchmark results]

godsic commented 4 years ago

@peytondmurray Would it be possible for you to run the benchmark again? We are about to release 3.10, and I am happy to merge this one if it provides performance benefits.

peytondmurray commented 4 years ago

@godsic Sorry to be slow - I'm preparing the benchmarks to run overnight, and will post results tomorrow.

peytondmurray commented 4 years ago

@godsic I rebased feature/madd onto develop and reran a benchmark on both branches. The benchmark was modified from standard problem #4:

  1. The magnetization was initialized and relaxed, then saved to an .ovf file.
  2. For each of the RK23BS, RK4, RK45DP, and RK56 solvers, the magnetization was loaded from the same .ovf file, the same B_ext from SP4 was applied, and the time the solver took to run 1000 steps was recorded. This procedure was repeated 10 times for each solver. Here's the script I used to test the solvers:
// Grid sizes start at 128 x 32 x 1; nx and ny are doubled each outer iteration.
nx := 128
ny := 32
nz := 1

// Timestamps, declared up front so they can be reassigned inside the loops.
t0 := now()
t1 := now()
t2 := now()
t3 := now()
t4 := now()
t5 := now()
t6 := now()
t7 := now()

nsteps := 1000

// Standard problem 4 material parameters.
Msat = 800e3
Aex = 13e-12
alpha = 0.02
B_ext = vector(-24.6E-3, 4.3E-3, 0)
setcellsize(1e-9, 1e-9, 3e-9)

for i := 0; i < 6; i += 1 {
        t_start := now()
        setgridsize(nx, ny, nz)

        for j := 0; j < 10; j += 1 {
                mfile := sprint("/home/pdmurray/go/src/github.com/mumax/3/test/make_relaxed_configs.out/m_", nx, "-", ny, "-", nz, ".ovf")

                SetSolver(3) // RK23 (Bogacki-Shampine)
                m.LoadFile(mfile)
                t0 = now()
                steps(nsteps)
                t1 = now()

                SetSolver(4) // RK4 (classical)
                m.LoadFile(mfile)
                t2 = now()
                steps(nsteps)
                t3 = now()

                SetSolver(5) // RK45 (Dormand-Prince)
                m.LoadFile(mfile)
                t4 = now()
                steps(nsteps)
                t5 = now()

                SetSolver(6) // RK56 (Fehlberg)
                m.LoadFile(mfile)
                t6 = now()
                steps(nsteps)
                t7 = now()
                print(sprintf("%d, %d, %d, %6.6E, %6.6E, %6.6E, %6.6E", nx, ny, nz, t1.sub(t0).Seconds(), t3.sub(t2).Seconds(), t5.sub(t4).Seconds(), t7.sub(t6).Seconds()))
        }
        t_end := now()
        print(t_end.sub(t_start).Seconds())
        nx = 2 * nx
        ny = 2 * ny
}

The benchmarks were run on a GTX 1080 Ti. After running the simulations, I plotted the execution times with matplotlib: the first row is a scatter plot of the raw execution times as a function of the number of cells N; the second row is the mean execution time as a function of N; and the final row is the ratio of the mean execution time on the feature/madd branch to that on the master branch. In all cases, black points correspond to the master branch and red points to feature/madd. Here are the results:

[image: benchmark plots]

And here are the raw execution times I recorded, formatted into csv files: csv_data.zip

Overall it looks like feature/madd is maybe a bit faster, depending on the solver.