pohlan / SheetModel.jl


What counts as a new read operation? #19

Closed pohlan closed 3 years ago

pohlan commented 3 years ago

To define T_eff I need to better understand how to count the read operations. For example, if h is used in several lines within a kernel, is it read only once, once per line, or is there some other rule? And what if it is used in different kernel functions?

To test this I set up a simple script:

using ParallelStencil
@init_parallel_stencil(CUDA, Float64, 2)  # for GPU

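# Copy-style kernel: almost no arithmetic, so run time is dominated by memory traffic.
@parallel_indices (ix,iy) function do_something!(H, out1, out2)
    nx, ny = size(H)
    if ix <= nx && iy <= ny                      # bounds guard
        out1[ix, iy] = H[ix, iy]                 # read H, write out1
        out2[ix, iy] = H[ix, iy] + out2[ix, iy]  # read H and out2, write out2
    end
    return
end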

H = Data.Array(randn(4096, 4096))
out1 = @zeros(size(H))
out2 = @zeros(size(H))

t_tic = Base.time()
for i = 1:10^4
    @parallel do_something!(H, out1, out2)
end
t_toc = Base.time() - t_tic                # execution time, s
print(t_toc)

In this example, the execution time should represent the total time spent on read/write operations, since there are almost no computations. Modifying the do_something! function and comparing the execution times should therefore give me some more hints on the question.

luraess commented 3 years ago

Good point with your MWE 💯

> if h is used in several lines within a kernel, is it only read once, once per line or what is the rule?

It should be counted only once, including in your MWE.

> And what if it is used in different kernel functions?

Ideally it should still be counted as a single read, since one could rearrange the code to group all operations in a single kernel function.

Note that T_eff defined this way (very conservatively) represents a "theoretical" or ideal upper bound. Achieving memory copy rates with this metric is challenging, but it gives an idea of how far one is from the absolute optimum. One could also compute T_eff with the number of R+W taken not as the minimal count but from, e.g., hardware counters that show what is actually loaded and stored. This would, however, not provide information on how much one could still improve...
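As a rough sketch of the conservative metric described above, the snippet below computes T_eff from a minimal count of full-array reads and writes. The helper name `t_eff`, its arguments, and the 16 s wall time are assumptions made for illustration; none of them come from ParallelStencil or SheetModel.jl. The count of 4 for the MWE (read H, write out1, read and write out2) follows the "count each array access once" rule.

```julia
# Minimal sketch (hypothetical helper, not part of ParallelStencil or SheetModel.jl).
# T_eff = A_eff / t_it, where A_eff is the minimal number of bytes that must be
# moved per iteration if every array is read and/or written only once.
function t_eff(n_rw::Integer, dims::Tuple, ::Type{T}, t_it::Real) where {T}
    A_eff = n_rw * prod(dims) * sizeof(T)   # bytes per iteration
    return A_eff / t_it / 1e9               # GB/s
end

# MWE from this thread: read H, write out1, read + write out2 => 4 array accesses.
n_rw  = 4
nt    = 10^4        # number of kernel launches
t_toc = 16.0        # s, assumed placeholder for the measured wall time of nt launches
T     = t_eff(n_rw, (4096, 4096), Float64, t_toc / nt)
println("T_eff = ", round(T, digits=1), " GB/s")
```

Counting `n_rw` by hand keeps the metric independent of what the compiler or hardware actually does, which is what makes it a measure of the remaining room for improvement.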

pohlan commented 3 years ago

I can confirm what @luraess wrote above. But I should point out that two different kernel calls result in two independent sets of R/W operations, so we should really put everything into a single kernel, i.e. in the SheetModel, merge the apply_bc!(..) function with update_fields!(..).
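To illustrate the point about merging kernels, here is a plain-Julia CPU sketch (no GPU, no ParallelStencil) that contrasts two separate passes with one fused pass. `pass_a!`, `pass_b!` and `fused!` are hypothetical stand-ins and do not reproduce what apply_bc! or update_fields! actually do; the access counts in the comments are the idealised ones used for T_eff.

```julia
# Two separate passes: `tmp` is written by the first pass and read again by the second.
function pass_a!(tmp, H)
    @inbounds for i in eachindex(H)
        tmp[i] = 2 * H[i]            # 1 read (H) + 1 write (tmp)
    end
    return nothing
end

function pass_b!(out, tmp)
    @inbounds for i in eachindex(tmp)
        out[i] = tmp[i] + 1          # 1 read (tmp) + 1 write (out)
    end
    return nothing
end

# Fused pass: the intermediate value stays in a local variable,
# so only 1 read (H) + 1 write (out) remain.
function fused!(out, H)
    @inbounds for i in eachindex(H)
        t = 2 * H[i]
        out[i] = t + 1
    end
    return nothing
end

H       = rand(64, 64)
tmp     = similar(H)
out_two = similar(H)
out_one = similar(H)

pass_a!(tmp, H)
pass_b!(out_two, tmp)
fused!(out_one, H)
println(out_one == out_two)          # both variants compute the same result
```

The two variants produce identical output, but the fused one touches half as many arrays, which is the motivation for merging kernels.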

Below are some experiments and results. I found that one R or W operation on a full array takes around 4 s (when carried out 10^4 times).


[...]
out1[ix, iy] = out2[ix, iy]
out2[ix, iy] = H[ix, iy]
[...]

⌚ ~16 s (=> 4 R/W)


out1[ix, iy] = H[ix, iy]
out2[ix, iy] = H[ix, iy]

⌚ ~12 s (=> 3 R/W) H is read only once!


out1[ix, iy] = H[ix, iy]
out2[ix, iy] = out1[ix, iy]

⌚ ~12 s (=> 3 R/W) The compiler seems to be smart enough to notice that reading out1 in line 2 is the same as reading H.


out1[ix, iy] = H[ix, iy]
out2[ix, iy] = H[ix, iy] + H[ix, iy]

⌚ ~ 12 s (=> 3 R/W) H is still only read once.


[...]
out1[ix, iy] = H[ix, iy]
out2[ix, iy] = H[ix, iy] + H[ix, iy]
[...]
for i = 1:10^4
    @parallel do_something!(H, out1, out2)
    @parallel do_something!(H, out1, out2)
end

⌚ ~23 s (=> 6 R/W) When written in two separate kernels, all R/W operations are carried out twice.
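As a rough cross-check of the numbers above, the snippet below converts each measured wall time into the memory throughput implied by its counted R/W operations. The variable names are hypothetical; the array size, the 10^4 launches, the R/W counts and the timings are the ones reported in this thread.

```julia
# Rough check of the measurements above; all names here are hypothetical.
nx, ny   = 4096, 4096
nt       = 10^4                        # kernel launches per measurement
bytes_rw = nx * ny * sizeof(Float64)   # bytes moved by one full-array read or write

# (label, counted R/W operations, measured wall time in s) as reported above
runs = [("out1 = out2; out2 = H",      4, 16.0),
        ("out1 = H; out2 = H",         3, 12.0),
        ("same kernel launched twice", 6, 23.0)]

for (label, n_rw, t) in runs
    gbs = n_rw * bytes_rw * nt / t / 1e9   # implied throughput in GB/s
    println(rpad(label, 30), n_rw, " R/W  ", round(gbs, digits=1), " GB/s")
end
```

If the counting rule is right, the implied throughput should come out roughly the same for every variant, since it is then limited by the same memory bandwidth.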

pohlan commented 3 years ago

Apparently, this is also true if the operations are part of different if statements:


[...]
@parallel_indices (ix,iy) function do_something!(H, out1, out2)
    nx, ny = size(H)
    if ix <= nx && iy <= ny
        if ix % 2 == 0
            out1[ix, iy] = out2[ix, iy]
        end
        if ix % 2 != 0
            out2[ix, iy] = H[ix, iy]
        end
    end
    return
end
[...]
for i = 1:10^4
    @parallel do_something!(H, out1, out2)
end
[...]

⌚ ~ 23 s (=> 4 R/W, plus additional time due to introducing ifs)


[...]
        if ix % 2 == 0
            out1[ix, iy] = H[ix, iy]
        end
        if ix % 2 != 0
            out2[ix, iy] = H[ix, iy]
        end
[...]

⌚ ~ 19 s (=> 3 R/W)


Interestingly, the first example above (reading from out2 and H) also takes ⌚ ~ 20 s if the two statements are written as a single if/else. The compiler then probably understands that every index is accessed only once, be it from H or out2.


💡 This simple example also shows that if statements can reduce the performance quite a bit.

luraess commented 3 years ago

Very nice, @pohlan! Thanks for reporting these insights.

_And again, note that the time measured here roughly reflects the "true" throughput, which may differ from T_eff._