Closed: pohlan closed this issue 3 years ago
Good point with your MWE!

"if h is used in several lines within a kernel, is it only read once, once per line or what is the rule?"

It should only be counted once. In your MWE, out1 amounts to 1 W, out2 to 1 R+W, and H should be 1 R.

"And what if it is used in different kernel functions?"

Ideally it should still be a single read, as one could arrange the functions to group all operations in a single function.
Note that T_eff defined as such (very conservatively) represents a "theoretical" or ideal upper bound. Achieving memory copy rates with this metric is challenging, but it gives an idea of how far one is from the absolute optimum. One could also compute T_eff defining the number of R+W not as the minimal one, but by taking e.g. hardware counters and seeing what is actually loaded and stored. This would however not provide information on how much one could still improve...
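For concreteness, a minimal sketch (with made-up array size and iteration time, and the minimal R/W count of the MWE above) of how such an ideal T_eff could be evaluated:

n_RW = 4                                        # minimal reads + writes: out1 (1 W), out2 (1 R + 1 W), H (1 R)
nx, ny = 4096, 4096                             # assumed array size
t_it = 1.6e-3                                   # assumed time per iteration [s]
A_eff = n_RW * nx * ny * sizeof(Float64) / 1e9  # effective memory access per iteration [GB]
T_eff = A_eff / t_it                            # ideal effective memory throughput [GB/s], here ~335 GB/s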
I can confirm what @luraess wrote above. But I should point out that two different kernel calls result in two independent sets of R/W operations, so we should really put everything into a single kernel, i.e. in the SheetModel merge the apply_bc!(..) function with update_fields!(..).
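For illustration, a schematic sketch of such a merge (placeholder field names, update rule and boundary condition; not the actual SheetModel code):

using ParallelStencil
@init_parallel_stencil(Threads, Float64, 2)      # backend, number type and dimensions are assumptions

@parallel_indices (ix, iy) function update_fields_with_bc!(h, dhdt, dt)
    nx, ny = size(h)
    if ix <= nx && iy <= ny
        h[ix, iy] = h[ix, iy] + dt * dhdt[ix, iy]        # placeholder interior update
        if ix == 1 || ix == nx || iy == 1 || iy == ny
            h[ix, iy] = 0.0                              # placeholder boundary condition, applied in the same pass
        end
    end
    return
end

This way each array is read and written only once per iteration instead of once per kernel call.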
Below are some experiments and results. I figured out that one R or W operation takes around 4 s (when carried out 10^4 times).
[...]
out1[ix, iy] = out2[ix, iy]
out2[ix, iy] = H[ix, iy]
[...]
~16 s (=> 4 R/W)
out1[ix, iy] = H[ix, iy]
out2[ix, iy] = H[ix, iy]
~12 s (=> 3 R/W): H is read only once!
out1[ix, iy] = H[ix, iy]
out2[ix, iy] = out1[ix, iy]
~12 s (=> 3 R/W): it seems to be smart enough to notice that reading out1 in the second line is the same as reading H.
out1[ix, iy] = H[ix, iy]
out2[ix, iy] = H[ix, iy] + H[ix, iy]
~12 s (=> 3 R/W): H is still only read once.
[...]
out1[ix, iy] = H[ix, iy]
out2[ix, iy] = H[ix, iy] + H[ix, iy]
[...]
for i = 1:10^4
    @parallel do_something!(H, out1, out2)
    @parallel do_something!(H, out1, out2)
end
~23 s (=> 6 R/W): when spread over two separate kernel launches, all R/W operations are carried out twice.
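A quick consistency check with the ~4 s per R/W (over 10^4 iterations) estimated above confirms the doubling:

t_per_RW = 4.0                      # s per array read or write, summed over the 10^4 iterations
n_RW_single = 3                     # one call: H (1 R), out1 (1 W), out2 (1 W)
n_RW_double = 2 * n_RW_single       # two calls re-read and re-write everything
t_single = n_RW_single * t_per_RW   # ~12 s, matching the single-call measurement above
t_double = n_RW_double * t_per_RW   # ~24 s, close to the measured ~23 s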
Apparently, this is also true if the operations are placed in different if statements (so that each thread executes only one of them):
[...]
@parallel_indices (ix,iy) function do_something!(H, out1, out2)
    nx, ny = size(H)
    if ix < nx+1 && iy < ny+1
        if ix % 2 == 0
            out1[ix, iy] = out2[ix, iy]
        end
        if ix % 2 != 0
            out2[ix, iy] = H[ix, iy]
        end
    end
    return
end
[...]
for i = 1:10^4
    @parallel do_something!(H, out1, out2)
end
[...]
~23 s (=> 4 R/W, plus additional time due to introducing the ifs)
[...]
if ix % 2 == 0
    out1[ix, iy] = H[ix, iy]
end
if ix % 2 != 0
    out2[ix, iy] = H[ix, iy]
end
[...]
~19 s (=> 3 R/W)
Interestingly, the upper example also takes only ~20 s if the statements are formulated as an if/else; probably the compiler then understands that every index is only accessed once, be it from H or from out2.
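For reference, a sketch of the if/else formulation meant here (a drop-in variant of the do_something! kernel above; same setup otherwise):

@parallel_indices (ix,iy) function do_something!(H, out1, out2)
    nx, ny = size(H)
    if ix < nx+1 && iy < ny+1
        if ix % 2 == 0
            out1[ix, iy] = out2[ix, iy]
        else
            out2[ix, iy] = H[ix, iy]
        end
    end
    return
end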
This simple example also shows that if statements can reduce the performance quite a bit.
Very nice @pohlan! Thanks for reporting these insights.
_And again, note that the time here reports somewhat the "true" throughput, which may be different from T_eff._
To define T_eff I need to better understand how to count the read operations. For example, if h is used in several lines within a kernel, is it only read once, once per line, or what is the rule? And what if it is used in different kernel functions? To test this I set up a simple script:
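A minimal version of such a script could look as follows (backend, array size and the exact do_something! body are placeholders; the kernel body is what gets modified between the experiments):

using ParallelStencil
@init_parallel_stencil(Threads, Float64, 2)      # or CUDA; backend and precision assumed here

nx, ny = 4096, 4096                              # assumed array size
H    = @rand(nx, ny)
out1 = @zeros(nx, ny)
out2 = @zeros(nx, ny)

@parallel_indices (ix,iy) function do_something!(H, out1, out2)
    nx, ny = size(H)
    if ix < nx+1 && iy < ny+1
        out1[ix, iy] = out2[ix, iy]              # lines to be varied between experiments
        out2[ix, iy] = H[ix, iy]
    end
    return
end

t = @elapsed for i = 1:10^4
    @parallel do_something!(H, out1, out2)
end
println("elapsed: ", t, " s")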
In this example, the execution time should represent the total time spent on read/write operations, since there are almost no computations. So modifying the do_something! function and comparing the execution times should give me some more hints on the question.