inducer opened this issue 3 years ago
Still experimenting with this, but so far I'm getting the sense that it's something other than the indirection that's causing `resample_by_picking` to be slow.
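For context, the kernel in question does roughly the following (paraphrased from memory of the meshmode/grudge source; take the exact names here as assumptions rather than the verbatim code):

```python
# Paraphrased sketch of the resampling kernel *with* indirection:
# element and node indices are gathered through index arrays.
knl = make_loopy_program(
    """{[iel, idof]:
        0<=iel<nelements and
        0<=idof<n_to_nodes}""",
    "result[to_element_indices[iel], idof] = "
    "ary[from_element_indices[iel], pick_list[idof]]",
    name="resample_by_picking")
```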
First I tried pre-sorting by `batch.to_element_indices`, and that didn't seem to have any effect. Next I tried progressively stripping down the loop (ignoring correctness) to see whether I could simplify it enough to make it disappear from the profile results. I stripped it down to this:
```python
knl = make_loopy_program(
    """{[iel, idof]:
        0<=iel<nelements and
        0<=idof<n_to_nodes}""",
    "result[iel, idof] = ary[iel, idof]",
    [
        lp.GlobalArg("result", None,
            shape="nelements_result, n_to_nodes",
            offset=lp.auto),
        lp.GlobalArg("ary", None,
            shape="nelements_vec, n_from_nodes",
            offset=lp.auto),
        lp.ValueArg("nelements_result", np.int32),
        lp.ValueArg("nelements_vec", np.int32),
        lp.ValueArg("n_from_nodes", np.int32),
        "...",
    ],
    name="resample_by_picking")
```
(i.e., no indirection), and it still looks about the same in the profiling results.

(Note: `nelements` here should be the number of entries in `batch.to_element_indices`, but looking at the generated code I can't tell if it is. More on this below.)
Before:
```
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   16.16%  149.58ms    908  164.73us  17.120us  498.11us  diff
                   15.78%  146.09ms   2560  57.065us  4.6710us  123.14us  resample_by_picking
                   12.37%  114.46ms   2499  45.802us  5.5670us  92.512us  multiply
                   11.44%  105.84ms    493  214.68us  106.69us  482.94us  grudge_assign_0
                   10.60%  98.111ms   2001  49.031us  5.5670us  92.192us  axpbyz
                    7.81%  72.307ms    960  75.319us  55.200us  124.96us  resample_by_mat
                    7.76%  71.836ms   2228  32.242us  3.3280us  66.976us  axpb
                    7.44%  68.819ms     72  955.81us  1.3440us  3.0474ms  [CUDA memcpy DtoH]
                    7.09%  65.586ms    160  409.91us  395.33us  465.12us  face_mass
                    1.46%  13.493ms    235  57.417us  3.8720us  68.735us  [CUDA memcpy DtoD]
                    0.60%  5.5327ms     50  110.65us  106.78us  124.29us  grudge_assign_1
                    0.48%  4.4395ms     40  110.99us  106.98us  124.26us  grudge_assign_2
                    0.41%  3.8396ms    129  29.764us  1.0240us  518.59us  [CUDA memcpy HtoD]
                    0.35%  3.2791ms     12  273.25us  14.016us  482.05us  nodes_0
                    0.14%  1.3395ms      6  223.25us  14.368us  481.44us  actx_special_sqrt
                    0.03%  292.25us    323     904ns     832ns  1.6640us  [CUDA memset]
                    0.03%  276.19us      6  46.031us  6.4000us  85.567us  divide
                    0.02%  213.18us     10  21.318us  20.960us  21.792us  reduce_kernel_stage1
                    0.01%  124.99us      1  124.99us  124.99us  124.99us  actx_special_exp
                    0.01%  72.831us     20  3.6410us  3.2960us  4.4160us  reduce_kernel_stage2
```
After:
```
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   15.15%  164.78ms    908  181.47us  17.087us  500.35us  diff
                   14.83%  161.36ms   2560  63.032us  4.7990us  122.34us  resample_by_picking
                   10.91%  118.63ms    493  240.63us  123.46us  484.73us  grudge_assign_0
                   10.60%  115.31ms   2499  46.142us  6.3670us  92.160us  multiply
                    9.09%  98.829ms   2001  49.389us  6.3670us  93.216us  axpbyz
                    9.03%  98.180ms   3592  27.333us  1.4720us  3.1643ms  [CUDA memcpy DtoH]
                    7.38%  80.274ms    960  83.618us  63.136us  124.64us  resample_by_mat
                    6.77%  73.647ms    160  460.30us  459.17us  463.71us  face_mass
                    6.68%  72.666ms   2228  32.614us  3.3280us  66.400us  axpb
                    3.57%  38.881ms   7040  5.5220us  3.2640us  8.4480us  take
                    3.21%  34.891ms   3649  9.5610us  1.0240us  494.24us  [CUDA memcpy HtoD]
                    1.25%  13.577ms    235  57.775us  3.8400us  68.383us  [CUDA memcpy DtoD]
                    0.57%  6.1862ms     50  123.72us  123.49us  124.48us  grudge_assign_1
                    0.45%  4.9483ms     40  123.71us  123.52us  124.22us  grudge_assign_2
                    0.30%  3.2806ms     12  273.39us  14.112us  482.46us  nodes_0
                    0.12%  1.3401ms      6  223.34us  14.304us  481.57us  actx_special_sqrt
                    0.03%  312.64us    323     967ns     928ns  1.6640us  [CUDA memset]
                    0.03%  277.79us      6  46.298us  6.4640us  86.272us  divide
                    0.02%  215.10us     10  21.510us  21.152us  22.048us  reduce_kernel_stage1
                    0.01%  125.09us      1  125.09us  125.09us  125.09us  actx_special_exp
                    0.01%  81.535us     20  4.0760us  3.7120us  4.6080us  reduce_kernel_stage2
```
(This was for 10 timesteps with `nel_1d = 24`, 3D, order 3.)
I checked the loop sizes (`nelements * n_to_nodes`):

```
min_size = 10580
max_size = 719440
avg_size = 365010.0
```
compared to 1.46M dofs in the volume discretization.
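(For reference, a sketch of how these sizes can be tabulated; the `groups`/`batches` attribute names follow meshmode's connection data structures, but the details here are assumptions, not the code I actually ran:)

```python
# Sketch: per-batch loop sizes (nelements * n_to_nodes), gathered over
# all groups and batches of a discretization connection.
sizes = [
    batch.to_element_indices.shape[0] * batch.result_unit_nodes.shape[-1]
    for grp in connection.groups
    for batch in grp.batches
]
print("min_size =", min(sizes))
print("max_size =", max(sizes))
print("avg_size =", sum(sizes) / len(sizes))
```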
Here is the code generated for the above loop:
```c
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1))) resample_by_picking(__global double *__restrict__ result, int const result_offset, __global double const *__restrict__ ary, int const ary_offset, int const nelements_result, int const nelements_vec, int const n_from_nodes, int const n_to_nodes, int const nelements)
{
  if (-1 + -1 * lid(0) + n_to_nodes >= 0)
    for (int idof_outer = 0; idof_outer <= -1 + -1 * lid(0) + (15 + n_to_nodes + 15 * lid(0)) / 16; ++idof_outer)
      result[result_offset + n_to_nodes * gid(0) + 16 * idof_outer + lid(0)] = ary[ary_offset + n_from_nodes * gid(0) + 16 * idof_outer + lid(0)];
}
```
I'm a little bit confused that it's not using `nelements` (I'm passing it in explicitly to `actx.call_loopy`; I get an error saying it can't be deduced if I omit it). Any ideas?
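For reference, the call looks something like this (paraphrasing; the kwarg names mirror the kernel arguments above, and the surrounding details are from memory):

```python
# Paraphrased call site: nelements is passed explicitly, since no input
# array shape mentions it and loopy therefore cannot deduce it.
result = actx.call_loopy(
    knl,
    ary=ary,
    nelements=nelements,
    nelements_result=nelements_result,
    nelements_vec=nelements_vec,
    n_from_nodes=n_from_nodes,
    n_to_nodes=n_to_nodes,
)["result"]
```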
Thanks for experimenting with this!
> I'm a little bit confused that it's not using `nelements` (I'm passing it in explicitly to `actx.call_loopy`; I get an error saying it can't be deduced if I omit it). Any ideas?
It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
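In loopy terms, the default decomposition amounts to something like the following (a sketch consistent with the generated code above, not the verbatim transform code):

```python
import loopy as lp

# One work-group per element (iel -> g.0); dofs are handled by 16
# work-items (l.0), with a sequential outer loop once idof >= 16.
knl = lp.split_iname(knl, "idof", 16, inner_tag="l.0")
knl = lp.tag_inames(knl, {"iel": "g.0"})
```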
As for removing the indirections, it's super confusing to me that that's not having an effect. It has to... that's a whole bunch of extra memory access that's now not happening (the index reads), and somehow that's not measurable? Wha?
The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
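One cheap experiment on the decomposition front might look like this (untested sketch; the chunk sizes are guesses):

```python
# Untested sketch: put several elements in each work-group so that
# groups are no longer tiny for surface kernels.
knl = lp.split_iname(knl, "iel", 8, outer_tag="g.0", inner_tag="l.1")
knl = lp.split_iname(knl, "idof", 16, inner_tag="l.0")
```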
> It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
Ah, ok.
> The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
It's 3D. I'll take a look into the visual profiler thing.
Note that the visual profiler (`nvvp`) isn't very usable from Lassen; the network latency for graphical apps is too high. It is usually better to generate a profile with `nvprof`, `scp` it to your local machine, and import it into your local `nvvp`.
Note that for the deep analysis, `nvvp`/`nvprof` can be flaky for Python applications. I've had luck with the following command:

```
$ export PYOPENCL_CTX=':1'
$ nvprof -f --analysis-metrics -o foo.nvvp --replay-mode application python examples/wave-eager.py
```

This will create (or overwrite) the output profile in `foo.nvvp`.
Note that the deep analysis (the `--analysis-metrics` argument) is extremely slow, so you need to use small input parameters.
EDIT: You will need an updated CUDA version:

```
$ ml load cuda/11
```
`resample_by_picking` routinely shows up at the top of our GPU profiles. Here's an example from a run (mirgecom wave-eager, `nel_1d = 24`, 3D, order 3):

It's especially striking that it's at the top of the list because it touches lower-dimensional data (surface vs. volume). `multiply` has a similar number of calls, but it touches volume data, and it completes more quickly.

I think there are two opportunities here that we could try:
IMO it's likely that this will have a benefit (but I obviously can't guarantee it). I think it's worth trying.
cc @lukeolson