Chiil opened 3 years ago
@julietbravo could you post the scatter plot from your slide here as well?
I can imagine that if there are almost no coalesced memory accesses in the kernel, there is not much benefit in increasing the block sizes...
I will run the code in nsight-compute to see if there is a reasonable explanation for the performance of the best configuration.
@julietbravo how can I generate the binary files with the data? I am trying to run rcemip but I get the following errors:
(venv) [15:29] [alessio@node053] [~/src/rte-rrtmgp-cpp/rcemip]$ ./test_rte_rrtmgp
###### Starting RTE+RRTMGP solver ######
Solver settings:
cloud-optics = false
fluxes = true
longwave = true
output-bnd-fluxes = false
output-optical = false
shortwave = true
Reading atmospheric input data from NetCDF.
WARNING: Gas "co" not available in input file.
WARNING: Gas "ccl4" not available in input file.
WARNING: Gas "cfc11" not available in input file.
WARNING: Gas "cfc12" not available in input file.
WARNING: Gas "cfc22" not available in input file.
WARNING: Gas "hfc143a" not available in input file.
WARNING: Gas "hfc125" not available in input file.
WARNING: Gas "hfc23" not available in input file.
WARNING: Gas "hfc32" not available in input file.
WARNING: Gas "hfc134a" not available in input file.
WARNING: Gas "cf4" not available in input file.
WARNING: Gas "no2" not available in input file.
Preparing NetCDF output file.
Initializing the longwave solver.
Solving the longwave radiation.
EXCEPTION: play is out of range
(venv) [15:29] [alessio@node053] [~/src/rte-rrtmgp-cpp/rcemip]$
This error is the result of the more strict range checking that I implemented on the CPU side. I have pushed a fix for this now. This error is from the CPU solver, because I still need to implement these checks on the GPU side. Please make sure to use the test_rte_rrtmgp_gpu executable.
The .bin files are generated by the cuda_dump_bins branch. Running the rcemip case should result in a bunch of .bin files in that directory. This branch is not really usable for anything else, as it throws directly after calling the kernels for the first time: https://github.com/microhh/rte-rrtmgp-cpp/blob/cuda_dump_bins/src_kernels_cuda/gas_optics_kernel_launchers.cu#L386
Refactored the tuning script with the idea to use it both for tuning, and for executing the best performing (or any) configuration to help profiling.
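A minimal sketch of what such a dual-purpose script could look like, in Python with Kernel Tuner. The flag names, defaults, tuning space, and the `load_kernel_and_data` helper are all hypothetical; the actual script in the repository may be structured differently:

```python
import argparse

def build_parser():
    # Flag names and defaults here are illustrative, not the real script's.
    parser = argparse.ArgumentParser(
        description="Tune a kernel, or run one configuration for profiling")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--tune", action="store_true",
                      help="search the full tuning parameter space")
    mode.add_argument("--run", action="store_true",
                      help="run a single (e.g. best-known) configuration")
    parser.add_argument("--block_size_x", type=int, default=14)
    parser.add_argument("--block_size_y", type=int, default=1)
    parser.add_argument("--block_size_z", type=int, default=32)
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported lazily so the script can be inspected/tested without a GPU.
    import kernel_tuner
    # load_kernel_and_data() is a hypothetical helper that would read the
    # kernel source and the .bin input/reference data.
    kernel_source, problem_size, arguments = load_kernel_and_data()
    if args.tune:
        tune_params = {"block_size_x": [1, 2, 4, 8, 14, 16, 32],
                       "block_size_y": [1, 2, 4],
                       "block_size_z": [1, 2, 4, 8, 16, 32]}
        kernel_tuner.tune_kernel("Planck_source_kernel", kernel_source,
                                 problem_size, arguments, tune_params)
    else:
        params = {"block_size_x": args.block_size_x,
                  "block_size_y": args.block_size_y,
                  "block_size_z": args.block_size_z}
        kernel_tuner.run_kernel("Planck_source_kernel", kernel_source,
                                problem_size, arguments, params)
```

With this layout, `--tune` explores the space while `--run` replays one configuration under a profiler such as Nsight Compute.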
I made a draft (not tested) of the kernel to improve parallelism and memory coalescing. It should work if all gpts (bands) are contiguous in memory without any "holes".
The g-points and bands are contiguous in memory, e.g. 1 2 3 ... 15 16 (band1) 17 18 ... 31 32 (band2)
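A small Python check of that assumption (the band limits below are made up; RRTMGP stores the real ones in its lookup tables): the layout is usable by the draft kernel only if each band's g-point range starts right after the previous one ends, with no holes:

```python
import numpy as np

# Toy band-limits table: 1-based, inclusive [gpt_start, gpt_end] per band.
# Values are illustrative only.
band_lims_gpt = np.array([[1, 16], [17, 32], [33, 48]])

def gpts_are_contiguous(band_lims):
    """True if band N+1 starts right after band N ends, with no holes."""
    starts, ends = band_lims[:, 0], band_lims[:, 1]
    return starts[0] == 1 and np.all(starts[1:] == ends[:-1] + 1)

assert gpts_are_contiguous(band_lims_gpt)
```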
Together with @julietbravo I took a look at the Planck source kernel. We don't fully understand the changes proposed by @gemoulard, but perhaps we can discuss these changes with George Emmanuel in a short meeting at some point.
One observation we made when inspecting the code is that the index in the x-dimension (the index of the band) is used to look up gpt_start and gpt_end, so different threads in the x-dimension may see different values for start and stop. These values are used as start and stop conditions for several of the for-loops in the kernel. This may cause performance issues; it is better if threads that are adjacent in the x-dimension have the same loop iteration counts.
Also, the last two for-loops can probably be merged.
We would like to experiment with this kernel, but we first need to fix the tuning script. @julietbravo is going to make the tuning script work with the latest bin files for the Planck source kernel, and I will experiment with swapping the dimensions in the kernel after that.
@benvanwerkhoven, strange, the correctness checks pass (on a V100):
strat008@euw-vm-bvs-slocs-o:~/models/rte-rrtmgp-cpp/tuning_kernels_cuda$ python3 planck_source_kernel.py --run
Running Planck_source_kernel<double> [block_size_x: 14, block_size_y: 1, block_size_z: 32]
Using: NVIDIA Tesla V100-PCIE-16GB
results for sfc_src: OKAY!
results for lay_src: OKAY!
results for lev_src_inc: OKAY!
results for lev_srd_dec: OKAY!
results for sfc_src_jac: OKAY!
I just discovered that my bin files were probably generated with type_bool = 8 bits rather than 32 bits; this, rather than the difference in GPUs, causes the discrepancy between our setups.
For the record, it indeed turned out that it was a problem with type_bool. The bins were generated with 8bit bools, but the code compiled with Kernel Tuner was defaulting to 32bit bools. We've switched to using only 8bit bools everywhere now and the errors have disappeared.
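A tiny Python/numpy illustration (file name and values made up) of why the width mismatch corrupts the data: four 8-bit bools on disk, read back assuming 32-bit elements, yield a single bogus value instead of four flags:

```python
import numpy as np
import os, tempfile

# Write four 8-bit bools to a temporary .bin file, mimicking the dump branch.
flags = np.array([True, False, True, True])          # numpy bool_: 1 byte each
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    flags.tofile(f)
    path = f.name

wrong = np.fromfile(path, dtype=np.int32)    # 4 bytes consumed per "bool"
right = np.fromfile(path, dtype=np.bool_)    # matches the on-disk width

assert wrong.size == 1                       # one garbage 32-bit value
assert np.array_equal(right, flags)          # correct round-trip
os.remove(path)
```

The same mismatch in the kernel-tuning setup silently misinterprets every boolean input array, which is consistent with the correctness checks passing on one setup and failing on another.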
I took another look at the Planck source kernel using the lessons learned from working on the tau_minor kernel.
We can apply the same trick to the Planck source kernel as for tau_minor. The 3D interpolate by flav can be inlined and all 4 loops can be fused, removing the need for loads and stores to pfrac, which can then just be a register. This does require that the if on ilay for the surface is moved into the loop, and we'll need to check some of the computations that are currently outside the gpt-start-end loops.
I also think it might be better to iterate over the bands, as is done with nscales in tau_minor, rather than parallelizing this using the block x-dimension. In this way we could perhaps, like @gemoulard proposed, use the x-dim to parallelize the gpt loop. That should improve coalescing for nearly all global memory loads/stores, but it also requires either using a subset of the threads for the work currently outside the gpt loops or duplicating a few computations.
So I would start with inlining interpolate3D, fusing all loops, and moving pfrac to a register. Then compare the parallelization across bands vs. across gpts. You could even use a max_gpt of some size if the gpts per band never exceed a small value like 16 or 32, as is the case with the use_shared_tau version of the tau_minor kernel.
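A numpy toy model of the fusion idea, with stand-in arithmetic in place of the real interpolate3D and Planck computations: the unfused version materializes pfrac as an array (global memory traffic in the kernel), the fused version keeps it in a scalar ("a register"), and both produce identical results:

```python
import numpy as np

rng = np.random.default_rng(0)
ngpt = 16
w = rng.random(ngpt)                 # stand-in for interpolation weights
planck_frac = rng.random(ngpt)       # stand-in for the interpolated fraction
planck_src = rng.random(ngpt)        # stand-in for the Planck source terms

# Unfused: one loop writes pfrac to memory, a second loop reads it back.
pfrac = np.empty(ngpt)
for igpt in range(ngpt):
    pfrac[igpt] = planck_frac[igpt] * w[igpt]    # stand-in for interpolate3D
out_unfused = np.empty(ngpt)
for igpt in range(ngpt):
    out_unfused[igpt] = pfrac[igpt] * planck_src[igpt]

# Fused: pfrac never leaves a local scalar; no intermediate array needed.
out_fused = np.empty(ngpt)
for igpt in range(ngpt):
    p = planck_frac[igpt] * w[igpt]
    out_fused[igpt] = p * planck_src[igpt]

assert np.allclose(out_unfused, out_fused)
```

In the CUDA kernel the payoff is eliminating the global-memory round-trip for pfrac, not the Python loop count.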
@benvanwerkhoven, I think that's exactly what I already tried in the cuda_planck_source branch (https://github.com/microhh/rte-rrtmgp-cpp/blob/cuda_planck_source/src_kernels_cuda/gas_optics_kernels.cu#L130); especially the removal of the pfrac store/load gave quite a performance improvement. I'm still not convinced about making the gpt loop parallel: (gpt_start, gpt_end) are variable and depend on the value of ibnd, and planck_function_lay/lev/... are independent of gpt, so you would probably have to pre-calculate them. But it is worth a try, I'll look closer into that kernel today.
I tried the same principle on the tau_minor kernel and it gave a huge speedup, I think it's definitely worth the try. Having the gpts in parallel by the thread block x-dimension really improves the memory access pattern on almost all global memory accesses in the kernel.
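The coalescing argument can be sketched with a toy address calculation, assuming a row-major layout where igpt is the fastest-varying index (consistent with the contiguous g-point layout discussed earlier): mapping x-threads to igpt gives stride-1 accesses, while mapping them to another dimension gives stride-ngpt:

```python
ngpt = 16  # illustrative g-point count

def addr(ilay, igpt):
    # Flat element index for a row-major [ilay][igpt] array:
    # igpt varies fastest in memory.
    return ilay * ngpt + igpt

# x-threads over igpt: 8 adjacent threads touch 8 adjacent elements (coalesced).
coalesced = [addr(3, t) for t in range(8)]
assert all(b - a == 1 for a, b in zip(coalesced, coalesced[1:]))

# x-threads over ilay: 8 adjacent threads are ngpt elements apart (uncoalesced).
strided = [addr(t, 3) for t in range(8)]
assert all(b - a == ngpt for a, b in zip(strided, strided[1:]))
```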
I got @gemoulard's approach working, with the x-threads over igpt. After tuning, this resulted in a 2.9x speedup compared to my version of last Friday! I tried to get rid of some of the overlapping calculations (which don't depend on igpt, but do depend on ibnd, which does to some extent depend on igpt...) by using shared memory (like @benvanwerkhoven did for the tau_minor kernel), or by pre-calculating them in a separate kernel, but both approaches only made the kernel slower.
Profile, including Ben's work on tau_minor:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- --------- ------- ------- ----------------------------------------------------------------------------------------------------
21.6 31783324 8 3972915.5 3969855 3975520 void (anonymous namespace)::sw_source_adding_kernel<double>(int, int, int, signed char, double cons…
8.8 12911950 32 403498.4 397145 412314 void (anonymous namespace)::reorder123x321_kernel<double>(int, int, int, double const*, double*)
8.0 11755329 8 1469416.1 1465705 1480488 void (anonymous namespace)::lw_solver_noscat_step1_kernel<double>(int, int, int, double, signed cha…
7.7 11361992 8 1420249.0 1406121 1433833 void (anonymous namespace)::lw_solver_noscat_step2_kernel<double>(int, int, int, double, signed cha…
7.5 10975085 8 1371885.6 1370282 1377578 void (anonymous namespace)::compute_tau_rayleigh_kernel<double>(int, int, int, int, int, int, int, …
7.1 10432085 8 1304010.6 1301419 1306410 void (anonymous namespace)::sw_2stream_kernel<double>(int, int, int, double, double const*, double …
6.4 9388584 16 586786.5 543927 630613 void (anonymous namespace)::compute_tau_major_absorption_kernel<double>(int, int, int, int, int, in…
6.3 9272618 32 289769.3 181181 357658 void (anonymous namespace)::compute_tau_minor_absorption_kernel<double>(int, int, int, int, int, in…
5.9 8718160 8 1089770.0 1089102 1090638 void (anonymous namespace)::lw_solver_noscat_step3_kernel<double>(int, int, int, double, signed cha…
5.1 7492551 8 936568.9 932113 939313 void (anonymous namespace)::combine_and_reorder_2str_kernel<double>(int, int, int, double, double c…
5.0 7316362 8 914545.3 909714 920433 void (anonymous namespace)::Planck_source_kernel<double>(int, int, int, int, int, int, int, int, in…
4.4 6490772 40 162269.3 151293 178429 void (anonymous namespace)::sum_broadband<double>(int, int, int, double const*, double*)
3.2 4766098 16 297881.1 268700 328699 void (anonymous namespace)::interpolation_kernel<double>(int, int, int, int, int, int, int, double,…
I merged the code into the cuda branch. There was no need to calculate ibnd from igpt: there was already a lookup table with that info available. Removing the loop which did the gpt->bnd conversion gives another 7% speedup.
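A Python sketch of the difference (band limits are illustrative): the per-thread search loop versus a precomputed gpt-to-band lookup table, which give identical answers for a hole-free layout:

```python
import numpy as np

# Illustrative band limits: 1-based, inclusive [gpt_start, gpt_end] per band.
band_lims_gpt = np.array([[1, 16], [17, 32], [33, 48]])

def bnd_by_search(igpt):
    """The loop a thread would run to find the (0-based) band of igpt."""
    for ibnd, (gs, ge) in enumerate(band_lims_gpt):
        if gs <= igpt <= ge:
            return ibnd
    raise ValueError(f"g-point {igpt} outside all bands")

# Precomputed lookup table: one read replaces the search loop.
ngpt = int(band_lims_gpt[-1, 1])
gpt_to_bnd = np.empty(ngpt, dtype=np.int32)
for ibnd, (gs, ge) in enumerate(band_lims_gpt):
    gpt_to_bnd[gs - 1:ge] = ibnd

assert all(gpt_to_bnd[i - 1] == bnd_by_search(i) for i in range(1, ngpt + 1))
```

On the GPU the table turns a data-dependent loop into a single coalesced load per thread.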
First tuning of Planck_source_kernel by @julietbravo gave (again) a very small block size of (1, 3, 2). Please have another look, @benvanwerkhoven and @isazi.