Chiil opened 3 years ago
@julietbravo could you post the scatter plot from your slide here as well?
I can imagine that if there are almost no coalesced memory accesses in the kernel, there is not much benefit in increasing the block sizes...
I will run the code in nsight-compute to see if there is a reasonable explanation for the performance of the best configuration.
@julietbravo how can I generate the binary files with the data? I am trying to run rcemip but I get the following errors:
(venv) [15:29] [alessio@node053] [~/src/rte-rrtmgp-cpp/rcemip]$ ./test_rte_rrtmgp
###### Starting RTE+RRTMGP solver ######
Solver settings:
cloud-optics = false
fluxes = true
longwave = true
output-bnd-fluxes = false
output-optical = false
shortwave = true
Reading atmospheric input data from NetCDF.
WARNING: Gas "co" not available in input file.
WARNING: Gas "ccl4" not available in input file.
WARNING: Gas "cfc11" not available in input file.
WARNING: Gas "cfc12" not available in input file.
WARNING: Gas "cfc22" not available in input file.
WARNING: Gas "hfc143a" not available in input file.
WARNING: Gas "hfc125" not available in input file.
WARNING: Gas "hfc23" not available in input file.
WARNING: Gas "hfc32" not available in input file.
WARNING: Gas "hfc134a" not available in input file.
WARNING: Gas "cf4" not available in input file.
WARNING: Gas "no2" not available in input file.
Preparing NetCDF output file.
Initializing the longwave solver.
Solving the longwave radiation.
EXCEPTION: play is out of range
(venv) [15:29] [alessio@node053] [~/src/rte-rrtmgp-cpp/rcemip]$
This error is the result of the more strict range checking that I implemented on the CPU side. I have pushed a fix for this now. This error is from the CPU solver, because I still need to implement these checks on the GPU side. Please make sure to use the test_rte_rrtmgp_gpu executable.
The .bin files are generated by the cuda_dump_bins branch. Running the rcemip case should result in a bunch of .bin files in that directory. This branch is not really usable for anything else, as it throws directly after calling the kernels for the first time: https://github.com/microhh/rte-rrtmgp-cpp/blob/cuda_dump_bins/src_kernels_cuda/gas_optics_kernel_launchers.cu#L386
Refactored the tuning script with the idea to use it both for tuning, and for executing the best performing (or any) configuration to help profiling.
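A minimal sketch of what such a dual-purpose script could look like, in Python with Kernel Tuner. The flag names, defaults, tuning space, and the `load_kernel_and_data` helper are all hypothetical; the actual script in the repository may be structured differently:

```python
import argparse

def build_parser():
    # Flag names and defaults here are illustrative, not the real script's.
    parser = argparse.ArgumentParser(
        description="Tune a kernel, or run one configuration for profiling")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--tune", action="store_true",
                      help="search the full tuning parameter space")
    mode.add_argument("--run", action="store_true",
                      help="run a single (e.g. best-known) configuration")
    parser.add_argument("--block_size_x", type=int, default=14)
    parser.add_argument("--block_size_y", type=int, default=1)
    parser.add_argument("--block_size_z", type=int, default=32)
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported lazily so the script can be inspected/tested without a GPU.
    import kernel_tuner
    # load_kernel_and_data() is a hypothetical helper that would read the
    # kernel source and the .bin input/reference data.
    kernel_source, problem_size, arguments = load_kernel_and_data()
    if args.tune:
        tune_params = {"block_size_x": [1, 2, 4, 8, 14, 16, 32],
                       "block_size_y": [1, 2, 4],
                       "block_size_z": [1, 2, 4, 8, 16, 32]}
        kernel_tuner.tune_kernel("Planck_source_kernel", kernel_source,
                                 problem_size, arguments, tune_params)
    else:
        params = {"block_size_x": args.block_size_x,
                  "block_size_y": args.block_size_y,
                  "block_size_z": args.block_size_z}
        kernel_tuner.run_kernel("Planck_source_kernel", kernel_source,
                                problem_size, arguments, params)
```

With this layout, `--tune` explores the space while `--run` replays one configuration under a profiler such as Nsight Compute.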
I made a draft (not tested) of the kernel to improve parallelism and memory coalescing. It should work if all gpts (bands) are contiguous in memory without any "holes".
The g-points and bands are contiguous in memory, e.g. 1 2 3 ... 15 16 (band1) 17 18 ... 31 32 (band2)
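A small Python check of that assumption (the band limits below are made up; RRTMGP stores the real ones in its lookup tables): the layout is usable by the draft kernel only if each band's g-point range starts right after the previous one ends, with no holes:

```python
import numpy as np

# Toy band-limits table: 1-based, inclusive [gpt_start, gpt_end] per band.
# Values are illustrative only.
band_lims_gpt = np.array([[1, 16], [17, 32], [33, 48]])

def gpts_are_contiguous(band_lims):
    """True if band N+1 starts right after band N ends, with no holes."""
    starts, ends = band_lims[:, 0], band_lims[:, 1]
    return starts[0] == 1 and np.all(starts[1:] == ends[:-1] + 1)

assert gpts_are_contiguous(band_lims_gpt)
```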
Together with @julietbravo I took a look at the Planck source kernel. We don't fully understand the changes proposed by @gemoulard, but perhaps we can discuss these changes with George Emmanuel in a short meeting at some point.
One observation we made when inspecting the code is that the index in the x-dimension (the index of the band) is used to look up gpt_start and gpt_end, so different threads in the x-dimension may see different values for start and stop. These values are used as start and stop conditions for several of the for-loops in the kernel. This may cause performance issues; it is better if threads that are adjacent in the x-dimension have the same loop iteration counts.
Also, the last two for-loops can probably be merged.
We would like to experiment with this kernel, but we first need to fix the tuning script. @julietbravo is going to make the tuning script work with the latest bin files for the Planck source kernel, and I will experiment with swapping the dimensions in the kernel after that.
@benvanwerkhoven, strange, the correctness checks pass (on a V100):
strat008@euw-vm-bvs-slocs-o:~/models/rte-rrtmgp-cpp/tuning_kernels_cuda$ python3 planck_source_kernel.py --run
Running Planck_source_kernel<double> [block_size_x: 14, block_size_y: 1, block_size_z: 32]
Using: NVIDIA Tesla V100-PCIE-16GB
results for sfc_src: OKAY!
results for lay_src: OKAY!
results for lev_src_inc: OKAY!
results for lev_srd_dec: OKAY!
results for sfc_src_jac: OKAY!
I just discovered that my bin files were probably generated with type_bool = 8 bits rather than 32 bits; this, rather than the difference in GPUs, causes the discrepancy between our setups.
For the record, it indeed turned out that it was a problem with type_bool. The bins were generated with 8bit bools, but the code compiled with Kernel Tuner was defaulting to 32bit bools. We've switched to using only 8bit bools everywhere now and the errors have disappeared.
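A tiny Python/numpy illustration (file name and values made up) of why the width mismatch corrupts the data: four 8-bit bools on disk, read back assuming 32-bit elements, yield a single bogus value instead of four flags:

```python
import numpy as np
import os, tempfile

# Write four 8-bit bools to a temporary .bin file, mimicking the dump branch.
flags = np.array([True, False, True, True])          # numpy bool_: 1 byte each
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    flags.tofile(f)
    path = f.name

wrong = np.fromfile(path, dtype=np.int32)    # 4 bytes consumed per "bool"
right = np.fromfile(path, dtype=np.bool_)    # matches the on-disk width

assert wrong.size == 1                       # one garbage 32-bit value
assert np.array_equal(right, flags)          # correct round-trip
os.remove(path)
```

The same mismatch in the kernel-tuning setup silently misinterprets every boolean input array, which is consistent with the correctness checks passing on one setup and failing on another.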
I took another look at the Planck source kernel using the lessons learned from working on the tau_minor kernel.
We can apply the same trick to the Planck source kernel as for tau_minor. The 3D interpolate by flav can be inlined and all 4 loops can be fused, removing the need for loads and stores to pfrac, which can then just be a register. This does require that the if on ilay for the surface is moved into the loop, and we'll need to check some of the computations that are currently outside the gpt-start-end loops.
I also think it might be better to iterate over the bands, as is done with nscales in tau_minor, rather than parallelizing this using the block x-dimension. In this way we could perhaps, like @gemoulard proposed, use the x-dim to parallelize the gpt loop. That should improve coalescing for nearly all global memory loads/stores, but it also requires either using a subset of the threads for the work currently outside the gpt loops or duplicating a few computations.
So I would start with inlining interpolate3D, fusing all loops, and moving pfrac to a register. Then compare the parallelization across bands vs. across gpts. You could even use a max_gpt of some size if the gpts per band never exceed a small value like 16 or 32, as is the case with the use_shared_tau version of the tau_minor kernel.
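A numpy toy model of the fusion idea, with stand-in arithmetic in place of the real interpolate3D and Planck computations: the unfused version materializes pfrac as an array (global memory traffic in the kernel), the fused version keeps it in a scalar ("a register"), and both produce identical results:

```python
import numpy as np

rng = np.random.default_rng(0)
ngpt = 16
w = rng.random(ngpt)                 # stand-in for interpolation weights
planck_frac = rng.random(ngpt)       # stand-in for the interpolated fraction
planck_src = rng.random(ngpt)        # stand-in for the Planck source terms

# Unfused: one loop writes pfrac to memory, a second loop reads it back.
pfrac = np.empty(ngpt)
for igpt in range(ngpt):
    pfrac[igpt] = planck_frac[igpt] * w[igpt]    # stand-in for interpolate3D
out_unfused = np.empty(ngpt)
for igpt in range(ngpt):
    out_unfused[igpt] = pfrac[igpt] * planck_src[igpt]

# Fused: pfrac never leaves a local scalar; no intermediate array needed.
out_fused = np.empty(ngpt)
for igpt in range(ngpt):
    p = planck_frac[igpt] * w[igpt]
    out_fused[igpt] = p * planck_src[igpt]

assert np.allclose(out_unfused, out_fused)
```

In the CUDA kernel the payoff is eliminating the global-memory round-trip for pfrac, not the Python loop count.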
@benvanwerkhoven, I think that's exactly what I already tried in the cuda_planck_source branch (https://github.com/microhh/rte-rrtmgp-cpp/blob/cuda_planck_source/src_kernels_cuda/gas_optics_kernels.cu#L130); especially the removal of the pfrac store/load gave quite a performance improvement. I'm still not convinced about making the gpt loop parallel: (gpt_start, gpt_end) are variable and depend on the value of ibnd, and planck_function_lay/lev/... are independent of gpt, so you would probably have to pre-calculate them. But it is worth a try, I'll look closer into that kernel today.
I tried the same principle on the tau_minor kernel and it gave a huge speedup, I think it's definitely worth the try. Having the gpts in parallel by the thread block x-dimension really improves the memory access pattern on almost all global memory accesses in the kernel.
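The coalescing argument can be sketched with a toy address calculation, assuming a row-major layout where igpt is the fastest-varying index (consistent with the contiguous g-point layout discussed earlier): mapping x-threads to igpt gives stride-1 accesses, while mapping them to another dimension gives stride-ngpt:

```python
ngpt = 16  # illustrative g-point count

def addr(ilay, igpt):
    # Flat element index for a row-major [ilay][igpt] array:
    # igpt varies fastest in memory.
    return ilay * ngpt + igpt

# x-threads over igpt: 8 adjacent threads touch 8 adjacent elements (coalesced).
coalesced = [addr(3, t) for t in range(8)]
assert all(b - a == 1 for a, b in zip(coalesced, coalesced[1:]))

# x-threads over ilay: 8 adjacent threads are ngpt elements apart (uncoalesced).
strided = [addr(t, 3) for t in range(8)]
assert all(b - a == ngpt for a, b in zip(strided, strided[1:]))
```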
I got @gemoulard's approach working, with the x-threads over igpt. After tuning, this resulted in a 2.9x speedup compared to my version of last Friday! I tried to get rid of some of the overlapping calculations (which don't depend on igpt, but do depend on ibnd, which does to some extent depend on igpt...) by using shared memory (like @benvanwerkhoven did for the tau_minor kernel), or by pre-calculating them in a separate kernel, but both approaches only made the kernel slower.
Profile, including Ben's work on tau_minor:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- --------- ------- ------- ----------------------------------------------------------------------------------------------------
21.6 31783324 8 3972915.5 3969855 3975520 void (anonymous namespace)::sw_source_adding_kernel<double>(int, int, int, signed char, double cons…
8.8 12911950 32 403498.4 397145 412314 void (anonymous namespace)::reorder123x321_kernel<double>(int, int, int, double const*, double*)
8.0 11755329 8 1469416.1 1465705 1480488 void (anonymous namespace)::lw_solver_noscat_step1_kernel<double>(int, int, int, double, signed cha…
7.7 11361992 8 1420249.0 1406121 1433833 void (anonymous namespace)::lw_solver_noscat_step2_kernel<double>(int, int, int, double, signed cha…
7.5 10975085 8 1371885.6 1370282 1377578 void (anonymous namespace)::compute_tau_rayleigh_kernel<double>(int, int, int, int, int, int, int, …
7.1 10432085 8 1304010.6 1301419 1306410 void (anonymous namespace)::sw_2stream_kernel<double>(int, int, int, double, double const*, double …
6.4 9388584 16 586786.5 543927 630613 void (anonymous namespace)::compute_tau_major_absorption_kernel<double>(int, int, int, int, int, in…
6.3 9272618 32 289769.3 181181 357658 void (anonymous namespace)::compute_tau_minor_absorption_kernel<double>(int, int, int, int, int, in…
5.9 8718160 8 1089770.0 1089102 1090638 void (anonymous namespace)::lw_solver_noscat_step3_kernel<double>(int, int, int, double, signed cha…
5.1 7492551 8 936568.9 932113 939313 void (anonymous namespace)::combine_and_reorder_2str_kernel<double>(int, int, int, double, double c…
5.0 7316362 8 914545.3 909714 920433 void (anonymous namespace)::Planck_source_kernel<double>(int, int, int, int, int, int, int, int, in…
4.4 6490772 40 162269.3 151293 178429 void (anonymous namespace)::sum_broadband<double>(int, int, int, double const*, double*)
3.2 4766098 16 297881.1 268700 328699 void (anonymous namespace)::interpolation_kernel<double>(int, int, int, int, int, int, int, double,…
I merged the code into the cuda branch. There was no need to calculate ibnd from igpt: there was already a lookup table with that info available. Removing the loop which did the gpt->bnd conversion gives another 7% speedup.
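A Python sketch of the difference (band limits are illustrative): the per-thread search loop versus a precomputed gpt-to-band lookup table, which give identical answers for a hole-free layout:

```python
import numpy as np

# Illustrative band limits: 1-based, inclusive [gpt_start, gpt_end] per band.
band_lims_gpt = np.array([[1, 16], [17, 32], [33, 48]])

def bnd_by_search(igpt):
    """The loop a thread would run to find the (0-based) band of igpt."""
    for ibnd, (gs, ge) in enumerate(band_lims_gpt):
        if gs <= igpt <= ge:
            return ibnd
    raise ValueError(f"g-point {igpt} outside all bands")

# Precomputed lookup table: one read replaces the search loop.
ngpt = int(band_lims_gpt[-1, 1])
gpt_to_bnd = np.empty(ngpt, dtype=np.int32)
for ibnd, (gs, ge) in enumerate(band_lims_gpt):
    gpt_to_bnd[gs - 1:ge] = ibnd

assert all(gpt_to_bnd[i - 1] == bnd_by_search(i) for i in range(1, ngpt + 1))
```

On the GPU the table turns a data-dependent loop into a single coalesced load per thread.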
First tuning of Planck_source_kernel by @julietbravo gave (again) a very small block size of (1, 3, 2). Please have another look, @benvanwerkhoven and @isazi.