stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License

[LFRic] Add colour-tiling transformation #2244

Open tinyendian opened 1 year ago

tinyendian commented 1 year ago

A new mesh_tiling module will soon be available in LFRic, which adds methods to mesh objects that PSyclone can use for generating OpenMP-parallelised tiled loops, such as

if (mesh%is_tiled()) then
  tmap => mesh%get_coloured_tiling_map()
  do colour = 1, mesh%get_ntilecolours()
    !$omp parallel default(shared), private(tile, cell)
    !$omp do schedule(static)
    do tile = 1, mesh%get_last_halo_tile_per_colour(colour, 1)
      do cell = 1, mesh%get_last_halo_cell_per_colour_and_tile(colour, tile, 1)
        call kernel(tmap(colour, tile, cell), ...)
      end do
    end do
    !$omp end do
    !$omp end parallel
  end do
else
  ! Use coloured loop to ensure thread safety
end if

The API is very similar to the existing colouring API, with tiles added as an extra dimension. Tiling is not guaranteed to be available for each mesh (e.g., some multigrid levels may be too coarse for tiling, or a user decides not to configure it), so colouring needs to be kept as a fall-back option to ensure thread-safety. Tiling is also unlikely to be beneficial for LFRic kernels where no function space shares DOFs between neighbour cells, so it can probably be restricted to kernels that need colouring.
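To make the "tiles as an extra dimension" idea concrete, here is a minimal sketch in Python. It assumes a plain rectangular grid with row-major cell numbering and a simple 2x2 colouring of tiles. This is not the LFRic `mesh_tiling` algorithm, and the names `coloured_tiling_map`, `vertices` and `is_thread_safe` are made up for illustration. Indices are 0-based here, unlike the Fortran above.

```python
from itertools import product


def coloured_tiling_map(nx, ny, tile_x, tile_y):
    """Return tmap[colour][tile] -> list of cell ids for an nx-by-ny grid.

    Cells are numbered row-major (cell = j * nx + i). Tiles are coloured in
    a 2x2 pattern, so two tiles of the same colour never touch, not even at
    a corner. Hypothetical helper, not the LFRic implementation.
    """
    tiles_x = -(-nx // tile_x)  # ceiling division
    tiles_y = -(-ny // tile_y)
    tmap = [[] for _ in range(4)]
    for tj, ti in product(range(tiles_y), range(tiles_x)):
        colour = 2 * (tj % 2) + (ti % 2)
        cells = [
            j * nx + i
            for j in range(tj * tile_y, min((tj + 1) * tile_y, ny))
            for i in range(ti * tile_x, min((ti + 1) * tile_x, nx))
        ]
        tmap[colour].append(cells)
    return tmap


def vertices(cell, nx):
    """Ids of the four corner vertices of a cell on the (nx+1)-wide vertex grid."""
    j, i = divmod(cell, nx)
    return {(j + dj) * (nx + 1) + (i + di) for dj in (0, 1) for di in (0, 1)}


def is_thread_safe(tmap, nx):
    """True if no two tiles of the same colour share a vertex."""
    for tiles in tmap:
        seen = set()
        for cells in tiles:
            tile_vertices = set().union(*(vertices(c, nx) for c in cells))
            if seen & tile_vertices:
                return False
            seen |= tile_vertices
    return True
```

In this toy setting, tiles of one colour can be processed in parallel because they share no vertex DOFs, while the cells inside a tile are visited sequentially by a single thread.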

Documentation can be found here: https://code.metoffice.gov.uk/trac/lfric/wiki/LFRicInfrastructure/MeshTiling

The implementation ticket is here: https://code.metoffice.gov.uk/trac/lfric/ticket/3572

An example of a tiled loop implementation is here: https://code.metoffice.gov.uk/trac/lfric/browser/LFRic/branches/dev/wolfganghayek/r41055_colour_tiling/gungho/source/psy/psykal_lite_mod.F90?rev=43701#L1395

hiker commented 1 year ago

I am not sure we would want to combine tiling with colouring, since tiling might be useful for single-threaded setups as well. Do you have any idea? If it makes sense, could you give a code example of tiling without colouring? For example, I would assume that do tile = 1, mesh%get_last_halo_tile_per_colour(colour, 1) would be different.

What is the difference between mesh%get_ntilecolours() and mesh%get_ncolours()? I assumed that the colour loop would be identical, but apparently that's not the case.

In general, I would expect four transformations to be applied in the following order (based on what LFRic does at the moment):

for loop in all_loops:
    if ...:
        colour_trans.apply(loop)                              # Colour trans

for loop in all_loops:
    if loop.type != "colour":                                 # Don't parallelise the outer 'coloured' loop, only the inner ones
        omp_parallel_trans.apply(loop)                        # OMP parallel
        omp_do_trans.apply(loop, options={"reprod": True})    # omp do

for loop in all_loops:
    if loop.type == "colour":                                 # tile the coloured loop
        tile_trans.apply(loop)

We need to think about the loop types, since the mapping variable changes. At the moment we only need to test for colouring, but potentially we would have:

  1. coloured --> call kernel(cmap, ..., map_aspc2_x_vec(:,cmap(colour, cell)))
  2. coloured & tiled --> call kernel(tmap, ..., map_aspc2_x_vec(:,tmap(colour, tile, cell)))
  3. tiled (if useful?) --> call kernel(cmap, ..., map_aspc2_x_vec(:,tmap(1, tile, cell))) (hard-code '1' as colour?)

Would it be better to just have one 'mapping' that would include tiling (and if tiling is not used, the tiling variable would be set to '1')?
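One way to picture the single-mapping idea is the following Python sketch, in which all three cases listed above are expressed as one three-level map `tmap[colour][tile] -> cells`. The helper names are hypothetical and nothing here reflects an existing PSyclone or LFRic API. Indices are 0-based, so the hard-coded '1' becomes index 0.

```python
def coloured_as_tmap(cmap):
    """cmap[colour] -> cells becomes tmap[colour][0] -> cells (one tile per colour)."""
    return [[list(cells)] for cells in cmap]


def tiled_only_as_tmap(tiles):
    """tiles[tile] -> cells becomes tmap[0][tile] -> cells (one hard-coded colour)."""
    return [[list(cells) for cells in tiles]]


def visit_order(tmap):
    """Flatten the generic colour / tile / cell loop nest into the visiting order."""
    return [cell for tiles in tmap for cells in tiles for cell in cells]


# Case 1: coloured only, wrapped so that the same loop nest can be used.
cmap = [[0, 2], [1, 3]]
assert visit_order(coloured_as_tmap(cmap)) == [0, 2, 1, 3]

# Case 3: tiled only, with a single colour.
tiles = [[0, 1], [2, 3]]
assert visit_order(tiled_only_as_tmap(tiles)) == [0, 1, 2, 3]
```

One consequence worth noting: with a single tile per colour the tile loop has only one iteration, so in the coloured-only case the OpenMP worksharing would presumably have to stay on the cell loop rather than the tile loop.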

tinyendian commented 1 year ago

Hi @hiker, with regard to using tiling in the non-OpenMP case, you are probably right: tiling may still be beneficial in some situations. The benefits may be smaller and will probably be highly case-dependent (which is true for tiling in general). I think LFRic orders local neighbour cells in a given rectangular partition contiguously along one dimension, which is equivalent to (1xn) tiling and should result in a good degree of cache reuse compared to colouring, where this order of computation is broken up. Tiling also comes with a bit of overhead for each loop, but this may be negligible, except perhaps for the most lightweight kernels.
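The difference in visiting order can be illustrated with a toy Python sketch. It assumes a rectangular grid with row-major cell numbering and a 2x2 cell colouring, and it uses "consecutive cells share a vertex" as a crude stand-in for cache reuse. This is a hypothetical metric for illustration only, not a measurement of LFRic behaviour.

```python
def shares_vertex(a, b, nx):
    """True if cells a and b (row-major ids, nx-wide grid) touch at an edge or corner."""
    ja, ia = divmod(a, nx)
    jb, ib = divmod(b, nx)
    return abs(ja - jb) <= 1 and abs(ia - ib) <= 1


def reuse_count(order, nx):
    """Number of consecutive cell pairs in a visiting order that share a vertex."""
    return sum(shares_vertex(a, b, nx) for a, b in zip(order, order[1:]))


def coloured_order(nx, ny):
    """Visit cells colour by colour, using a 2x2 cell colouring."""
    return [j * nx + i
            for colour in range(4)
            for j in range(ny) for i in range(nx)
            if 2 * (j % 2) + (i % 2) == colour]


def row_order(nx, ny):
    """Visit cells contiguously along rows, i.e. the (1xn) ordering."""
    return list(range(nx * ny))


print(reuse_count(row_order(8, 8), 8))       # 56
print(reuse_count(coloured_order(8, 8), 8))  # 0
```

On this 8x8 toy grid, the contiguous ordering keeps neighbouring cells next to each other in the loop, whereas the coloured ordering by construction never visits two touching cells in a row.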

Whether or not this use case should be included should be discussed on the LFRic side - as tiling is just an optimisation, the main targets were production cases that will very likely run in an MPI+OpenMP configuration and will thus need colouring, while trying to keep the Fortran module as simple as possible (it is already ~1350 lines of code).

The number of tile colours will probably be the same as the number of cell colours in many, if not all cases, but this is not guaranteed as algorithms might change, so it seemed best to keep it separate.

arporter commented 1 year ago

(Deleted my comment as I took a look at the PSyKAl-Lite code and saw that my fears were unfounded :-) ).

tinyendian commented 1 year ago

Hi @arporter, it's a good point about the difficulties with type-bound procedures. I did run into problems even on CPU, where calling type-bound procedures in OpenMP regions caused crashes, but I implemented API methods that return arrays in the same way as the colouring API. I just wanted to keep the above code fragment simple, so I used type-bound procedures there.

christophermaynard commented 1 year ago

One could have tiling without colouring but, as @tinyendian points out, that would only be used in a non-OpenMP case. Making LFRic physics work with OpenMP is part of the plan and I don't envisage running without OpenMP thereafter. We can keep it in our back pocket if we need it, but I would say it isn't a priority. The issue is not yet "active", so can we discuss who might do it at the next PSyclone weekly?

sergisiso commented 1 year ago

I can take this one, as it is related to the performance projects. I will start this week.