stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License
104 stars 28 forks source link

FFSL transport in LFRic depends on a kernel that loops over halo cells #2661

Open tommbendall opened 2 months ago

tommbendall commented 2 months ago

One crucial part of the FFSL transport scheme in LFRic involves a kernel which loops over only cells in the halos. This is obviously not currently supported by PSyclone, so is currently done through a psykal_lite implementation: see the invoke_ffsl_panel_swap_kernel_type here: https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/gungho/source/psy/psykal_lite_transport_mod.F90?rev=755#L786

To summarise what needs to happen here: we need to perform a halo exchange to the full halo depth, and loop over all halo cells (and not owned cells)

This loop over the halo cells doesn't really fit in the current design of LFRic kernels, but to avoid a psykal_lite implementation, I think we need new metadata. Two first suggestions I have thought of are:

Mea culpa: I should've opened this issue several months ago when the code was added to trunk. I overlooked it and recently remembered, so sorry about that!

tommbendall commented 2 months ago

I thought I'd just add another comment here, as there is a whole category of kernels that currently have psykal lite implementations because they loop to some depth in the halo regions, before setting the output field to be clean (to ensure that halo swaps don't happen subsequently on these fields).

It might be helpful to tackle this other problem together, with whatever solution is best for this issue

arporter commented 2 months ago

Thanks @tommbendall. Those kernels that "loop to some depth in the halo regions" - is that depth set at compile-time or run-time? To summarise, we have:

It feels like that information naturally belongs to operates_on as that's what that describes. The access permissions relate to how dofs in a given cell are accessed so I don't think that works in this context.

tommbendall commented 2 months ago

Thanks Andy, I agree with that summary and that it makes most sense to add this through the operates_on variable. I suppose that the invoke would also require a depth argument.

arporter commented 1 month ago

Currently, the docs for operates_on say: image We need to extend this to support the two further cases given above. Really, CELL_COLUMN is effectively OWNED_CELL_COLUMN and then we would naturally have HALO_CELL_COLUMN and OWNED_PLUS_HALO_CELL_COLUMN. For backwards compatibility we can make OWNED_CELL_COLUMN a synonym for the existing CELL_COLUMN.

As Tom says, both of the new values of 'operates-on' will require a halo depth to be supplied. If this is only known at run-time then it will need to be passed to the kernel in the invoke, e.g.:

call invoke(a_new_kern(halo_depth, field1, field2))

I suggest that we pass this depth to the kernel subroutine immediately following the nlayers argument as it is not associated with any particular kernel argument.

However, if the depth for a particular kernel is always going to be fixed then we could think about extending the metadata to capture this value.

How does that sound @tommbendall and @TeranIvy?

tommbendall commented 1 month ago

Thanks Andy, that sounds really good to me.

Trying to think ahead, the depth to which one would loop into the haloes would probably be:

With what you're suggesting, I think that could still all be covered by the user getting that information in the algorithm layer of the code

arporter commented 1 month ago

Thanks Tom. Yes, your last two options could be captured in meta-data but that would require more engineering. However, if they are going to be the bulk of the use cases then it's probably worth implementing this as otherwise that's a lot of duplication in various algorithm layer code.

tommbendall commented 1 month ago

Right now, I think there are only a handful of these kernels but they are pretty much evenly split between the above cases!

Because there aren't many kernels, I'm not worried about algorithm-layer duplication, but I'd be more worried about the amount of engineering that you have to do to cover a couple of kernels.

So to me, it seems the best option right now is to always require that the user passes in the halo depth towhich they want to loop. This still allows all situations, since the user can can still decide in the algorithm layer if this is 1, the full depth or the clean depth. That also hopefully simplifies the psyclone side of things...

arporter commented 2 weeks ago

Right, so now I have: image and image does that look OK @tommbendall ?

tommbendall commented 2 weeks ago

Hi Andy, yes that looks great, thanks.

One final thought I have is about the description of owned_cell_column. If a kernel is performing an INC operation on a continuous field, we currently (correctly) loop into the halos to depth 1. I think we still want to specify this situation with the cell_column metadata -- I don't know if we should update the description for owned_cell_column to reflect that nuance.

Maybe this is overcomplicating things so please feel free to ignore, but the description could say: "Single column of cells from the owned region. If performing an INC operation on continuous fields this will loop to depth one in the halo, but otherwise the columns are exclusively from the owned region."

arporter commented 1 week ago

Finally getting back to this @tommbendall. I've updated the doc as per your suggestion. Hopefully I'm correct in thinking that a kernel that operates on HALO_CELL_COLUMN or OWNED_AND_HALO_CELL_COLUMN can't have any stencil accesses?

tommbendall commented 1 week ago

Thanks Andy. Sorry to be a pain but actually I think there is a kernel which would loop into the haloes and also use a stencil: https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/gungho/source/kernel/transport/common/remap_on_extended_mesh_kernel_mod.F90

which currently has this psykal_lite implementation: https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/gungho/source/psy/psykal_lite_transport_mod.F90#L212

Is it possible to be able to support stencils and looping into the halos? I think the main restriction would be that the stencil_depth + halo_depth_to_iterate_over can't exceed the full halo depth

arporter commented 1 week ago

Aha! So much for my thinking :-)

arporter commented 1 week ago

TODO:

The psykal_lite that Tom points to above contains:

cell_start = mesh%get_last_edge_cell() + 1
cell_end   = mesh%get_last_halo_cell(halo_compute_depth)

for the upper loop boundary. halo_compute_depth is passed into this PSy routine by argument.

arporter commented 1 week ago

Another question for @tommbendall : what should we do if distributed-memory support is switched off? Can we 'optimise out' any calls to kernels that have operates_on=HALO_CELL_COLUMN? Should everything else 'just work' (i.e., the call to mesh%get_last_halo_cell(...) will do the right thing)?

EDIT: actually, by default, PSyclone won't bother to get a mesh object in this case so perhaps we just call any OWNED_AND_HALO_CELL_COLUMN kernels with a halo depth of zero?

tommbendall commented 1 week ago

Thanks for the question! I think OWNED_AND_HALO_CELL_COLUMN kernels should be executed with a halo depth of zero (so that they become equivalent to OWNED_CELL_COLUMN kernels. As you say, HALO_CELL_COLUMN kernels should then not be executed at all. I think even those might just work though...

arporter commented 6 days ago

I'm starting to get there now but am painfully aware that this is probably going to clash horribly with what @sergisiso is doing in moving LFRic to use the PSyIR backend. Is it worth me merging your branch into mine do you think?

sergisiso commented 5 days ago

As discussed yesterday, feel free to go ahead implementing this on master. Although the new backend failing tests are going down, it will still take some months until everything is clean up and ready to merge.

arporter commented 15 hours ago

I had a brainwave and thought that perhaps I could use the existing Dynamo0p3RedundantComputationTrans to implement the necessary functionality. However, that only accepts an integer for the depth of the halo to compute into. Unfortunately, the loop-bounds and halo-exchange logic is not coded very well - having an integer depth is handled differently to having a variable specifying the depth (as is permitted for a kernel with a stencil). Also, in the latter case, the halo is only ever read so that only affects the halo-exchange calls, not the loop bounds.

arporter commented 11 hours ago

I was going to allow halo kernels to be optimised out of the PSy layer if distributed-memory is disabled. However, that causes a lot of issues as we can then have an empty invoke and we simply aren't setup to handle this situation. Since it's an edge case, it doesn't feel like it's worth putting a lot of effort into it at present.

arporter commented 10 hours ago

@tommbendall just a heads-up that argument_mod.F90 in LFRic core will need to have the new iteration spaces added to it. In the copy in our test infrastructure I have:

  !> @defgroup operates_on Enumeration of kernel iterator property descriptors.
  !> @{
  integer, public, parameter :: CELL_COLUMN                = 396
  integer, public, parameter :: OWNED_CELL_COLUMN          = 396
  integer, public, parameter :: HALO_CELL_COLUMN           = 397
  integer, public, parameter :: OWNED_AND_HALO_CELL_COLUMN = 398
  integer, public, parameter :: DOMAIN                     = 945
  integer, public, parameter :: DOF                        = 712