GOcean accepts kernels that cannot be parallelised

stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran

BSD 3-Clause "New" or "Revised" License

104 stars 27 forks source link

GOcean accepts kernels that cannot be parallelised #2531

Open hiker opened 6 months ago

hiker commented 6 months ago

In some tests it was discovered that gocean accepts kernels that cannot be parallelised (because of a stencil access, and a write statement to the same field): https://github.com/stfc/PSyclone/pull/2497/files#r1505787613

I have to admit I wasn't aware that this is a requirement, but I think it also needs to be stressed more in the documentation. ATM we only have (https://psyclone.readthedocs.io/en/latest/kernel_layer.html#kernel-layer):

The reason for doing this is that it gives the PSy layer the responsibility of calling the Kernel over the spatial domain which is where parallelism is typically exploited in finite element and finite difference codes. The PSy layer is therefore able to call the kernel layer in a flexible way (blocked and/or in parallel for example). Kernel code in the kernel layer is not allowed to include any parallelisation calls or directives and works on raw Fortran arrays (to allow the compiler to optimise the code).

Not sure if I've missed a stronger statement elsewhere.

TODO:

Reject kernels that have read-write accesses to a field in a kernel (including checks for LFRic)
Update the documentation if required.

arporter commented 6 months ago

I think the rules are:

A kernel must only write to the current/'owned' cell.
Any fields with reads from neighbouring locations must be marked in the metadata as having a stencil access.

hiker commented 6 months ago

These statements do not mean that a kernel can get parallelised (which always was my understanding). The dependency tool are used to check this (in psyir/transformations/parallel_loop_trans).

So from that point of view, it sounds like this ticket can just be closed, since kernels that cannot be parallelised should be accepted.

arporter commented 6 months ago

Probably we are agreeing but just to be clear - it should be possible to parallelise (the loop over cells containing) a kernel but the metadata should contain enough information to allow PSyclone to deduce that e.g. colouring is required. (e.g. in LFRic, we can deduce when a kernel writes to a shared dof and ensure that such an access is protected appropriately.)

hiker commented 6 months ago

Hmm - I am not sure we agree :) Example, with a stencil of 010,010,010, which seems to fulfil the two rules: only write to a(i,j), and declare the stencil access (and we can even remove a(i,j) from the RHS):

   a(i,j) = a(i,j-1) + a(i,j) + a(i,j+1)

Given the normal loop ordering, this means that a(i,j-1) is the new value (updated in the previous iteration of the outer j loop), while a(i,j+1) is still the old value (to be updated next). This cannot be parallelised (over the j loop, and you could add accesses to i+-1 to also make parallelising the i-loop not only inefficient, but also invalid).

And I don't think colouring fixes this. Isn't the issue with LFRic and colouring if you have INC access to shared elements (i.e. a vertex belonging to several FEs is updated in each column, so you can end up having competing increments on the same vertex in parallel)? While in the gocean API this cannot happen at all (because you can only write to i,j)?

sergisiso commented 6 months ago

I think @hiker is right, in GOcean it does not happen because we have 2 copies of each field, the n(ow) and the a(fter). But we don't check if there are kernel dependencies that lead to "loop carried dependencies" when the kernel is the body of the loop and maybe we should or at least clarify that the "write" field can not be "read" other than element-wise.

sergisiso commented 6 months ago

Btw it does technically happend in the GOcean kernel bc_flather_u/v because it has: va(ji,jj) = va(ji,ji -1) + ... but its not a problem because of the mask, which is not something that can be automatically infered just analysing the code

sergisiso commented 6 months ago

I guess what we need to decide is if kernel with va(ji,jj) = va(ji,ji -1) + ...

has the semantics that:

it is ordered as if they were in a loop and can not be parallelised.
the accessor is to a copy of va before starting the loop and can be parallelised with a temporary.

The first represents a more strightforward psy-layer generation as we do know, but the second would be similar to the semantics of Fortran array operations, e.g. A(2:20) = A(1:19) * B which needs a temporary