Open hiker opened 6 months ago
I think the rules are:
These statements do not mean that a kernel can get parallelised (which always was my understanding). The dependency tool are used to check this (in psyir/transformations/parallel_loop_trans).
So from that point of view, it sounds like this ticket can just be closed, since kernels that cannot be parallelised should be accepted.
Probably we are agreeing but just to be clear - it should be possible to parallelise (the loop over cells containing) a kernel but the metadata should contain enough information to allow PSyclone to deduce that e.g. colouring is required. (e.g. in LFRic, we can deduce when a kernel writes to a shared dof and ensure that such an access is protected appropriately.)
Hmm - I am not sure we agree :) Example, with a stencil of 010,010,010, which seems to fulfil the two rules: only write to a(i,j)
, and declare the stencil access (and we can even remove a(i,j)
from the RHS):
a(i,j) = a(i,j-1) + a(i,j) + a(i,j+1)
Given the normal loop ordering, this means that a(i,j-1)
is the new value (updated in the previous iteration of the outer j
loop), while a(i,j+1)
is still the old value (to be updated next). This cannot be parallelised (over the j
loop, and you could add accesses to i+-1
to also make parallelising the i-loop not only inefficient, but also invalid).
And I don't think colouring fixes this. Isn't the issue with LFRic and colouring if you have INC access to shared elements (i.e. a vertex belonging to several FEs is updated in each column, so you can end up having competing increments on the same vertex in parallel)? While in the gocean API this cannot happen at all (because you can only write to i,j
)?
I think @hiker is right, in GOcean it does not happen because we have 2 copies of each field, the n(ow) and the a(fter). But we don't check if there are kernel dependencies that lead to "loop carried dependencies" when the kernel is the body of the loop and maybe we should or at least clarify that the "write" field can not be "read" other than element-wise.
Btw it does technically happend in the GOcean kernel bc_flather_u/v because it has: va(ji,jj) = va(ji,ji -1) + ...
but its not a problem because of the mask, which is not something that can be automatically infered just analysing the code
I guess what we need to decide is if kernel with va(ji,jj) = va(ji,ji -1) + ...
has the semantics that:
The first represents a more strightforward psy-layer generation as we do know, but the second would be similar to the semantics of Fortran array operations, e.g. A(2:20) = A(1:19) * B
which needs a temporary
In some tests it was discovered that gocean accepts kernels that cannot be parallelised (because of a stencil access, and a write statement to the same field): https://github.com/stfc/PSyclone/pull/2497/files#r1505787613
I have to admit I wasn't aware that this is a requirement, but I think it also needs to be stressed more in the documentation. ATM we only have (https://psyclone.readthedocs.io/en/latest/kernel_layer.html#kernel-layer):
Not sure if I've missed a stronger statement elsewhere.
TODO: