Open nmnobre opened 2 years ago
One thing I believe we should do is converge the APIs logic with the generic data movement. At its essence, the data movement logic just analyses a given tree and decides which variables dependencies exist at the boundaries of the region. Which is done with the DependencyAnalysis plus some logic on top. The DepAnalysis is currently done in:
def kernel_references(self):
# pylint: disable=import-outside-toplevel
from psyclone.dynamo0p3 import DynInvokeSchedule
from psyclone.gocean1p0 import GOInvokeSchedule
from psyclone.psyir.tools import DependencyTools
if self.ancestor((DynInvokeSchedule, GOInvokeSchedule)):
# Look-up the kernels that are children of this node
sig_set = set()
for call in self.kernels():
for arg in call.arguments.acc_args:
sig_set.add(arg)
return (sig_set, )
inp, out = DependencyTools().get_in_out_parameters(self.children)
return (set(inp), set(out))
We may not need the exception for API kernel calls. DependencyTools.get_inout_parameters calls VariablesAccessInfo which calls for each node the node.reference_accesses
method. The reference_accesses of a KernelCall is very similar to self.arguments.acc_args.
If I do the get_in_out_parameters
of a GOcean kernel call I get:
(Pdb) sig_set
{'p_fld%data', 'cu_fld', 'cu_fld%data', 'u_fld%data', 'p_fld', 'u_fld'}
(Pdb) inp, out = DependencyTools().get_in_out_parameters(self.children)
(Pdb) inp
[Signature(cu_fld%internal%xstart), Signature(cu_fld%internal%xstop), Signature(cu_fld%internal%ystart), Signature(cu_fld%internal%ystop), Signature(p_fld), Signature(u_fld)]
(Pdb) out
[Signature(cu_fld), Signature(i), Signature(j)]
which essentially conveys the same information. With the differences: 1) Loop variables i and j are removed (because their lifecycle is entirely inside the data region?).
- GOCean and Dynamo do this in the acc_args logic.
- Nemo does this in the NemoACC{Update|EnterData}Directive lowering.
Both hardcode the loop variables to delete: e.g. for NEMO 'ji', 'jj', 'jk', 'jt'
This is not optimal:
- Other variables with their full lifetime inside the region are not removed.
- If this names are used for other things and used outside the region, they get removed anyway.
I think we have the tools to do a proper lifetime analysis and delete the unneeded variables without hardcoding.
(this will also get rid of the specialized classes - Point 5 in Nuno's previous comment)
2) The APIs also do the explicit pointer attach expression e.g. p_fld, p_fld%data, ... , but this again seems a generic functionality.
3) Finally the API does not provide the cu_fld%internal%xstart expressions. I am not sure why? Is this because these are scalars or some other reason?
Finally, GOcean has infrastructure support to do the host updates thanks to a callback call when doing field%get_data()
. This was very successful in avoiding unneeded preventive data copies. We can consider this for LFRic.
I've just tried this functionality on the SIR-compliant form of the tracer-advection benchmark (being added in https://github.com/stfc/PSycloneBench/issues/92). The compute subroutine is handled fine but, strictly speaking, I need to transform the driver code too (otherwise the checksum computed on the CPU is wrong). If I do that then, for the main 'time-stepping' loop, I get:
...
!$acc update if_present host(itn_count)
do jt = 1, itn_count, 1
!$acc update if_present host(jpi,jpj,jpk,mydomain,pun,pvn,pwn,rnfmsk,rnfmsk_z,tmask,tsn,umask,upsmsk,vmask,zind,zslpx,zslpy,ztfreez,zwx,zwy)
call tra_adv_compute(zind, tsn, ztfreez, rnfmsk, rnfmsk_z, upsmsk, tmask, zwx, zwy, umask, vmask, mydomain, zslpx, zslpy, pun, pvn, pwn, jpi, jpj, jpk, jt)
enddo
!$acc update if_present host(step_timer)
call timer_stop(step_timer)
...
which means we'll be copying every array over to the GPU for every 'time step'. I've a feeling we touched on this issue in the original review. In this case, I think we could identify that this transfer is unnecessary by checking that all of the arguments to the Call
are either Reference
or Literal
?
Actually, I've just realised that we'll in fact get wrong answers because the updates in the time-stepping loop will over-write the values on the GPU with out-of-date CPU values.
I didn't realize before I approved #1952 the PR (ops!) but there are a couple of lines without test coverage on the OpenACC Enter Data implementation. This should be covered on the next PR please.
if not isinstance(if_present, bool):
raise TypeError
else:
from psyclone.psyir.nodes import ACCUpdateDirective
In some experimental work looking at improving Signature
to take a Symbol
as well as text, I've hit Nuno's original point 6. above - the ACCUpdateDirective
constructor currently takes a list of Signatures and the various *Arguments.acc_args methods just return a list of strings.
This is a compilation of the outstanding issues and unimplemented features in the OpenACC enter data and update directives implementation in #310 and #1554.
1) Distinguishing between array accesses and function calls.
In the following script,
CodeBlock. get_symbol_names()
erroneously includessine
whereasDependencyTools().get_in_out_parameters()
does not. I suspect this works without the improvements in #1865 because I'm passing areal
literal tosine
, but the whole thing does fall apart if I use aninteger
literal. @hiker, would it possible to changeCodeBlock. get_symbol_names()
to behave similarly? Even when #1865 is eventually completed, this still leaves the question of how to resolve imported symbols via theuse
statement, but that's a different beast. In the meantime, I do incur the risk of including function names in the update directive's variable list, although that should be okay, since those names won't correspond to any existing variables.2) Changing
ACCEnterDataDirective
andACCUpdateDirective
to useClause
and "proper" lowering.I don't exactly know what the latter means but according to @arporter it's something we could improve on.
3) Extracting precise array access descriptions.
Currently, if we read or write only a few elements of an array on the host, the entirety of the array is being transferred from and to the device, i.e. the update directives only ever see array names and never subarray specifications with subscript ranges as allowed by the OpenACC standard. While this simplifies dependency resolution, it is obviously suboptimal and can potentially be a major disadvantage vs managed memory. Perhaps @hiker has a better idea of what's already possible here?
4) Avoiding superfluous data copies of unmodified or unused data.
We need a list of variables that are written just once and are read-only thereafter. Such variables will be written on the CPU (e.g. after reading from namelist) and copied only once to the GPU. Variables local to a particular subroutine need not and should not be copied back to the device just before they return.
5) Avoid the duplication in
NemoACCEnterDataDirective
andNemoACCUpdateDirective
.Currently, both these classes simply implement identical versions of
lower_to_language_level
. Ideally, these classes shouldn't exist, as per my opinion in #1779. For the time being though, they need moving tosrc/psyclone/domain/nemo
according to @arporter, but I'd rather move them tosrc/psyclone/domain/nemo/nodes
.6) The code for
ACCUpdateDirective
should be tweaked to avoid the usage ofSignature
in favour ofSymbol
.The concept of signatures doesn't exist in the PSyIR. In particular, this means the construction of a PSyIR tree from scratch that uses the OpenACC update directive now involves alien
Signature
objects...7) Before a
Call
node, only copy to the host memory data which is passed by value.Pass-by-value variables need to be updated on the host memory before the host makes the jump on a procedure call to avoid the proliferation of outdated data to the GPU. That shouldn't be the case for pass-by-reference variables where the compiler/runtime should be able to determine the data was already on the GPU.