nmnobre commented 2 years ago

This is a compilation of the outstanding issues and unimplemented features in the OpenACC enter data and update directives implementation in #310 and #1554.

1) Distinguishing between array accesses and function calls.

In the following script, CodeBlock. get_symbol_names() erroneously includes sine whereas DependencyTools().get_in_out_parameters() does not. I suspect this works without the improvements in #1865 because I'm passing a real literal to sine, but the whole thing does fall apart if I use an integer literal. @hiker, would it possible to change CodeBlock. get_symbol_names() to behave similarly? Even when #1865 is eventually completed, this still leaves the question of how to resolve imported symbols via the use statement, but that's a different beast. In the meantime, I do incur the risk of including function names in the update directive's variable list, although that should be okay, since those names won't correspond to any existing variables.

from fparser.common.readfortran import FortranStringReader
from fparser.two.parser import ParserFactory
from psyclone.psyir.tools import DependencyTools
from psyclone.psyGen import PSyFactory

code = '''
real function sine(x)
        implicit none
        real :: x
        sine = 1 - cos(x*1.)*cos(x*1.)
end function

program test
    real r
    r = sine(2.)
    write(*,*) sine(2.)
end program test
'''

reader = FortranStringReader(code)
parser = ParserFactory().create()
ast = parser(reader)

psy = PSyFactory("nemo", distributed_memory=False).create(ast)
schedule = psy.invokes.invoke_list[1].schedule

print(DependencyTools().get_in_out_parameters(schedule[0]))
print(schedule[1].get_symbol_names())

2) Changing ACCEnterDataDirective and ACCUpdateDirective to use Clause and "proper" lowering.

I don't exactly know what the latter means but according to @arporter it's something we could improve on.

3) Extracting precise array access descriptions.

Currently, if we read or write only a few elements of an array on the host, the entirety of the array is being transferred from and to the device, i.e. the update directives only ever see array names and never subarray specifications with subscript ranges as allowed by the OpenACC standard. While this simplifies dependency resolution, it is obviously suboptimal and can potentially be a major disadvantage vs managed memory. Perhaps @hiker has a better idea of what's already possible here?

4) Avoiding superfluous data copies of unmodified or unused data.

We need a list of variables that are written just once and are read-only thereafter. Such variables will be written on the CPU (e.g. after reading from namelist) and copied only once to the GPU. Variables local to a particular subroutine need not and should not be copied back to the device just before they return.

5) Avoid the duplication in NemoACCEnterDataDirective and NemoACCUpdateDirective.

Currently, both these classes simply implement identical versions of lower_to_language_level. Ideally, these classes shouldn't exist, as per my opinion in #1779. For the time being though, they need moving to src/psyclone/domain/nemo according to @arporter, but I'd rather move them to src/psyclone/domain/nemo/nodes.

6) The code for ACCUpdateDirective should be tweaked to avoid the usage of Signature in favour of Symbol.

The concept of signatures doesn't exist in the PSyIR. In particular, this means the construction of a PSyIR tree from scratch that uses the OpenACC update directive now involves alien Signature objects...

7) Before a Call node, only copy to the host memory data which is passed by value.

Pass-by-value variables need to be updated on the host memory before the host makes the jump on a procedure call to avoid the proliferation of outdated data to the GPU. That shouldn't be the case for pass-by-reference variables where the compiler/runtime should be able to determine the data was already on the GPU.

sergisiso commented 1 year ago

One thing I believe we should do is converge the APIs logic with the generic data movement. At its essence, the data movement logic just analyses a given tree and decides which variables dependencies exist at the boundaries of the region. Which is done with the DependencyAnalysis plus some logic on top. The DepAnalysis is currently done in:

    def kernel_references(self):

        # pylint: disable=import-outside-toplevel
        from psyclone.dynamo0p3   import DynInvokeSchedule
        from psyclone.gocean1p0   import GOInvokeSchedule
        from psyclone.psyir.tools import DependencyTools

        if self.ancestor((DynInvokeSchedule, GOInvokeSchedule)):
            # Look-up the kernels that are children of this node
            sig_set = set()
            for call in self.kernels():
                for arg in call.arguments.acc_args:
                    sig_set.add(arg)
            return (sig_set, )

        inp, out = DependencyTools().get_in_out_parameters(self.children)
        return (set(inp), set(out))

We may not need the exception for API kernel calls. DependencyTools.get_inout_parameters calls VariablesAccessInfo which calls for each node the node.reference_accesses method. The reference_accesses of a KernelCall is very similar to self.arguments.acc_args.

If I do the get_in_out_parameters of a GOcean kernel call I get:

(Pdb) sig_set
{'p_fld%data', 'cu_fld', 'cu_fld%data', 'u_fld%data', 'p_fld', 'u_fld'}
(Pdb) inp, out = DependencyTools().get_in_out_parameters(self.children)
(Pdb) inp
[Signature(cu_fld%internal%xstart), Signature(cu_fld%internal%xstop), Signature(cu_fld%internal%ystart), Signature(cu_fld%internal%ystop), Signature(p_fld), Signature(u_fld)]
(Pdb) out
[Signature(cu_fld), Signature(i), Signature(j)]

which essentially conveys the same information. With the differences: 1) Loop variables i and j are removed (because their lifecycle is entirely inside the data region?).

- GOCean and Dynamo do this in the acc_args logic.
- Nemo does this in the NemoACC{Update|EnterData}Directive lowering.

Both hardcode the loop variables to delete: e.g. for NEMO 'ji', 'jj', 'jk', 'jt'

This is not optimal:
- Other variables with their full lifetime inside the region are not removed.
- If this names are used for other things and used outside the region, they get removed anyway.

I think we have the tools to do a proper lifetime analysis and delete the unneeded variables without hardcoding.
(this will also get rid of the specialized classes - Point 5 in Nuno's previous comment)

2) The APIs also do the explicit pointer attach expression e.g. p_fld, p_fld%data, ... , but this again seems a generic functionality.

3) Finally the API does not provide the cu_fld%internal%xstart expressions. I am not sure why? Is this because these are scalars or some other reason?

Finally, GOcean has infrastructure support to do the host updates thanks to a callback call when doing field%get_data(). This was very successful in avoiding unneeded preventive data copies. We can consider this for LFRic.

arporter commented 1 year ago

I've just tried this functionality on the SIR-compliant form of the tracer-advection benchmark (being added in https://github.com/stfc/PSycloneBench/issues/92). The compute subroutine is handled fine but, strictly speaking, I need to transform the driver code too (otherwise the checksum computed on the CPU is wrong). If I do that then, for the main 'time-stepping' loop, I get:

...
!$acc update if_present host(itn_count)
do jt = 1, itn_count, 1
  !$acc update if_present host(jpi,jpj,jpk,mydomain,pun,pvn,pwn,rnfmsk,rnfmsk_z,tmask,tsn,umask,upsmsk,vmask,zind,zslpx,zslpy,ztfreez,zwx,zwy)
  call tra_adv_compute(zind, tsn, ztfreez, rnfmsk, rnfmsk_z, upsmsk, tmask, zwx, zwy, umask, vmask, mydomain, zslpx, zslpy, pun, pvn, pwn, jpi, jpj, jpk, jt)
enddo
!$acc update if_present host(step_timer)
call timer_stop(step_timer)
...

which means we'll be copying every array over to the GPU for every 'time step'. I've a feeling we touched on this issue in the original review. In this case, I think we could identify that this transfer is unnecessary by checking that all of the arguments to the Call are either Reference or Literal?

arporter commented 1 year ago

Actually, I've just realised that we'll in fact get wrong answers because the updates in the time-stepping loop will over-write the values on the GPU with out-of-date CPU values.

sergisiso commented 1 year ago

I didn't realize before I approved #1952 the PR (ops!) but there are a couple of lines without test coverage on the OpenACC Enter Data implementation. This should be covered on the next PR please.

[x] src/psyclone/psyir/nodes/acc_directives.py :: 859

if not isinstance(if_present, bool):
    raise TypeError

[x] src/psyclone/psyir/transformations/acc_update_trans.py :: 401
```
else:
from psyclone.psyir.nodes import ACCUpdateDirective
```

arporter commented 1 year ago

1996 has fixed this missing coverage.

arporter commented 9 months ago

In some experimental work looking at improving Signature to take a Symbol as well as text, I've hit Nuno's original point 6. above - the ACCUpdateDirective constructor currently takes a list of Signatures and the various *Arguments.acc_args methods just return a list of strings.

stfc / PSyclone

[To-Do List] OpenACC enter data and update directives implementation #1872

1996 has fixed this missing coverage.