[nemo] Array-valued function needs to be inlined for OpenACC performance

stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran

BSD 3-Clause "New" or "Revised" License

103 stars 28 forks source link

[nemo] Array-valued function needs to be inlined for OpenACC performance #924

Open arporter opened 3 years ago

arporter commented 3 years ago

For example: sbcblk.f90, cp_air() function.

Things that still need doing:

[ ] handling the inlining of routines that call other routines (https://github.com/stfc/PSyclone/pull/2413#discussion_r1692776348);
[ ] understand why profiling (in acc_kernels_trans) interacts with inlining and maybe drop it in favour of the --profile command-line flag;
[ ] add support for inlining of routines called via a named interface;
[ ] support for routines with named arguments;

arporter commented 3 years ago

e.g. if we have some code:

my_array(:,:) = cp_air(array1(:,:)) * array2(:,:) + array3(:,:)

where cp_air is an array-valued function that performs compute then this is hard to parallelise/ensure it remains GPU resident. (If we put ACC KERNELS inside cp_air then we can't have it around this assignment.) Ideally we need to refactor this:

my_tmp(:,:) = cp_air(array1(:,:))
my_array(:,:) = my_tmp(:,:)*array2(:,:) + array3(:,:)

That then allows us to have KERNELS inside cp_air() and, separately, around the assignment.

Better still would be to in-line cp_air but we have to walk before we can run.

rupertford commented 3 years ago

If I remember correctly, some of the existing kernel transformations (that replace intrinsics) have logic similar to the above. Perhaps the splitting out of an intrinsic/function should have its own transformation which can then be shared.

arporter commented 3 years ago

Yes, I said as much to Chris when we were discussing this. Hopefully there's some reuse possible.

arporter commented 3 years ago

Obviously, a step 0 for this work is to be able to identify array-valued functions. Currently, all RoutineSymbols that represent Fortran functions are given DeferredType (#1294).

arporter commented 2 years ago

This is going quite well. I need to handle the case where there's a name clash for a ContainerSymbol as we can't simply rename it in that case.

arporter commented 2 years ago

Note to self: I need to identify when a routine accesses variables made available from some outer scoping region as this prevents inlining.

arporter commented 2 years ago

The first review and subsequent work on validate has thrown up a number of things that I'm adding TODOs for this time around:

[ ] validate will be much nicer once we can query the datatype of any Reference (#1799)
[ ] Range needs extending to return the PSyIR for the number of elements it represents
[ ] Support needs to be added for routines with named arguments.

sergisiso commented 2 years ago

The initial implementation of InlineTransformation is now on master, but some remaining TODOs need to be fixed to close this issue.

sergisiso commented 1 year ago

This is also one of the major issues in OpenMP for GPU now, mostly in the sbcblk region. Some functions that will need to be inlined are: q_sat, sbc_dcy, gamma_moist, cd_neutral_10m, psi_h, psi_m

arporter commented 1 year ago

Although the basic inlining functionality is on master, it still requires that the routine to be inlined be in the same Container as the call site. That means we need module inlining working in native PSyIR (currently I think it only works for PSyKAl APIs).

sergisiso commented 1 year ago

Alternatively, now that we have the elemental attribute, we can a resolve the cp_air symbol (which could be imported) for: my_array(:,:) = cp_air(array1(:,:)) * array2(:,:) + array3(:,:) and then arrayrange2loop can decide how to convert it into loop-form depending if it is elemental or not.

(Ideally PSyclone should be able to do both, and the transformation script just chose how to approach it)

arporter commented 10 months ago

Currently a CodedKern has a get_kernel_schedule method. To generalise the KernelModuleInlineTrans transformation we need to support Call as well. Call has routine which gives the corresponding RoutineSymbol. In the work I'm doing now I propose to add a RoutineSymbol.get_schedule method which is very like the ContainerSymbol.container method. I could name it schedule but, since it potentially does a lot of searching, I prefer the get_ prefix. I'd argue that the container method could also do with a get_ prefix.

sergisiso commented 10 months ago

I agree with the get_ but maybe it could be Routine.Symbol.get_routine() instead of "schedule" to make it easier for people not familiar with the PSyIR concepts.

(PS: Now I would also rename Container to Module for the same reason, and I think I am to blame for picking the wrong name)

arporter commented 9 months ago

(PS: Now I would also rename Container to Module for the same reason, and I think I am to blame for picking the wrong name)

Well, we do aspire to be language-neutral so I think Container is fine. At least 'routine' is a generic concept :-)

arporter commented 7 months ago

Chatting with Martin and Hugo, it turns out that CROCO doesn't have modules but they would still like to be able to do inlining. This would mean we would need to insert routines into the FileContainer instead (or, we just extend InlineTrans to search for the Routine outside of the current Container?).