stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License
103 stars 28 forks source link

[LFRic] GPU data movement #2197

Open arporter opened 1 year ago

arporter commented 1 year ago

Having played with the GungHo, GravityWave and Skeleton mini-apps with nvfortran and GPU, it seems that the memory movement issue is going to be a significant problem. This is because there ends up being overlap between different derived types (proxy objects) since these are constructed/copied such that they share (point to) certain elements.

This seems to cause problems with every data-movement strategy I've tried (managed memory, explicit enter-data directives and relying on the compiler-generated data-movement directives).

arporter commented 1 year ago

Ideally, I need to create a small example that shows this problem (eg 14 doesn't do any of this field copying).

rupertford commented 1 year ago

If we dereference on the CPU so only pass raw arrays to the GPU would that solve the problem? Obviously this would take a while to implement if it is a solution and could be part of the re-structuring of the LFRic PSy-layer generation.

arporter commented 1 year ago

Yes, that was what I was thinking - it's annoying because it really is only the pointers to the data that we care about (and pass to the kernel) but OpenACC gets its knickers in a twist by trying to look at the whole structure.

arporter commented 1 year ago

It turns that using the ASSOCIATE construct seems to work, e.g.:

  ASSOCIATE ( div_stencil => divergence_proxy%local_stencil, &
       div_ncell_3d => divergence_proxy%ncell_3d, &
       field_1_data => field_1_proxy%data, &
       field_2_data => field_2_proxy%data)
  !$acc enter data  &
  !$acc& copyin(cmap,div_stencil,div_ncell_3d,field_1_data, &
  !$acc& field_2_data,map_aspc1_field_1,map_aspc2_field_2,ndf_aspc1_field_1,ndf_aspc2_field_2,nlayers, &
  !$acc& undf_aspc1_field_1,undf_aspc2_field_2)

Doing this lets the Skeleton mini-app run through to completion without errors. (The reported result is wrong because the infrastructure never pulls data back from the GPU.)

arporter commented 1 year ago

Probably this means that if we explicitly set-up pointers/local scalar copies then that would also work (as long as we can generate valid Fortran, i.e. the necessary entities have the target attribute). The question then is, which should we go for? This problem of needing to prevent the compiler from examining surrounding structures is going to occur whenever we have to move data to/from the GPU. (I tried just giving the ACC directive e.g. proxy%data rather than proxy,proxy%data but it seems that the NVIDIA compiler has got smarter and it still examined the whole of proxy and gave the same runtime errors.)

arporter commented 1 year ago

Upon discussion with Rupert, we agreed that the bulk of the work here is in changing the field-data accesses (and other things that we de-reference the proxy for) to be a single variable name. Whether that variable is explicitly declared as a pointer or associated using ASSOCIATE is a second-order issue.

arporter commented 1 year ago

I've done the necessary updates for fields and field vectors, including declaration and association of pointer variables. This generates code that compiles and runs on GPU for eg14. I think I just need to tackle operators and then I might be able to do Skeleton...

arporter commented 1 year ago

I was hoping to minimise the (already extensive) changes required by this change to use pointers to field/operator data by using DeferredType for the ptr symbols themselves. However, this then causes the LFRic kernel extraction tests to fail because they use the PSyIR (whoop) which then complains that it can't generate declarations for symbols of DeferredType. This means that I have to pause this while I (at minimum) address #2223.

arporter commented 9 months ago

I've been trying to add static OpenACC data regions to invokes in Gravity Wave and (once I'd worked around some problems in the DA), this fails because the gen_code path used by LFRic is not implemented in the Directive node. In terms of speed, it's probably easiest to implement gen_code but this is a retrograde step since we really want to get rid of this. I could therefore pause this and concentrate on changes towards removing gen_code.