stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License

Extend integration tests to run 'LFRic' on GPU #2663

Closed: arporter closed this 1 week ago

arporter commented 2 months ago

Up to now, our testing of this locally has been ad hoc. It would be good to capture what we have so far (optimisation script and compiler options) in the integration tests.

Currently, however, I don't think that LFRic can be built with the latest NVIDIA compiler (24.5) without some source modifications to work around compiler bugs. Perhaps we could apply those patches as part of the test?

arporter commented 2 months ago

We could also do with a working Spack recipe for the LFRic dependencies with the NVIDIA compiler. Currently we only have that for gcc. However, I think NVIDIA have one that they have shared with the Met Office.

https://github.com/MetOffice/simit-spack

arporter commented 2 months ago

We now have a working Spack recipe in psyclone_spack (private repo but could be made public @sergisiso?) for the LFRic dependencies so I think we can proceed with this.

sergisiso commented 2 months ago

It started as a public repo but I closed it because Simit is private anyway, and it redistributes a tar file with rose_picker, which I am allowed to do (it is GPL) but which is also in a private repo.

I didn't want to step on anybody's toes by having this public, but I have raised the question with the Met Office of why those are private.

arporter commented 1 month ago

I can load lfric-build-environment%nvhpc once /apps/spack/psyclone-spack is activated. (Note that this is using 24.5; 24.7 is now available.) Currently I get a missing file: lfric_apps/applications/gungho_model/working/lfric_core/infrastructure/build/fortran/nvfortran.mk

arporter commented 4 weeks ago

Copied in nvfortran.mk and nvc++.mk from where I was working before. The build is successful (which is an improvement over earlier versions of the compiler) but the run seg-faults in the namelist handling:

```
#6  0x00000000006ca0f9 in key_value_mod::get_key_value_key (key=..., self=...) at key_value/key_value_mod.f90:290
#7  0x000000000062e5c6 in namelist_item_mod::get_key (key=..., self=...) at configuration/namelist_item_mod.f90:448
#8  0x0000000000631bae in namelist_mod::locate_member (self=..., name=...) at configuration/namelist_mod.f90:475
#9  0x000000000063136f in namelist_mod::get_str_value (self=<error reading variable: Cannot access memory at address 0x110>, name=<error reading variable: Cannot access memory at address 0x8>, value=...) at configuration/namelist_mod.f90:299
#10 0x000000000065a6e4 in driver_time_mod::init_time (modeldb=...) at driver_time_mod.f90:70
#11 0x00000000004ed181 in gungho_model () at gungho_model.f90:67
```

arporter commented 4 weeks ago

Use Lukas' patch files (and the corresponding revisions of lfric_core/apps):

```
svn checkout https://code.metoffice.gov.uk/svn/lfric/LFRic/trunk@r50610 lfric_core_r50610
svn checkout https://code.metoffice.gov.uk/svn/lfric_apps/main/trunk@r2222 lfric_apps_r2222
ln -s lfric_core_r50610 lfric
patch -p1 < ~/lfric.patch
ln -s lfric_apps_r2222 lfric_apps
patch -p1 < ~/lfric_apps.patch
```

The resulting code builds and runs successfully using nvhpc 24.5 (on CPU).

arporter commented 4 weeks ago

Now build for GPU. PSyclone is failing:

```
$ psyclone -api lfric -d ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/build_gungho_model --config ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/lfric_core/etc/psyclone.cfg -s applications/gungho_model/optimisation/psyclone-test/global.py ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/build_gungho_model/algorithm/intermesh_mappings_alg_mod.x90
...
Transforming invoke 'invoke_16' ...
Module inlining kernel 'prolong_w2_kernel_code'
    Skipped dofs, arg position 9, function space any_discontinuous_space_2
Generation Error: symbol argument in create method of ArrayReference class should be a DataSymbol but found 'NoneType'.
```

I need to stick a pdb breakpoint in the appropriate place and repeat...
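
Not from the issue itself: a minimal sketch of one way to do that, assuming the optimisation script defines `trans(psy)` (as current LFRic scripts do) and that the failure surfaces as a GenerationError from the ACCLoopTrans application. The loop selection here is deliberately simplistic and only illustrative.

```python
import pdb

from psyclone.errors import GenerationError
from psyclone.psyir.nodes import Loop
from psyclone.psyir.transformations import TransformationError
from psyclone.transformations import ACCLoopTrans


def trans(psy):
    '''Apply ACC loop parallelism, dropping into pdb if generation fails.'''
    loop_trans = ACCLoopTrans()
    for invoke in psy.invokes.invoke_list:
        for loop in invoke.schedule.walk(Loop):
            try:
                loop_trans.apply(loop)
            except TransformationError:
                pass  # this loop legitimately rejects the transformation
            except GenerationError:
                pdb.post_mortem()  # inspect the frame where the symbol is None
                raise
    return psy
```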

arporter commented 3 weeks ago

ParallelLoopTrans.validate() -> LFRicLoop.independent_iterations() -> ... -> Loop.reference_accesses() -> Loop.stop_expr() -> DynIntergrid.last_cell_var_symbol() returns None because no colouring information has been supplied yet. That information should be supplied via a call to DynIntergrid.set_colour_info(), which in turn is called by DynMeshes._colourmap_init(), itself called from DynMeshes.declarations() (which generates mesh-related declarations). In other words, the colouring information for an inter-grid kernel is only set up at code-generation time.

I don't understand why we haven't seen this before as we regularly apply colouring and OMP parallelisation to loops. Perhaps we have a different validation path here?

This script uses ACCLoopTrans, which is a generic transformation (and thus does more validation checks), while the original LFRic script uses DynamoOMPParallelLoopTrans.

arporter commented 3 weeks ago

Changed to using the most 'recent' script I had for Gravity Wave. I had to make sure the kernel-output directory was set correctly in psyclone.mk. The code builds and runs (with GPU activity) but then stops with: ERROR: BLOCK_GCR solver_algorithm: NOT converged in 1 iters, Res= 0.37132866E+00 (This is with -gpu=managed.)

arporter commented 3 weeks ago

The same happens for GravityWave: ERROR: GCR solver_algorithm: NOT converged in 20 iters, Res= 0.27603992E-01

arporter commented 3 weeks ago

Tried going back to the script that Lukas sent me. (It's clear he's using an older version of PSyclone because I had to fix the location of the import of ACCKernelsTrans.) I 'fixed' the problem with the colouring information by updating it when the stop-cell is requested. However, compilation then fails because we've added !$acc routine to a kernel that corresponds to an interface, and that has broken the interface:


```fortran
  interface operator_setval_x_kernel_0_code
     module procedure :: operator_setval_x_kernel_code_r_single, &
          operator_setval_x_kernel_code_r_double, &
          operator_setval_x_kernel_code_r_single_to_r_double, &
          operator_setval_x_kernel_code_r_double_to_r_single
  end interface operator_setval_x_kernel_0_code
```

and we have those kernels (without `!$acc routine` added to them) and then one named `operator_setval_x_kernel_0_code` that we *have* added `!$acc routine` to!
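
For reference, a hedged sketch (simplified, and not Lukas' actual script; the helper name is invented here) of the step that triggers this: applying ACCRoutineTrans to each coded kernel adds the `!$acc routine` directive to the single schedule that get_kernel_schedule() hands back, which for a mixed-precision kernel is only one of the module procedures behind the interface.

```python
from psyclone.psyGen import CodedKern
from psyclone.transformations import ACCRoutineTrans


def mark_kernels_for_gpu(schedule):
    '''Add '!$acc routine' to every coded kernel called from this schedule.'''
    routine_trans = ACCRoutineTrans()
    for kern in schedule.walk(CodedKern):
        routine_trans.apply(kern)
```
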
arporter commented 3 weeks ago

It turns out that LFRicKern.get_kernel_schedule() is being smart when the kernel corresponds to an interface: https://github.com/stfc/PSyclone/blob/e9748d78263b517bb4dd216e0c08c69e4e8b0cac/src/psyclone/domain/lfric/lfric_kern.py#L682-L690 and thus we only get one Schedule back. However, when we proceed to modify this Schedule we mess up the name of the routine to be modified. Should we transform all routines in the interface or get rid of the interface altogether?

Since we proceed to write a new kernel out to file, I think it might be best if that new kernel had its metadata updated such that it is just for the required precision.

arporter commented 3 weeks ago

The trouble is, we only know we want to create a new kernel in the transformation itself, not in get_kernel_schedule(). AFAICT, there's currently no way of asking a kernel whether its implementation is behind an interface. This is going to be fixed by #1946, which I started work on a long time ago but haven't progressed. At the moment, the simplest "solution" (while we implement a proper fix) is to change get_kernel_schedule() so that it raises NotImplementedError if it encounters a kernel that contains more than one subroutine. I've tried this and it only breaks 3 existing tests (inlining and those for get_kernel_schedule itself). However, I'm a bit worried that this will break the kernel extraction, @hiker?
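
Roughly, that stop-gap would have the shape below. This is a standalone sketch only: the helper name and exact error message are invented here, and the real change would live inside LFRicKern.get_kernel_schedule().

```python
from psyclone.psyir.nodes import Container, Routine


def single_routine_or_raise(container: Container, kernel_name: str) -> Routine:
    '''Return the lone Routine in the kernel's Container, or refuse.'''
    routines = container.walk(Routine)
    if len(routines) != 1:
        raise NotImplementedError(
            f"Kernel '{kernel_name}' is implemented by {len(routines)} "
            f"subroutines (a mixed-precision interface?) and is not yet "
            f"supported.")
    return routines[0]
```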

arporter commented 3 weeks ago

Actually, I've found a way of checking whether we have a mixed-precision kernel from within the transformation so I can simply exclude such cases from all GPU-related kernel transformations. No need to affect other functionality :-)
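
Something along these lines (a hedged sketch; the check that ends up in PSyclone may well differ, and `is_mixed_precision` is a name invented here). It assumes the schedule returned by get_kernel_schedule() is still attached to the Container for the whole kernel module, so counting the Routines in that tree reveals whether there is more than one implementation.

```python
from psyclone.psyir.nodes import Routine


def is_mixed_precision(kern):
    '''True if the kernel's module provides several routines (one per
    precision) behind a generic interface.'''
    sched = kern.get_kernel_schedule()
    # Walk up to the root of the kernel's PSyIR and count the routines.
    return len(sched.root.walk(Routine)) > 1
```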

arporter commented 3 weeks ago

With that change, GW builds and runs on GPU :-)

arporter commented 3 weeks ago

So does GungHo :-) The checksums don't match those in our integration tests, but this is with an older revision of lfric_apps (2222 instead of 3269):

```
$ cat gungho_model-checksums.txt
Inner product checksum rho = 40D0CE6B59340FBF
Inner product checksum theta = 41FCE89D7EAA5606
Inner product checksum u = 45066B6D2DFA78A0
```

Rebuilt with the 'default' PSyclone script (i.e. no OpenACC) and ran. The checksums still don't match:

```
$ cat gungho_model-checksums.txt
Inner product checksum rho = 40D0CE6B5933ECA6
Inner product checksum theta = 41FCE89D7EA93942
Inner product checksum u = 45066B6D2DEF5DFC
```

arporter commented 3 weeks ago

The script was incorrectly forcing parallelisation of all DoF loops, some of which contain reductions. It was also not doing any redundant computation. Fixed those two things (the reduction check is sketched at the end of this comment) and the GPU run gives:

```
Inner product checksum rho = 40D0CE6B59340FB2
Inner product checksum theta = 41FCE89D7EAA5610
Inner product checksum u = 45066B6D2DFA787C
```

This is for 10 steps, and the namelist file also differs in other ways from our integration test.
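
For the record, the reduction check referred to above could look something like this (a hedged sketch; the helper name is invented and the real script may test for this differently). DoF loops for which it returns True are left sequential (or handled with an explicit reduction clause) rather than being blindly parallelised.

```python
from psyclone.core import AccessType
from psyclone.psyGen import Kern


def has_reduction(loop):
    '''True if any kernel or builtin in this loop sums into a scalar.'''
    return any(arg.access == AccessType.SUM
               for kern in loop.walk(Kern)
               for arg in kern.arguments.args)
```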

arporter commented 3 weeks ago

Profile of a single time step is mostly white space but it's a start:

[image: profile of a single time step]

arporter commented 3 weeks ago

The next step is to move what I have into the account that runs the integration tests and check that the source-code changes I've made haven't broken anything.