stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License

Extend integration tests to run 'LFRic' on GPU #2663

Closed: arporter closed this 1 week ago

arporter commented 2 months ago

Up to now, our testing of this locally has been ad hoc. It would be good to capture what we have so far (optimisation script and compiler options) in the integration tests.

Currently, however, I don't think that LFRic can be built with the latest NVIDIA compiler (24.5) without some source modifications to work around compiler bugs. Perhaps we could apply those patches as part of the test?

arporter commented 2 months ago

We could also do with a working Spack recipe for the LFRic dependencies with the NVIDIA compiler. Currently we only have that for gcc. However, I think NVIDIA have one that they have shared with the Met Office.

https://github.com/MetOffice/simit-spack

arporter commented 2 months ago

We now have a working Spack recipe in psyclone_spack (private repo but could be made public @sergisiso?) for the LFRic dependencies so I think we can proceed with this.

sergisiso commented 2 months ago

It started as a public repo but I closed it because Simit is private anyway, and it redistributes a tar file with rose_picker, which I am allowed to do (it is GPL) but which is also in a private repo.

I didn't want to step on anybody's toes by having this public, but I have raised the question with the Met Office of why those are private.

arporter commented 1 month ago

I can load lfric-build-environment%nvhpc once /apps/spack/psyclone-spack is activated. (Note that this is using 24.5; 24.7 is now available.) Currently I get a missing file: lfric_apps/applications/gungho_model/working/lfric_core/infrastructure/build/fortran/nvfortran.mk

arporter commented 4 weeks ago

Copied in nvfortran.mk and nvc++.mk from where I was working before. The build is successful (which is an improvement over earlier versions of the compiler) but the run seg-faults in the namelist handling:

```
#6  0x00000000006ca0f9 in key_value_mod::get_key_value_key (key=..., self=...) at key_value/key_value_mod.f90:290
#7  0x000000000062e5c6 in namelist_item_mod::get_key (key=..., self=...) at configuration/namelist_item_mod.f90:448
#8  0x0000000000631bae in namelist_mod::locate_member (self=..., name=...) at configuration/namelist_mod.f90:475
#9  0x000000000063136f in namelist_mod::get_str_value (self=<error reading variable: Cannot access memory at address 0x110>, name=<error reading variable: Cannot access memory at address 0x8>, value=...) at configuration/namelist_mod.f90:299
#10 0x000000000065a6e4 in driver_time_mod::init_time (modeldb=...) at driver_time_mod.f90:70
#11 0x00000000004ed181 in gungho_model () at gungho_model.f90:67
```

arporter commented 4 weeks ago

Use Lukas' patch files (and the corresponding revisions of lfric_core/apps):

```
svn checkout https://code.metoffice.gov.uk/svn/lfric/LFRic/trunk@r50610 lfric_core_r50610
svn checkout https://code.metoffice.gov.uk/svn/lfric_apps/main/trunk@r2222 lfric_apps_r2222
ln -s lfric_core_r50610 lfric
patch -p1 < ~/lfric.patch
ln -s lfric_apps_r2222 lfric_apps
patch -p1 < ~/lfric_apps.patch
```

The resulting code builds and runs successfully using nvhpc 24.5 (on CPU).

arporter commented 4 weeks ago

Now build for GPU. PSyclone is failing:

```
$ psyclone -api lfric -d ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/build_gungho_model --config ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/lfric_core/etc/psyclone.cfg -s applications/gungho_model/optimisation/psyclone-test/global.py ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/build_gungho_model/algorithm/intermesh_mappings_alg_mod.x90
...
Transforming invoke 'invoke_16' ...
Module inlining kernel 'prolong_w2_kernel_code'
    Skipped dofs, arg position 9, function space any_discontinuous_space_2
Generation Error: symbol argument in create method of ArrayReference class should be a DataSymbol but found 'NoneType'.
```

I need to stick a pdb breakpoint in the appropriate place and repeat...
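
Not from the issue itself: a minimal sketch of one way to do that, assuming the optimisation script defines `trans(psy)` (as current LFRic scripts do) and that the failure surfaces as a GenerationError from the ACCLoopTrans application. The loop selection here is deliberately simplistic and only illustrative.

```python
import pdb

from psyclone.errors import GenerationError
from psyclone.psyir.nodes import Loop
from psyclone.psyir.transformations import TransformationError
from psyclone.transformations import ACCLoopTrans


def trans(psy):
    '''Apply ACC loop parallelism, dropping into pdb if generation fails.'''
    loop_trans = ACCLoopTrans()
    for invoke in psy.invokes.invoke_list:
        for loop in invoke.schedule.walk(Loop):
            try:
                loop_trans.apply(loop)
            except TransformationError:
                pass  # this loop legitimately rejects the transformation
            except GenerationError:
                pdb.post_mortem()  # inspect the frame where the symbol is None
                raise
    return psy
```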

arporter commented 3 weeks ago

ParallelLoopTrans.validate() -> LFRicLoop.independent_iterations() -> ... -> Loop.reference_accesses() -> Loop.stop_expr() -> DynIntergrid.last_cell_var_symbol() returns None because no colouring information has been supplied yet. That information should be supplied via a call to DynIntergrid.set_colour_info(), which in turn is called by DynMeshes._colourmap_init(), itself called from DynMeshes.declarations() (which generates mesh-related declarations). In other words, the colouring information for an inter-grid kernel is only set up at code-generation time.

I don't understand why we haven't seen this before as we regularly apply colouring and OMP parallelisation to loops. Perhaps we have a different validation path here?

This script uses ACCLoopTrans, which is a generic transformation (and thus does more validation checks), while the original LFRic script uses DynamoOMPParallelLoopTrans.

arporter commented 3 weeks ago

Changed to using the most 'recent' script I had for Gravity Wave. I had to make sure the kernel-output directory was set correctly in psyclone.mk. The code builds and runs (with GPU activity) but then stops with: ERROR: BLOCK_GCR solver_algorithm: NOT converged in 1 iters, Res= 0.37132866E+00 (This is with -gpu=managed.)

arporter commented 3 weeks ago

The same happens for GravityWave: ERROR: GCR solver_algorithm: NOT converged in 20 iters, Res= 0.27603992E-01

arporter commented 3 weeks ago

Tried going back to the script that Lukas sent me. (It's clear he's using an older version of PSyclone because I had to fix the location of the import of ACCKernelsTrans.) I 'fixed' the problem with the colouring information by updating it when the stop-cell is requested. However, compilation then fails because we've added !$acc routine to a kernel that corresponds to an interface, and that has broken the interface:


```fortran
  interface operator_setval_x_kernel_0_code
     module procedure :: operator_setval_x_kernel_code_r_single, &
          operator_setval_x_kernel_code_r_double, &
          operator_setval_x_kernel_code_r_single_to_r_double, &
          operator_setval_x_kernel_code_r_double_to_r_single
  end interface operator_setval_x_kernel_0_code
```

and we have those kernels (without `!$acc routine` added to them) and then one named `operator_setval_x_kernel_0_code` that we *have* added `!$acc routine` to!
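
For reference, a hedged sketch (simplified, and not Lukas' actual script; the helper name is invented here) of the step that triggers this: applying ACCRoutineTrans to each coded kernel adds the `!$acc routine` directive to the single schedule that get_kernel_schedule() hands back, which for a mixed-precision kernel is only one of the module procedures behind the interface.

```python
from psyclone.psyGen import CodedKern
from psyclone.transformations import ACCRoutineTrans


def mark_kernels_for_gpu(schedule):
    '''Add '!$acc routine' to every coded kernel called from this schedule.'''
    routine_trans = ACCRoutineTrans()
    for kern in schedule.walk(CodedKern):
        routine_trans.apply(kern)
```
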
arporter commented 3 weeks ago

It turns out that LFRicKern.get_kernel_schedule() is being smart when the kernel corresponds to an interface: https://github.com/stfc/PSyclone/blob/e9748d78263b517bb4dd216e0c08c69e4e8b0cac/src/psyclone/domain/lfric/lfric_kern.py#L682-L690 and thus we only get one Schedule back. However, when we proceed to modify this Schedule we mess up the name of the routine to be modified. Should we transform all routines in the interface or get rid of the interface altogether?

Since we proceed to write a new kernel out to file, I think it might be best if that new kernel had its metadata updated such that it is just for the required precision.

arporter commented 3 weeks ago

The trouble is, we only know we want to create a new kernel in the transformation itself, not in get_kernel_schedule(). AFAICT, there's currently no way of asking a kernel whether its implementation is behind an interface. This is going to be fixed by #1946, which I started work on a long time ago but haven't progressed. At the moment, the simplest "solution" (while we implement a proper fix) is to change get_kernel_schedule() so that it raises NotImplementedError if it encounters a kernel that contains more than one subroutine. I've tried this and it only breaks 3 existing tests (inlining and those for get_kernel_schedule itself). However, I'm a bit worried that this will break the kernel extraction, @hiker?
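
Roughly, that stop-gap would have the shape below. This is a standalone sketch only: the helper name and exact error message are invented here, and the real change would live inside LFRicKern.get_kernel_schedule().

```python
from psyclone.psyir.nodes import Container, Routine


def single_routine_or_raise(container: Container, kernel_name: str) -> Routine:
    '''Return the lone Routine in the kernel's Container, or refuse.'''
    routines = container.walk(Routine)
    if len(routines) != 1:
        raise NotImplementedError(
            f"Kernel '{kernel_name}' is implemented by {len(routines)} "
            f"subroutines (a mixed-precision interface?) and is not yet "
            f"supported.")
    return routines[0]
```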

arporter commented 3 weeks ago

Actually, I've found a way of checking whether we have a mixed-precision kernel from within the transformation so I can simply exclude such cases from all GPU-related kernel transformations. No need to affect other functionality :-)
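
Something along these lines (a hedged sketch; the check that ends up in PSyclone may well differ, and `is_mixed_precision` is a name invented here). It assumes the schedule returned by get_kernel_schedule() is still attached to the Container for the whole kernel module, so counting the Routines in that tree reveals whether there is more than one implementation.

```python
from psyclone.psyir.nodes import Routine


def is_mixed_precision(kern):
    '''True if the kernel's module provides several routines (one per
    precision) behind a generic interface.'''
    sched = kern.get_kernel_schedule()
    # Walk up to the root of the kernel's PSyIR and count the routines.
    return len(sched.root.walk(Routine)) > 1
```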

arporter commented 3 weeks ago

With that change, GW builds and runs on GPU :-)

arporter commented 3 weeks ago

So does GungHo :-) The checksums don't match those in our integration tests, but this is with an older revision of lfric_apps (2222 instead of 3269):

```
$ cat gungho_model-checksums.txt
Inner product checksum rho = 40D0CE6B59340FBF
Inner product checksum theta = 41FCE89D7EAA5606
Inner product checksum u = 45066B6D2DFA78A0
```

Rebuilt with the 'default' PSyclone script (i.e. no OpenACC) and ran. The checksums still don't match:

```
$ cat gungho_model-checksums.txt
Inner product checksum rho = 40D0CE6B5933ECA6
Inner product checksum theta = 41FCE89D7EA93942
Inner product checksum u = 45066B6D2DEF5DFC
```

arporter commented 3 weeks ago

The script was incorrectly forcing parallelisation of all DoF loops, some of which contain reductions. It was also not doing any redundant computation. Fixed those two things (the reduction check is sketched at the end of this comment) and the GPU run gives:

```
Inner product checksum rho = 40D0CE6B59340FB2
Inner product checksum theta = 41FCE89D7EAA5610
Inner product checksum u = 45066B6D2DFA787C
```

This is for 10 steps, and the namelist file also differs in other ways from our integration test.
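
For the record, the reduction check referred to above could look something like this (a hedged sketch; the helper name is invented and the real script may test for this differently). DoF loops for which it returns True are left sequential (or handled with an explicit reduction clause) rather than being blindly parallelised.

```python
from psyclone.core import AccessType
from psyclone.psyGen import Kern


def has_reduction(loop):
    '''True if any kernel or builtin in this loop sums into a scalar.'''
    return any(arg.access == AccessType.SUM
               for kern in loop.walk(Kern)
               for arg in kern.arguments.args)
```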

arporter commented 3 weeks ago

Profile of a single time step is mostly white space but it's a start:

[image: profile of a single time step]

arporter commented 3 weeks ago

The next step is to move what I have into the account that runs the integration tests and check that the source-code changes I've made haven't broken anything.