Closed: arporter closed this issue 1 week ago.
We could also do with a working Spack recipe for the LFRic dependencies with the NVIDIA compiler. Currently we only have that for gcc. However, I think NVIDIA have one that they have shared with the Met Office.
We now have a working Spack recipe in psyclone_spack (private repo but could be made public @sergisiso?) for the LFRic dependencies so I think we can proceed with this.
It started as a public repo but I closed it because Simit is private anyway, and it redistributes a tar file with rose_picker, which I am permitted to do (it is GPL) but which is also private.
I didn't want to step on anybody's toes by having this public, but I have raised the question with the Met Office of why those are private.
We can load lfric-build-environment%nvhpc once /apps/spack/psyclone-spack is activated. (Note that this is using 24.5, and 24.7 is now available.) Currently the build fails with a missing file: lfric_apps/applications/gungho_model/working/lfric_core/infrastructure/build/fortran/nvfortran.mk
Copied in nvfortran.mk and nvc++.mk from where I was working before. The build is successful (an improvement over earlier versions of the compiler) but the run segfaults in the namelist handling:
#6 0x00000000006ca0f9 in key_value_mod::get_key_value_key (key=..., self=...) at key_value/key_value_mod.f90:290
#7 0x000000000062e5c6 in namelist_item_mod::get_key (key=..., self=...) at configuration/namelist_item_mod.f90:448
#8 0x0000000000631bae in namelist_mod::locate_member (self=..., name=...) at configuration/namelist_mod.f90:475
#9 0x000000000063136f in namelist_mod::get_str_value (self=<error reading variable: Cannot access memory at address 0x110>, name=<error reading variable: Cannot access memory at address 0x8>, value=...) at configuration/namelist_mod.f90:299
#10 0x000000000065a6e4 in driver_time_mod::init_time (modeldb=...) at driver_time_mod.f90:70
#11 0x00000000004ed181 in gungho_model () at gungho_model.f90:67
Use Lukas' patch files (and the corresponding revisions of lfric_core/apps):
svn checkout https://code.metoffice.gov.uk/svn/lfric/LFRic/trunk@r50610 lfric_core_r50610
svn checkout https://code.metoffice.gov.uk/svn/lfric_apps/main/trunk@r2222 lfric_apps_r2222
ln -s lfric_core_r50610 lfric
patch -p1 < ~/lfric.patch
ln -s lfric_apps_r2222 lfric_apps
patch -p1 < ~/lfric_apps.patch
The resulting code builds and runs successfully using nvhpc 24.5 (on CPU).
Now build for GPU. PSyclone is failing:
$ psyclone -api lfric -d ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/build_gungho_model --config ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/lfric_core/etc/psyclone.cfg -s applications/gungho_model/optimisation/psyclone-test/global.py ~/LFRic/spack-nvidia/lfric_apps_r2222/applications/gungho_model/working/build_gungho_model/algorithm/intermesh_mappings_alg_mod.x90
...
Transforming invoke 'invoke_16' ...
Module inlining kernel 'prolong_w2_kernel_code'
Skipped dofs, arg position 9, function space any_discontinuous_space_2
Generation Error: symbol argument in create method of ArrayReference class should be a DataSymbol but found 'NoneType'.
I need to stick a pdb breakpoint in the appropriate place and repeat...
ParallelLoopTrans.validate() -> LFRicLoop.independent_iterations() -> ... -> Loop.reference_accesses() -> Loop.stop_expr() -> DynIntergrid.last_cell_var_symbol() returns None because no colouring information has been supplied yet. That information should be supplied via a call to DynIntergrid.set_colour_info(). In turn, this is called by DynMeshes._colourmap_init(), which is called by DynMeshes.declarations() (which generates mesh-related declarations), i.e. the colouring information for an inter-grid kernel is only set at code-generation time.
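The ordering problem can be illustrated with a minimal, self-contained mock. These are stand-in classes that only imitate the behaviour described above, not the real PSyclone implementation; the method bodies are invented for illustration.

```python
# Mock of the lazy-initialisation ordering bug: colouring info is only
# set at code-generation time, so validation that runs earlier sees None.

class DynIntergridMock:
    """Stand-in for the inter-grid kernel bookkeeping object."""
    def __init__(self):
        self._last_cell_var_symbol = None  # only set at code generation

    def set_colour_info(self, colour_map, ncolours, last_cell):
        # In PSyclone this is driven by DynMeshes._colourmap_init(),
        # which only runs from DynMeshes.declarations().
        self._last_cell_var_symbol = last_cell

    def last_cell_var_symbol(self):
        return self._last_cell_var_symbol


def stop_expr(kern):
    # Mimics Loop.stop_expr(): with a None symbol this is where
    # "should be a DataSymbol but found 'NoneType'" originates.
    sym = kern.last_cell_var_symbol()
    if sym is None:
        raise TypeError("symbol argument in create method of "
                        "ArrayReference class should be a DataSymbol "
                        "but found 'NoneType'")
    return f"{sym}(colour)"


kern = DynIntergridMock()
try:
    # Validation (e.g. ParallelLoopTrans.validate()) runs *before* code
    # generation, so the symbol is still None and we fail here.
    stop_expr(kern)
    failed_early = False
except TypeError:
    failed_early = True

# Code generation later supplies the colouring information, after which
# the same query succeeds.
kern.set_colour_info("cmap", 4, "last_cell_all_colours")
```

Any fix that updates the colouring information when the stop-cell is first requested amounts to moving the set_colour_info() call ahead of validation.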
I don't understand why we haven't seen this before, as we regularly apply colouring and OMP parallelisation to loops. Perhaps we have a different validation path here?
This script uses ACCLoopTrans, which is a generic transformation (and thus performs more validation checks), while the original LFRic script uses DynamoOMPParallelLoopTrans.
Changed to using the most 'recent' script I had for GravityWave. Had to make sure the kernel-output directory was set correctly in psyclone.mk. The code builds and runs (with GPU activity) but stops with: ERROR: BLOCK_GCR solver_algorithm: NOT converged in 1 iters, Res= 0.37132866E+00 (This is with -gpu=managed.)
Same happens for GravityWave: ERROR: GCR solver_algorithm: NOT converged in 20 iters, Res= 0.27603992E-01
Tried going back to the script that Lukas sent me. (It's clear he's using an older version of PSyclone because I had to fix the location of the import of ACCKernelsTrans.) 'Fixed' the problem with the colouring information by updating it when the stop-cell is requested. However, compilation then fails because we've added !$acc routine to a kernel that corresponds to an interface, and that has broken the interface:
```fortran
interface operator_setval_x_kernel_0_code
  module procedure :: operator_setval_x_kernel_code_r_single, &
                      operator_setval_x_kernel_code_r_double, &
                      operator_setval_x_kernel_code_r_single_to_r_double, &
                      operator_setval_x_kernel_code_r_double_to_r_single
end interface operator_setval_x_kernel_0_code
```
and we have those kernels (without `!$acc routine` added to them) and then one named `operator_setval_x_kernel_0_code` to which we *have* added `!$acc routine`!
It turns out that LFRicKern.get_kernel_schedule() is being smart when the kernel corresponds to an interface: https://github.com/stfc/PSyclone/blob/e9748d78263b517bb4dd216e0c08c69e4e8b0cac/src/psyclone/domain/lfric/lfric_kern.py#L682-L690 and thus we only get one Schedule back. However, when we proceed to modify this Schedule we mess up the name of the routine to be modified. Should we transform all routines in the interface, or get rid of the interface altogether?
Since we proceed to write a new kernel out to file, I think it might be best if that new kernel had its metadata updated such that it is just for the required precision.
The trouble is, we only know we want to create a new kernel in the transformation itself, not in get_kernel_schedule(). AFAICT, there's currently no way of asking a kernel whether its implementation is behind an interface. This is going to be fixed by #1946, which I started work on a long time ago but haven't progressed. At the moment, the simplest "solution" (while we implement a proper fix) is to change get_kernel_schedule() so that it raises NotImplementedError if it encounters a kernel that contains more than one subroutine. I've tried this and it only breaks three existing tests (inlining and those for get_kernel_schedule itself). However, I'm a bit worried that this will break the kernel extraction @hiker?
Actually, I've found a way of checking whether we have a mixed-precision kernel from within the transformation so I can simply exclude such cases from all GPU-related kernel transformations. No need to affect other functionality :-)
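As a sketch of what such a check could look like (this is not PSyclone's actual mechanism; the helper, its regexes, and the sample source below are invented for illustration), a mixed-precision kernel can be recognised by its generic name resolving to more than one module procedure:

```python
import re

def is_mixed_precision(src: str) -> bool:
    """Hypothetical helper: True if the kernel source contains an
    interface block listing more than one module procedure."""
    m = re.search(r"interface\s+\w+(.*?)end\s*interface", src,
                  re.S | re.I)
    if not m:
        return False  # plain subroutine, no interface at all
    # Join Fortran line continuations, then grab the procedure list.
    body = m.group(1).replace("&", " ")
    lst = re.search(r"module\s+procedure\s*(?:::)?\s*(.*)", body,
                    re.S | re.I)
    if not lst:
        return False
    names = re.findall(r"[A-Za-z_]\w*", lst.group(1))
    return len(names) > 1

# Cut-down version of the interface shown earlier in this thread.
SRC = """
interface operator_setval_x_kernel_code
  module procedure :: operator_setval_x_kernel_code_r_single, &
                      operator_setval_x_kernel_code_r_double
end interface operator_setval_x_kernel_code
"""

# A transformation script can then skip such kernels rather than
# attempting to add "!$acc routine" to them:
for kern_src in [SRC]:
    if is_mixed_precision(kern_src):
        continue  # leave mixed-precision kernels on the CPU for now
```

Excluding these cases in the transformation keeps the workaround local to the GPU scripts, which is the advantage over changing get_kernel_schedule() itself.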
With that change, GW builds and runs on GPU :-)
So does GungHo :-) Checksums don't match those in our integration tests, but this is with an older revision of lfric_apps (r2222 instead of r3269):
$ cat gungho_model-checksums.txt
Inner product checksum rho = 40D0CE6B59340FBF
Inner product checksum theta = 41FCE89D7EAA5606
Inner product checksum u = 45066B6D2DFA78A0
Rebuild with 'default' PSyclone script (i.e. no OpenACC) and run. Still doesn't match:
$ cat gungho_model-checksums.txt
Inner product checksum rho = 40D0CE6B5933ECA6
Inner product checksum theta = 41FCE89D7EA93942
Inner product checksum u = 45066B6D2DEF5DFC
The script was incorrectly forcing parallelisation of all dof loops, some of which contain reductions. It was also not performing redundant computation. Fixing those two things, the GPU run gives:
Inner product checksum rho = 40D0CE6B59340FB2
Inner product checksum theta = 41FCE89D7EAA5610
Inner product checksum u = 45066B6D2DFA787C
This is for 10 steps, and the namelist file also differs in other ways from our integration test.
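The dof-loop part of that fix can be sketched with mock objects (the Loop dataclass and parallelisable() helper are invented stand-ins; in the real script the decision is made by the transformation's validate() call):

```python
from dataclasses import dataclass

# Illustrative mock, not the PSyclone API: the point is simply that a
# blanket "parallelise every dof loop" pass is wrong once a loop
# carries a reduction (e.g. an inner product / global sum).

@dataclass
class Loop:
    loop_type: str        # "dof" or "cell_column"
    has_reduction: bool   # True for e.g. inner-product loops

def parallelisable(loop: Loop) -> bool:
    # A reduction makes a plain parallel dof loop invalid, so it must
    # be skipped (or given a dedicated reduction clause) instead.
    return not (loop.loop_type == "dof" and loop.has_reduction)

loops = [Loop("dof", False), Loop("dof", True), Loop("cell_column", False)]
offloaded = [l for l in loops if parallelisable(l)]
# The reduction loop is left alone rather than raced on.
```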
The profile of a single time step is mostly white space, but it's a start:
Next step is to move what I have into the account that runs the integration tests and check that the source code changes I've made haven't broken anything.
Up to now, our testing of this locally has been ad hoc. It would be good to capture what we have so far (optimisation script and compiler options) in the integration tests.
Currently, however, I don't think that LFRic can be built with the latest NVIDIA compiler (24.5) without some source modifications to work around compiler bugs. Perhaps we could apply those as patches as part of the test?