sdsc / spack

A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
https://spack.io

ATS-9795: SPEC - expanse/0.17.3/gpu/b - User requires NetCDF libraries compiled with PGI compilers #125

Closed · mkandes closed this issue 2 weeks ago

mkandes commented 3 weeks ago

A user requires NetCDF libraries on Expanse that are compatible with the available PGI compilers for some custom code they intend to use as part of an upcoming class they begin teaching next week. Unfortunately, the PGI compilers currently deployed in production on Expanse are only available in the expanse/0.15.4/gpu and expanse/0.17.3/gpu/b Spack instances, whereas the user intends to run this (non-GPU-accelerated) example code only on Expanse's standard compute nodes. In any case, there are currently no NetCDF libraries for the PGI compiler family in either expanse/0.15.4/gpu or expanse/0.17.3/gpu/b.

I reviewed the user's code and attempted to compile it with the NetCDF libraries and other compiler combinations already available in the expanse/0.15.4/cpu and expanse/0.17.3/cpu/b production Spack instances. Unfortunately, the code utilizes some unusual, non-standard Fortran syntax that only the PGI compilers 'allow' at this time. Both the GCC and Intel compilers reported syntax errors. These could be resolved by editing the user's code, but that could become time consuming depending on how far the problem pervades the code; an initial attempt at a fix made it apparent that a more significant change would be needed.
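
For reference, the compile tests looked roughly like the following. This is a minimal sketch: the source file name is a placeholder for the user's code, and the netcdf-fortran module versions/hashes are omitted.

# GCC toolchain from the cpu/0.17.3b instance
module load cpu/0.17.3b gcc/10.2.0 netcdf-fortran
gfortran -c user_code.f90   # fails: syntax errors on the PGI-only constructs

# Intel toolchain (Lmod swaps the compiler family on load)
module load intel/19.1.3.304 netcdf-fortran
ifort -c user_code.f90      # fails with similar syntax errors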

A quick-and-dirty temporary fix for the user has been tested: we can deploy the NetCDF libraries into the expanse/0.17.3/gpu/b instance using the nvhpc/21.9 compiler, but build each package and its dependencies on the standard AMD compute nodes.
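
In outline, the tested workaround looks something like the sketch below. The Slurm account and resource values are placeholders, and the real deployment would run under the spack_gpu role account rather than a user account.

# Submit the build from the gpu/b instance to a standard (AMD Rome) compute
# node so the packages are compiled for target=zen2, not the GPU nodes' Skylake.
srun --partition=shared --account=<project> --nodes=1 --ntasks-per-node=16 --time=04:00:00 \
  spack install netcdf-fortran %nvhpc@21.9 target=zen2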

https://access-ci.atlassian.net/browse/ATS-9795

mkandes commented 3 weeks ago

After another review of the deployment approach described above, it's likely not the best approach. First, the spack_cpu and spack_gpu deployment role accounts on Expanse cannot write into each other's SPACK_ROOT trees from their respective deployment nodes due to how the Ceph keys are assigned to each of them individually. Moreover, even if this cross-write ability were possible, it's probably not a great idea to have one Spack user account writing into another's SPACK_ROOT tree, due to longer-term ownership concerns.

Instead, the more straightforward approach is to deploy nvhpc@21.9 and the required NetCDF libraries and their dependencies directly into the expanse/0.17.3/cpu/b instance.
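
In outline, that deployment amounts to specs like the following. The versions match the modules that ultimately landed (see the module avail output below), but the exact specs, variants, and dependency pins are in the pull request, so treat this as an approximation.

spack install nvhpc@21.9
spack install hdf5@1.10.7 +fortran +hl %nvhpc@21.9
spack install netcdf-c@4.8.1 %nvhpc@21.9 ^hdf5@1.10.7
spack install netcdf-fortran@4.5.3 %nvhpc@21.9 ^netcdf-c@4.8.1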

mkandes commented 3 weeks ago

Pull request created and merged. https://github.com/sdsc/spack/pull/126

mkandes commented 3 weeks ago

Specs deployed into production successfully.

[mkandes@login01 ~]$ module load nvhpc/21.9
[mkandes@login01 ~]$ module avail

------------- /cm/shared/apps/spack/0.17.3/cpu/b/share/spack/lmod/linux-rocky8-x86_64/nvhpc/21.9 -------------
   hdf5/1.10.7/mwaa2bz    netcdf-c/4.8.1/j4nyqvv    netcdf-fortran/4.5.3/fg4qvqw

---------------- /cm/shared/apps/spack/0.17.3/cpu/b/share/spack/lmod/linux-rocky8-x86_64/Core ----------------
   anaconda3/2021.05/q4munrg             git-lfs/2.11.0/kmruniy           pigz/2.6/bgymyil
   aocc/3.2.0/io3s466                    git/2.31.1/ldetm5y               rclone/1.56.2/mldjorr
   aria2/1.35.0/q32jtg2                  intel/19.1.3.304/6pv46so         sratoolkit/2.10.9/rn4humf
   cmake/3.21.4/n5jtjsf                  matlab/2022b/lefe4oq             subversion/1.14.0/qpzq6zs
   entrezdirect/10.7.20190114/6pkkpx2    mercurial/5.8/qmgrjvl            ucx/1.10.1/wla3unl
   gcc/10.2.0/npcyll4                    nvhpc/21.9/xxpthf5        (L)
   gh/2.0.0/mkz3uxl                      parallel/20210922/sqru6rr

------------------------------------------- /cm/local/modulefiles --------------------------------------------
   shared (L)    singularitypro/3.11 (D)    singularitypro/4.1.2    slurm/expanse/23.02.7 (L)

------------------------------------- /cm/shared/apps/access/modulefiles -------------------------------------
   accessusage/0.5-1    cue-login-env

------------------------------------------- /usr/share/modulefiles -------------------------------------------
   DefaultModules (L)    cpu/0.17.3b (c,L,D)    gpu/0.17.3b    (g,D)    nostack/0.17.3b (e,D)
   cpu/0.15.4     (c)    gpu/0.15.4  (g)        nostack/0.15.4 (e)

------------------------------------------- /cm/shared/modulefiles -------------------------------------------
   AMDuProf/3.4.475       sdsc/1.0              (L)    slurm/expanse/23.02.7 (D)
   default-environment    slurm/expanse/current

  Where:
   L:  Module is loaded
   c:  built natively for AMD Rome
   e:  not architecture specific
   g:  built natively for Intel Skylake
   D:  Default Module

Module defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.
See https://lmod.readthedocs.io/en/latest/060_locating.html for details.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

[mkandes@login01 ~]$

https://github.com/sdsc/spack/commit/484490cfe5cea6c0f5e386bbf036a4317c067a1a

mkandes commented 3 weeks ago

Additional notes here: https://github.com/mkandes/notes/blob/main/2024/08/16.md

mkandes commented 2 weeks ago

User confirmed deployed solution is working for them. Closing issue.