payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Porting payu to non-NCI machinery #323

Open ChrisC28 opened 2 years ago

ChrisC28 commented 2 years ago

There has been some discussion in the context of a MOM6 project about the potential for porting Payu to non-NCI hardware.

I would be interested in getting payu up and running on the Pawsey HPC system. This system uses a Cray architecture with Slurm as the scheduler.

I notice that there appears to be some support for Slurm in payu, as there is a Slurm scheduler class. However, I'm unsure of the process for porting payu to another system.

aidanheerdegen commented 2 years ago

Initial work to port to Pawsey is here:

https://github.com/payu-org/payu/pull/326

ChrisC28 commented 2 years ago

Thanks @aidanheerdegen, I appear to have got the double_gyre case working. Regarding payu @Pawsey, it seems to be working "well enough" for now. The only thing I'd add is that it would be useful to put the work directory on /scratch for more realistic runs. We can deal with additional support for Pawsey when and if it arises, and I may be able to get some support from CSIRO for that.

I'm going to make an attempt to get eac_10 working today. I'm almost certain I'll run into an issue I can't solve, so thanks in advance for the continuing help! It's been quite a while (~9 years) since I used payu or MOM.

aidanheerdegen commented 2 years ago

It is simple to change the laboratory location. Just set shortpath:

```yaml
shortpath: /scratch/pawsey0410/
```

It seems Pawsey is pretty enthusiastic about purging /scratch, so it would make sense to keep your executables and input directories on /group and use full paths to them, and to use some sort of syncing to copy the data to /group. That way the laboratory can be deleted and reinstated pretty much automatically.

There are examples of auto-syncing scripts in the COSIMA experiment repos, e.g.

https://github.com/COSIMA/1deg_era5_iaf/blob/master/sync_data.sh

This is invoked with an option in config.yaml

https://github.com/COSIMA/1deg_era5_iaf/blob/master/config.yaml#L79
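Putting the two suggestions together, a minimal config.yaml fragment might look like this (the script name is illustrative; see the linked COSIMA config for a real example):

```yaml
# Relocate the laboratory to /scratch (purged periodically on Pawsey)
shortpath: /scratch/pawsey0410/

# Run a user script after each run, e.g. to sync output off /scratch
# to /group (script name is illustrative; see COSIMA's sync_data.sh)
postscript: sync_data.sh
```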

ChrisC28 commented 2 years ago

I recently noticed some oddities with project accounting on Pawsey: essentially, my project wasn't being debited.

It turns out that the Slurm equivalent of the PBS -P argument is -A (for account). As far as I can tell from rummaging around in the code, the Slurm scheduler does not pass a project argument.

A single line of code should fix the problem:

```python
pbs_flags.append('-A {project}'.format(project=pbs_project))
```

However, the relevant code on Pawsey is read-only.
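As a sketch of where that line would sit (function and variable names here are illustrative, not payu's actual API; payu's Slurm driver builds its flag list in a similar pattern):

```python
# Illustrative sketch of a Slurm flag builder: Slurm's -A (account) flag
# plays the role of PBS's -P (project) flag. Names are hypothetical.
def build_slurm_flags(project, queue=None):
    """Assemble sbatch flags, charging the run to the given account."""
    flags = []
    if project:
        # The one-line fix described above: pass the project as an account
        flags.append('-A {project}'.format(project=project))
    if queue:
        flags.append('-p {queue}'.format(queue=queue))
    return flags

print(build_slurm_flags('pawsey0410'))
```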

aidanheerdegen commented 2 years ago

Sorry Chris, this slipped through the cracks. Feel free to ping me again if it looks like I've forgotten.

I have updated the payu version on magnus with this change.

The modified payu code is in this PR https://github.com/payu-org/payu/pull/326

I can step you through the process of building your own conda environment with this modified code if that is useful.

reillyja commented 9 months ago

Hi, I've just got a quick clarification question about tracking changes before I summarise the latest Setonix issues.

Just a simple workflow question, as my GitHub skills are still in their infancy: is this the correct way of importing/editing the scripts?

Thanks!

angus-g commented 9 months ago

> Is this the correct way of importing/editing the scripts?

No, you should use pip install -e ., which means that only a link is installed into the lib path. When you edit the contents of the repository, that's reflected in the module you import.
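One generic way to confirm which copy of a package Python is actually importing (this is not payu-specific; substitute "payu" for the stdlib module used here to keep the snippet self-contained):

```python
import importlib.util

# Locate a module without importing it: for an editable install the origin
# points into the cloned repository; for a regular install it points into
# site-packages. "json" is used here only so the snippet runs anywhere.
spec = importlib.util.find_spec("json")
print(spec.origin)
```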

reillyja commented 9 months ago

Firstly, after cloning my forked payu repo and using pip install -e ., the working directory for editing the scripts was assigned to $MYSOFTWARE/conda_install/lib/python3.10/site-packages/payu-1.0.19-py3.10.egg/ (i.e. instead of the default $MYSOFTWARE/setonix/python/lib/python3.10/site-packages/payu/ directory used with pip install .). Any ideas why that would be?

Nonetheless, I've made a couple of edits to envmod.py based on Dale Roberts' comments in this Hive post and also just commented out a couple of lines of slurm.py. Other than that, it's identical to the current master branch.

The error I'm getting now comes from the mom6.err file:

```
/scratch/pawsey0410/jreilly/mom6/work/eac_sthpac-forced_v3/MOM6-SIS2: error while loading shared libraries: libnetcdf.so.19: cannot open shared object file: No such file or directory
```
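A quick, generic way to check whether the dynamic loader can locate a given library (standard library only, not payu functionality; note that find_library consults the ldconfig cache, which may differ from the LD_LIBRARY_PATH seen inside a batch job):

```python
from ctypes.util import find_library

# Returns the soname if the library can be located, or None otherwise,
# mirroring the "cannot open shared object file" failure above.
print(find_library("netcdf"))  # None unless libnetcdf is on the search path
print(find_library("c"))       # the C runtime should resolve on any system
```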

I tried making the system software path in the envmod.py file more specific, directing it to /software/setonix/2022.11/ instead of just /software/setonix/; however, that caused other issues for software required from the 2023.08 path.

I'm starting to think it might just be easier to recompile the model so that everything points to the 2023.08 software directories. Any other comments on this?

My forked repo is at https://github.com/reillyja/payu btw.

angus-g commented 9 months ago

> the working directory for editing the scripts was assigned to [...] Any ideas why that would be?

Did you have a conda environment activated?

> I'm starting to think it might just be easier to recompile the model so that everything points to the 2023.08 software directories. Any other comments on this?

As per the September Pawsey update, the 2022.11 environment is no longer supported. It would be significantly easier to rebuild the model than to fiddle with all the required path changes.

dsroberts commented 9 months ago

Hi @reillyja, these changes look fine for what you're trying to do. Did you end up getting the missing shared library issue resolved? To properly backport these changes, there would need to be a new config option (or options) to take Lmod, and possibly the Cray environment, into account. The module unload step would have to go for Cray systems, but could probably be left in for other Lmod+Spack systems.

Actually, I just had a thought. If core_modules (https://github.com/reillyja/payu/blob/master/payu/experiment.py#L37C34-L37C34) could be moved to a config option, you could populate it with the standard list of modules loaded on login on Setonix, rather than skipping the module unload steps entirely. I think this is safer, as submitting jobs on Setonix works like having qsub -V specified on Gadi: all modules loaded in your current session are carried through to the job. I'm not a fan of that behaviour; it leads to very inconsistent environments between jobs.
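The suggestion above could look something like this (the key name and default list are illustrative, not payu's current code):

```python
# Read the list of modules that must survive the environment-reset step
# from the experiment config, falling back to a built-in default. The key
# name 'core_modules' and the defaults below are hypothetical.
DEFAULT_CORE_MODULES = ['python', 'payu']

def get_core_modules(config):
    """Modules to keep loaded when resetting the environment for a run."""
    return config.get('core_modules', DEFAULT_CORE_MODULES)

# On Setonix this could be populated with the standard login modules:
setonix_config = {'core_modules': ['python', 'payu', 'craype', 'PrgEnv-gnu']}
print(get_core_modules(setonix_config))
print(get_core_modules({}))  # falls back to the default list
```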

ChrisC28 commented 9 months ago

Hi all,

Thanks for your help in this matter.

Noting that at least one problem appears to be related to the compilation of the model, does it belong in this issue? A port of payu to Setonix (and Slurm more generally) would be extremely welcome, but may be unrelated to the issues we are discussing here.

Should I open another issue/Hive discussion regarding model recompilation? The model is currently compiled with gfortran, but I'd like to at least test a version using the Cray compiler suite (we had issues compiling FMS in the past with the Cray compilers, and I might need some help there).

aidanheerdegen commented 9 months ago

We had a meeting with @reillyja and @ChrisC28 and managed to get MOM6+SIS2 compiled under the updated environment, as suggested by @angus-g:

https://github.com/payu-org/payu/issues/323#issuecomment-1726823628

It required some modifications to the FMS CMake config. FMS built OK using Angus's build config:

https://github.com/angus-g/mom6-cmake

but when we tried to use the compiled library in the MOM6 build, it complained about some non-existent build directories in FMS.

Removing references to mosaic2/include and column_diagnostics/include from the CMakeLists.txt and re-running cmake in the FMS build dir solved the issue:

https://github.com/NOAA-GFDL/FMS/blob/main/CMakeLists.txt#L363
https://github.com/NOAA-GFDL/FMS/blob/main/CMakeLists.txt#L369
https://github.com/NOAA-GFDL/FMS/blob/main/CMakeLists.txt#L380

VanuatuN commented 3 weeks ago

Hi @ChrisC28 @reillyja @dsroberts,

I'm currently trying to run ACCESS-OM2 with Slurm on the Leonardo supercomputer in Italy (CINECA). After a month of struggling, all the executables are compiled.

Did you manage to get runs working outside of Gadi with Slurm?

Any help and comments are much appreciated; I would be very grateful for any advice.

Thanks Natalia