payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Work with Environment Modules 4 on Gadi #209

Closed penguian closed 4 years ago

penguian commented 4 years ago

Today I tried running on Gadi. Building succeeded. Unfortunately, I was unable to run the 1deg_jra55_ryf experiment. The error in payu/envmod.py line 40 is

FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

This error occurs because payu was written for Environment Modules version 3.2.6 on Raijin, whereas Gadi runs Environment Modules 4.3.0, which is not backwards compatible with version 3.2.6. See https://modules.readthedocs.io/en/latest/diff_v3_v4.html for the differences. Payu therefore needs to be changed to work with Environment Modules version 4, and in particular with the current configuration on Gadi. Note that the configuration of Environment Modules on Gadi may itself change.

See also pull request #128 and issue #200.

aidanheerdegen commented 4 years ago

The proximate cause of this issue is that the module environment variables are not consistent with how they're configured on raijin.

On raijin the following environment variables are set:

MODULE_VERSION=3.2.6
MODULESHOME=/opt/Modules/3.2.6

on gadi the only equivalent is

MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl

Without a code change the above issue is solved by setting MODULE_VERSION:

export set MODULE_VERSION=v4.3.0

I think this should be the first fix. The next step should be a code fix that either extracts MODULE_VERSION from MODULES_CMD or uses MODULES_CMD directly.
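
A minimal sketch of what extracting the version from MODULES_CMD might look like. This is not payu's code: the function names are hypothetical, and it assumes the layout <base>/<version>/libexec/modulecmd.tcl seen in the gadi value above.

```python
import os
import re


def module_version_from_cmd(modules_cmd):
    """Guess (moduleshome, version) from a MODULES_CMD-style path.

    Assumes the layout <base>/<version>/libexec/modulecmd.tcl.
    Returns (None, None) if the path does not match that layout.
    """
    libexec_dir, cmd_name = os.path.split(modules_cmd)
    moduleshome, libexec = os.path.split(libexec_dir)
    if libexec != 'libexec' or not cmd_name.startswith('modulecmd'):
        return None, None
    version = os.path.basename(moduleshome)
    # Accept version directories such as "3.2.6" or "v4.3.0"
    if not re.match(r'^v?\d+(\.\d+)*$', version):
        return None, None
    return moduleshome, version


def module_version(environ):
    """Prefer an explicit MODULE_VERSION, else derive it from MODULES_CMD."""
    if environ.get('MODULE_VERSION'):
        return environ['MODULE_VERSION']
    _, version = module_version_from_cmd(environ.get('MODULES_CMD', ''))
    return version
```

Calling module_version(os.environ) would then keep working on raijin, where MODULE_VERSION is set, and fall back to the MODULES_CMD path on gadi.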

aidanheerdegen commented 4 years ago

Once the above issue was solved another error occurred:

Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/bin/payu-run", line 10, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/experiment.py", line 505, in run
    'libmpi.so'
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/envmod.py", line 102, in lib_update
    mod_name, mod_version = fsops.splitpath(lib_path)[2:4]
ValueError: not enough values to unpack (expected 2, got 0)

This is due to ldd not resolving dynamic library paths for my executable:

(Pdb) p slibs
['\tlinux-vdso.so.1 (0x00007ffc36f3f000)', '\tlibnetcdff.so.5 => not found', '\tlibnetcdf.so.7 => not found', '\tlibmpi_usempif08.so.11 => not found', '\tlibmpi_usempi_ignore_tkr.so.6 => not found', '\tlibmpi_mpifh.so.12 => not found', '\tlibmpi.so.12 => not found', '\tlibifport.so.5 => not found', '\tlibifcore.so.5 => not found', '\tlibimf.so => not found', '\tlibsvml.so => not found', '\tlibm.so.6 => /lib64/libm.so.6 (0x0000153a2660c000)', '\tlibintlc.so.5 => not found', '\tlibpthread.so.0 => /lib64/libpthread.so.0 (0x0000153a263ec000)', '\tlibc.so.6 => /lib64/libc.so.6 (0x0000153a26028000)', '\tlibgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000153a25e10000)', '\tlibdl.so.2 => /lib64/libdl.so.2 (0x0000153a25c0c000)', '\t/lib64/ld-linux-x86-64.so.2 (0x0000153a2698e000)', '']

Presumably the solution is to recompile on gadi to ensure all the libraries can be found. Will try this now.
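
The unpack error appears consistent with an empty library path being split when libmpi.so is unresolved. As a rough sketch (not payu's own code; the helper names are made up for illustration), ldd lines like those above could be checked before any path is split:

```python
def unresolved_libs(ldd_lines):
    """Return the names of libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_lines:
        name, sep, target = line.strip().partition(' => ')
        if sep and target.strip() == 'not found':
            missing.append(name)
    return missing


def resolved_path(ldd_lines, lib_name):
    """Return the resolved path of the first library whose name starts
    with lib_name, or None if it is absent or unresolved."""
    for line in ldd_lines:
        name, sep, target = line.strip().partition(' => ')
        if sep and name.startswith(lib_name) and target.strip() != 'not found':
            # target looks like "/lib64/libm.so.6 (0x0000153a2660c000)"
            return target.split()[0]
    return None


# A few of the lines from the pdb output above
slibs = [
    '\tlinux-vdso.so.1 (0x00007ffc36f3f000)',
    '\tlibmpi.so.12 => not found',
    '\tlibm.so.6 => /lib64/libm.so.6 (0x0000153a2660c000)',
    '',
]
print(unresolved_libs(slibs))             # ['libmpi.so.12']
print(resolved_path(slibs, 'libmpi.so'))  # None
print(resolved_path(slibs, 'libm.so'))    # /lib64/libm.so.6
```

Returning None for an unresolved library lets the caller report a clear message instead of failing on the tuple unpack.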

marshallward commented 4 years ago

Not sure if this is helpful or a distraction, but much of this code was based on an init script provided by the package:

/opt/Modules/3.2.6/init/python

It looks like the new script is here:

/opt/Modules/v4.3.0/init/python.py

If you can get that path sorted out, say via $MODULESHOME, it might help make things more portable.
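
One way this could look, as a sketch only: try the two init script names quoted above under whatever MODULESHOME points to. The function name and the injectable isfile parameter are assumptions made for illustration.

```python
import os


def find_python_init(moduleshome, isfile=os.path.isfile):
    """Return the path of the Python init script under moduleshome, or None.

    Tries the v4-style name (init/python.py) first, then the v3-style
    name (init/python). The isfile argument exists only so the lookup
    can be exercised without a real Modules installation.
    """
    for name in ('python.py', 'python'):
        candidate = os.path.join(moduleshome, 'init', name)
        if isfile(candidate):
            return candidate
    return None
```

With MODULESHOME=/opt/Modules/v4.3.0 this would look for /opt/Modules/v4.3.0/init/python.py first, so no version number needs to be hard-coded.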

penguian commented 4 years ago

In config.yaml I already had

qsub_flags: -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027

when I got the error reported above. I am running again to double check. My config.yaml is at

gadi:/scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/config.yaml

penguian commented 4 years ago

payu run now results in

[pcl900@gadi-login-03 1deg_jra55_ryf]$ more  1deg_jra55_ryf.e148947
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/subcommands/run_cmd.py", line 116, in runscript
    run_args.lab_path)
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/laboratory.py", line 30, in __init__
    raise ValueError('Cannot determine model type.')
ValueError: Cannot determine model type.

aidanheerdegen commented 4 years ago

Hi @penguian,

That error suggests there is no

model: access-om2

line in your config.yaml file

https://github.com/COSIMA/1deg_jra55_ryf/blob/master/config.yaml#L12

I guess you know you're running payu from your ~/.local directory?

penguian commented 4 years ago

I tried again, and now the result is

[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148960
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/experiment.py", line 441, in run
    envmod.setup()
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/envmod.py", line 40, in setup
    with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

aidanheerdegen commented 4 years ago

I have managed to run an MITgcm simulation successfully, using a small test configuration that does not have to be submitted to the queue, once I recompiled it against the libraries available on gadi.

@penguian can you try altering your PATH to pick up the conda/analysis3 version of payu, so we can rule out any issues with a differing codebase?

penguian commented 4 years ago

The result is:

[pcl900@gadi-login-03 1deg_jra55_ryf]$ module use /g/data3/hh5/public/modules
[pcl900@gadi-login-03 1deg_jra55_ryf]$ module load conda/analysis3
[pcl900@gadi-login-03 1deg_jra55_ryf]$ which payu
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu sweep
Moving log 1deg_jra55_ryf.e148974
Moving log 1deg_jra55_ryf.o148974
Removing work path /scratch/fp0/pcl900/access-om2/work/1deg_jra55_ryf
Removing symlink /scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/work
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu run
qsub -q normal -P fp0 -l walltime=14400 -l ncpus=288 -l mem=500GB -N 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib,PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027 -- /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/python /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run
148975.gadi-pbs
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148975
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/experiment.py", line 443, in run
    envmod.setup()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/envmod.py", line 40, in setup
    with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

Perhaps I have misconfigured?

aidanheerdegen commented 4 years ago

That code branch is only called if MODULE_VERSION is not defined

https://github.com/payu-org/payu/blob/89375e9960c5769ac54e1c8dd89334a895cff891/payu/envmod.py#L40

Can you set this and try again?

export set MODULE_VERSION=v4.3.0

penguian commented 4 years ago

The result is:

[pcl900@gadi-login-03 1deg_jra55_ryf]$ export set MODULE_VERSION=v4.3.0
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu sweep
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu run
qsub -q normal -P fp0 -l walltime=14400 -l ncpus=288 -l mem=500GB -N 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib,PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027 -- /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/python /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run
148978.gadi-pbs
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148978
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/experiment.py", line 443, in run
    envmod.setup()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/envmod.py", line 40, in setup
    with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

penguian commented 4 years ago

In the setup() code, the check on line 38 is for 'MODULEPATH', not MODULE_VERSION.

aidanheerdegen commented 4 years ago

Yep, but MODULEPATH is set above that, and uses MODULE_VERSION

https://github.com/payu-org/payu/blob/89375e9960c5769ac54e1c8dd89334a895cff891/payu/envmod.py#L25

penguian commented 4 years ago

No, that is moduleshome. The environment variable MODULEPATH is not set until line 46, which is after the FileNotFoundError.

aidanheerdegen commented 4 years ago

You're right, sorry.

So you don't have MODULEPATH defined by the looks of it. I wonder why?

penguian commented 4 years ago

It is because MODULEPATH is defined by the Environment Modules TCL code on Gadi rather than the way it is done on Raijin. See /opt/Modules/v4.3.0/init/python.py and /opt/Modules/v4.3.0/libexec/modulecmd.tcl.
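
As a sketch of a more defensive fallback (not the actual setup() code; resolve_modulepath is a hypothetical helper), MODULEPATH could be taken from the environment when present, and the .modulespath file read only if it exists:

```python
import os


def resolve_modulepath(environ, moduleshome):
    """Return a MODULEPATH string, or None if it cannot be determined.

    Order of preference:
    1. MODULEPATH already present in the environment.
    2. A <moduleshome>/init/.modulespath file, if there is one.
    """
    if environ.get('MODULEPATH'):
        return environ['MODULEPATH']

    initpath = os.path.join(moduleshome, 'init', '.modulespath')
    if not os.path.isfile(initpath):
        # Absent on gadi, per the FileNotFoundError reported above
        return None

    with open(initpath) as initfile:
        # Strip comments and surrounding whitespace from each line
        paths = [line.partition('#')[0].strip() for line in initfile]
    # Drop the lines that were blank or comment-only
    return ':'.join(p for p in paths if p)
```

A None result gives the caller the chance to print a useful warning rather than raising FileNotFoundError inside the PBS job.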

penguian commented 4 years ago

On Raijin, the script /etc/profile.d/nf_sh_modules is run on login, and this calls $modules_path/init/bash, setting up the environment variables. On Gadi, this script does not exist, but there is a script /etc/profile.d/modules.sh which should do something similar. I will check.

penguian commented 4 years ago

Logging in to Gadi, I see:

Last login: Mon Nov 18 14:07:27 2019 from 150.203.248.245
[pcl900@gadi-login-01 ~]$ echo $MODULESHOME
/opt/Modules/v4.3.0
[pcl900@gadi-login-01 ~]$ echo $MODULEPATH
/apps/Modules/restricted-modulefiles/z00:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles
[pcl900@gadi-login-01 ~]$ echo $MODULE_VERSION

That is, MODULEPATH and MODULESHOME are defined in the interactive login shell, but MODULE_VERSION is not.

penguian commented 4 years ago

I succeeded in starting the model by setting:

qsub_flags: -v MODULE_VERSION=v4.3.0,MODULEPATH=/g/data3/hh5/public/modules:/apps/Modules/restricted-modulefiles/z00:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027

in config.yaml. The following also works, as long as MODULE_VERSION and MODULEPATH are defined in the interactive shell:

qsub_flags: -V -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027

See /scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/config.yaml

(FYI, but not relevant to payu: I also copied /g/data1/ua8/JRA55-do/RYF to /scratch/fp0/pcl900/ and adjusted atmosphere/forcing.json to match, as well as setting ncpus for ocean back to 216 in config.yaml. The model now fails with SIGSEGV in ice_transport_remap.f90.)

aidanheerdegen commented 4 years ago

This fix only worked for payu-run. With payu run it fails, because payu passes only a limited environment to the PBS job, so none of the module environment variables were passed in.

Paul's work-around

qsub_flags: -V 

ensures the environment is fully populated, so it works in that case.

This needs another fix to properly populate the PBS environment with module environment variables.
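
A rough sketch of what such a fix might do, assuming the variables named in this thread are the ones that matter. The function name and the variable list are illustrative, not part of payu.

```python
# Module-related variables mentioned in this thread
MODULE_ENV_VARS = ('MODULESHOME', 'MODULEPATH', 'MODULE_VERSION', 'MODULES_CMD')


def module_env_flag(environ, names=MODULE_ENV_VARS):
    """Build a qsub '-v' argument from whichever module variables are set.

    Returns an empty list if none are defined, so the result can be
    appended directly to a qsub argument list.
    """
    pairs = [
        '{}={}'.format(name, environ[name])
        for name in names if environ.get(name)
    ]
    if not pairs:
        return []
    return ['-v', ','.join(pairs)]
```

This passes only the module variables rather than the whole environment, which is narrower than -V. Values containing commas would need extra quoting, but the colon-separated paths shown above do not.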