Closed penguian closed 4 years ago
The proximate cause of this issue is that the module environment variables are not consistent with how they're configured on raijin.
On raijin the following environment variables are set:
MODULE_VERSION=3.2.6
MODULESHOME=/opt/Modules/3.2.6
On gadi the only equivalent is
MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl
Without a code change the above issue is solved by setting MODULE_VERSION:
export set MODULE_VERSION=v4.3.0
I think this should be the first fix, but a code fix to extract MODULE_VERSION from MODULES_CMD, or just use MODULES_CMD directly, should be the next step.
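A minimal sketch of what extracting MODULE_VERSION from MODULES_CMD might look like. The function names are hypothetical and not part of payu, and the sketch assumes the version is the path component directly under a Modules directory, as in the gadi path above:

```python
import os
import re
from typing import Mapping, Optional


def module_version_from_cmd(modules_cmd: str) -> Optional[str]:
    """Return the version component of a modulecmd path, or None."""
    # Assumption: the layout is .../Modules/<version>/..., e.g.
    # /opt/Modules/v4.3.0/libexec/modulecmd.tcl -> "v4.3.0".
    match = re.search(r"/Modules/([^/]+)/", modules_cmd)
    return match.group(1) if match else None


def resolve_module_version(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer an explicit MODULE_VERSION, else fall back to MODULES_CMD."""
    version = environ.get("MODULE_VERSION")
    if version:
        return version
    return module_version_from_cmd(environ.get("MODULES_CMD", ""))
```

Keeping the explicit MODULE_VERSION as the first choice means the existing raijin setup and the export work-around would both keep working unchanged.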
Once the above issue was solved another error occurred:
Traceback (most recent call last):
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/bin/payu-run", line 10, in <module>
sys.exit(runscript())
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
expt.run()
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/experiment.py", line 505, in run
'libmpi.so'
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/envmod.py", line 102, in lib_update
mod_name, mod_version = fsops.splitpath(lib_path)[2:4]
ValueError: not enough values to unpack (expected 2, got 0)
This is due to ldd not resolving dynamic library paths for my executable:
(Pdb) p slibs
['\tlinux-vdso.so.1 (0x00007ffc36f3f000)', '\tlibnetcdff.so.5 => not found', '\tlibnetcdf.so.7 => not found', '\tlibmpi_usempif08.so.11 => not found', '\tlibmpi_usempi_ignore_tkr.so.6 => not found', '\tlibmpi_mpifh.so.12 => not found', '\tlibmpi.so.12 => not found', '\tlibifport.so.5 => not found', '\tlibifcore.so.5 => not found', '\tlibimf.so => not found', '\tlibsvml.so => not found', '\tlibm.so.6 => /lib64/libm.so.6 (0x0000153a2660c000)', '\tlibintlc.so.5 => not found', '\tlibpthread.so.0 => /lib64/libpthread.so.0 (0x0000153a263ec000)', '\tlibc.so.6 => /lib64/libc.so.6 (0x0000153a26028000)', '\tlibgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000153a25e10000)', '\tlibdl.so.2 => /lib64/libdl.so.2 (0x0000153a25c0c000)', '\t/lib64/ld-linux-x86-64.so.2 (0x0000153a2698e000)', '']
Presumably the solution is to recompile on gadi to ensure all the libraries can be found. Will try this now.
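To show why the unpacking fails, here is a rough sketch of parsing ldd output so that "not found" entries are detected up front instead of being treated as paths. The helper name is hypothetical and this is not payu's implementation; the sample lines are taken from the slibs output above:

```python
from typing import Dict, List, Optional


def parse_ldd_lines(lines: List[str]) -> Dict[str, Optional[str]]:
    """Map each library name to its resolved path, or None if unresolved."""
    resolved: Dict[str, Optional[str]] = {}
    for line in lines:
        # Lines without "=>" (vdso, the loader, blanks) carry no mapping.
        if "=>" not in line:
            continue
        name, _, target = line.strip().partition(" => ")
        if target.startswith("not found"):
            resolved[name] = None
        else:
            # Drop the trailing load address, e.g. "(0x0000153a2660c000)".
            resolved[name] = target.split(" (")[0]
    return resolved


sample = [
    "\tlinux-vdso.so.1 (0x00007ffc36f3f000)",
    "\tlibmpi.so.12 => not found",
    "\tlibm.so.6 => /lib64/libm.so.6 (0x0000153a2660c000)",
    "",
]
libs = parse_ldd_lines(sample)
missing = [name for name, path in libs.items() if path is None]
```

With a check like this, an unresolved libmpi could be reported with a clear message before any attempt to split its path into a module name and version.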
Not sure if this is helpful or a distraction, but much of this code was based on an init script provided by the package:
/opt/Modules/3.2.6/init/python
It looks like the new script is here:
/opt/Modules/v4.3.0/init/python.py
If you can get that path sorted out, say $MODULESHOME, it might help to make things more portable?
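One way the path lookup could be made portable is sketched below, assuming only the two init script names mentioned above (init/python for 3.2.6 and init/python.py for v4.3.0). The function name is hypothetical:

```python
import os
from typing import Optional


def find_python_init(moduleshome: str) -> Optional[str]:
    """Return the Python init script under moduleshome, or None if absent."""
    # Assumption: only these two file names occur, per the paths in this thread.
    for name in ("python.py", "python"):
        candidate = os.path.join(moduleshome, "init", name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A caller would pass in the value of $MODULESHOME and handle the None case explicitly, rather than assuming a fixed layout.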
In config.yaml I already had
qsub_flags: -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027
when I got the error reported above. I am running again to double check. My config.yaml is at
gadi:/scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/config.yaml
payu run now results in
[pcl900@gadi-login-03 1deg_jra55_ryf]$ more 1deg_jra55_ryf.e148947
Traceback (most recent call last):
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
sys.exit(runscript())
File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/subcommands/run_cmd.py", line 116, in runscript
run_args.lab_path)
File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/laboratory.py", line 30, in __init__
raise ValueError('Cannot determine model type.')
ValueError: Cannot determine model type.
Hi @penguian,
That error suggests there is no
model: access-om2
line in your config.yaml file:
https://github.com/COSIMA/1deg_jra55_ryf/blob/master/config.yaml#L12
I guess you know you're running payu from your ~/.local directory?
I tried again, and now the result is
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148960
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
sys.exit(runscript())
File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/subcommands/run_cmd.py", line 128, in runscript
expt.run()
File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/experiment.py", line 441, in run
envmod.setup()
File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/envmod.py", line 40, in setup
with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'
I have managed to successfully run an MITgcm simulation, a small test configuration that does not have to be submitted to the queue, once I recompiled against the libraries available on gadi.
@penguian can you try altering your PATH to pick up the conda/analysis3 version of payu, so we can rule out any issues with a differing codebase?
The result is:
[pcl900@gadi-login-03 1deg_jra55_ryf]$ module use /g/data3/hh5/public/modules
[pcl900@gadi-login-03 1deg_jra55_ryf]$ module load conda/analysis3
[pcl900@gadi-login-03 1deg_jra55_ryf]$ which payu
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu sweep
Moving log 1deg_jra55_ryf.e148974
Moving log 1deg_jra55_ryf.o148974
Removing work path /scratch/fp0/pcl900/access-om2/work/1deg_jra55_ryf
Removing symlink /scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/work
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu run
qsub -q normal -P fp0 -l walltime=14400 -l ncpus=288 -l mem=500GB -N 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib,PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027 -- /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/python /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run
148975.gadi-pbs
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148975
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
sys.exit(runscript())
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
expt.run()
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/experiment.py", line 443, in run
envmod.setup()
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/envmod.py", line 40, in setup
with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'
Perhaps I have misconfigured?
That code branch is only called if MODULE_VERSION is not defined
https://github.com/payu-org/payu/blob/89375e9960c5769ac54e1c8dd89334a895cff891/payu/envmod.py#L40
Can you set this and try again?
export set MODULE_VERSION=v4.3.0
The result is:
[pcl900@gadi-login-03 1deg_jra55_ryf]$ export set MODULE_VERSION=v4.3.0
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu sweep
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu run
qsub -q normal -P fp0 -l walltime=14400 -l ncpus=288 -l mem=500GB -N 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib,PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027 -- /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/python /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run
148978.gadi-pbs
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148978
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
sys.exit(runscript())
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
expt.run()
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/experiment.py", line 443, in run
envmod.setup()
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/envmod.py", line 40, in setup
with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'
In the setup() code, the check on line 38 is for 'MODULEPATH', not MODULE_VERSION.
Yep, but MODULEPATH is set above that, and uses MODULE_VERSION
https://github.com/payu-org/payu/blob/89375e9960c5769ac54e1c8dd89334a895cff891/payu/envmod.py#L25
No, that is moduleshome. The environment variable MODULEPATH is not set until line 46, which is after the FileNotFoundError.
You're right, sorry.
So you don't have MODULEPATH defined by the looks of it. I wonder why?
It is because MODULEPATH is defined by the Environment Modules TCL code on Gadi rather than the way it is done on Raijin. See /opt/Modules/v4.3.0/init/python.py and /opt/Modules/v4.3.0/libexec/modulecmd.tcl.
On Raijin, the script /etc/profile.d/nf_sh_modules is run on login, and this calls $modules_path/init/bash, setting up the environment variables. On Gadi, this script does not exist, but there is a script /etc/profile.d/modules.sh which should do something similar. I will check.
Logging in to Gadi, I see:
Last login: Mon Nov 18 14:07:27 2019 from 150.203.248.245
[pcl900@gadi-login-01 ~]$ echo $MODULESHOME
/opt/Modules/v4.3.0
[pcl900@gadi-login-01 ~]$ echo $MODULEPATH
/apps/Modules/restricted-modulefiles/z00:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles
[pcl900@gadi-login-01 ~]$ echo $MODULE_VERSION
That is, MODULEPATH and MODULESHOME are defined in the interactive login shell, but not MODULE_VERSION.
I succeeded in starting the model by setting:
qsub_flags: -v MODULE_VERSION=v4.3.0,MODULEPATH=/g/data3/hh5/public/modules:/apps/Modules/restricted-modulefiles/z00:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027
in config.yaml. The following also works, as long as MODULE_VERSION and MODULEPATH are defined in the interactive shell:
qsub_flags: -V -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027
See /scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/config.yaml
(FYI, but not relevant to payu: I also copied /g/data1/ua8/JRA55-do/RYF to /scratch/fp0/pcl900/ and adjusted atmosphere/forcing.json to match, as well as setting ncpus for ocean back to 216 in config.yaml. The model now fails with SIGSEGV in ice_transport_remap.f90.)
This fix only worked for payu-run. With payu run it fails, as payu only passes a limited environment to the PBS job, so none of the module environment variables were being passed in.
Paul's work-around
qsub_flags: -V
ensures the environment is fully populated, so it works in that case.
This needs another fix to properly populate the PBS environment with module environment variables.
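As a rough illustration of what such a fix could do, the sketch below collects whichever module variables are set in the submitting shell and builds a -v argument for qsub. The function name and the variable list are assumptions based on the variables discussed in this thread, not payu's actual code:

```python
import os
from typing import Mapping, Optional

# Variables named in this thread; whether all of them are needed is an assumption.
MODULE_VARS = ("MODULE_VERSION", "MODULESHOME", "MODULEPATH", "MODULES_CMD")


def module_env_flag(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Build a "-v NAME=value,..." qsub argument from the set module variables."""
    # Values containing commas would need extra quoting; the colon-separated
    # MODULEPATH shown above does not.
    pairs = [f"{name}={environ[name]}" for name in MODULE_VARS if environ.get(name)]
    if not pairs:
        return None
    return "-v " + ",".join(pairs)
```

Unlike qsub_flags: -V, this passes only the module-related variables, so the job environment stays close to the limited one payu already builds.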
Today I tried running on Gadi. Building succeeded. Unfortunately, I was unable to run the 1deg_jra55_ryf experiment. The error in payu/envmod.py line 40 is
This error is essentially because payu is written for Environment Modules version 3.2.6 on Raijin, and the version of Environment Modules on Gadi is 4.3.0. This version of Environment Modules is backwards incompatible with version 3.2.6. See https://modules.readthedocs.io/en/latest/diff_v3_v4.html. Therefore payu needs to be changed to be compatible with Environment Modules version 4, in particular the current configuration on Gadi. It is also possible that the configuration of Environment Modules on Gadi could change.
See also pull request #128 and issue #200.