Closed: @minghangli-uni closed this issue 3 months ago.
Thanks for opening this issue @minghangli-uni!
I've tried to reproduce this issue by changing the `runspersub` setting in `config.yaml`, using a test MOM6 configuration with payu/1.1.4:

- `runspersub: 1` and `payu run -n 3`, to check it runs 3 separate subsequent PBS run job submissions
- `runspersub: 10` and `payu run -n 3`, to check it runs 3 experiment runs in the same PBS run job submission
- `runspersub: 2` and `payu run -n 3`, to check it runs 2 experiment runs in the first job submission and a third in a separate job submission

The above runs OK, so I am wondering whether it's something to do with model-specific logic in payu. @minghangli-uni Do you have a copy/link to the configuration you were testing?
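For reference, the relevant `config.yaml` fragment for the last case above might look like this (a sketch; the rest of the configuration is omitted):

```yaml
# config.yaml (fragment): batch multiple model runs into each PBS submission
runspersub: 2   # at most 2 experiment runs per PBS job
```

With `payu run -n 3`, this should produce two PBS submissions: one covering runs 1 and 2, and a second covering run 3.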
Thanks @jo-basevi for looking into this! Could you please try this and see if it works for you?
~~/g/data/tm70/ml0072/COMMON/git_repos/candelte/expt2~~

/g/data/tm70/ml0072/COMMON/git_repos/candelte/expt1
I'd be checking the model ran properly. It would help if you could change the PBS log files (`expt2.o123041865`, `expt2.e123041865`) to be group readable.

You can change this to be the default by adding

```yaml
qsub_flags: -W umask=027
```

to your `config.yaml`.
Thanks @aidanheerdegen! `expt2` was a test run from last week, which was only run once. I just quickly ran a new one called `expt1` to reproduce the bug I described above. I hope the log files are readable now.
/g/data/tm70/ml0072/COMMON/git_repos/candelte/expt1
Your archive scripts are erroring because the network is unreachable from compute nodes:

```
Downloading data from 'https://raw.githubusercontent.com/ACCESS-NRI/schema/e9055da95093ec2faa555c090fc5af17923d1566/au.org.access-nri/model/output/file-metadata/1-0-1.json' to file '/home/563/ml0072/.cache/pooch/8e3c08344f0361af426ae185c86d446e-1-0-1.json'.
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable
```
Thanks @aidanheerdegen. It's not a Payu bug. I will close it.
@anton-seaice It seems that the error is due to a user script failure:

```
RuntimeError: User defined script/command failed to run: /usr/bin/bash /g/data/vk83/apps/om3-scripts/payu_config/archive.sh
```

The run aborts, so the successive runs cannot proceed.
Ah yes! We have seen that before. @dougiesquire and I have cached copies of that schema, so we don't get the error.

Thoughts @dougiesquire? I guess we need an additional script which runs on the login node to cache this file. Do any of the payu user script steps run on the login node?
Ah rats. I'd forgotten about this. The first time a user imports the `access-nri-intake` package, the schema is downloaded and cached. The workaround for now, @minghangli-uni, is to do the following from the login node:

```
$ module use /g/data/hh5/public/modules
$ module load conda/analysis3
$ python
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from access_nri_intake import source
>>> exit()
```

This will cache a copy of the schema, so you'll only need to do it once (until the schema changes). But we need a better solution.
The options I see are:
> Do any of the payu user script steps run on the login node?

During a payu run, the `init`, `setup`, `run`, `archive` and `error` user scripts run in the run PBS job. If syncing, the `sync` user script runs in the separate sync job. We could have a `setup` user script, but that would require users to run the `payu setup` command on the login node once, before running `payu run`.
I think that means the only way to do this with a user script would be to trigger a qsub job on the `copyq`?
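For concreteness, user scripts are attached under the `userscripts` key in `config.yaml`. A sketch of such a fragment is below; the `setup` script path is a made-up placeholder, while the `archive` script path is the one from the traceback in this issue:

```yaml
# config.yaml (fragment): these hooks run inside the PBS run job, not on the login node
userscripts:
  setup: /path/to/prime_schema_cache.sh                           # placeholder
  archive: /g/data/vk83/apps/om3-scripts/payu_config/archive.sh   # script that failed here
```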
BTW I am a big fan of the intake catalogue and the idea of automatically generating a catalogue so users can create a notebook that works with a local catalogue and then works seamlessly when/if the data is included in another catalogue at a later date.
Translation: it is worth taking some time to get this right, as I think we want this functionality more broadly across ACCESS model configurations.
We'll look into including the schema with the intake package, see https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/185
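The cache-first behaviour being discussed can be sketched as follows. This is a minimal illustration of the pattern, not the actual `access-nri-intake` or `pooch` implementation; the function name, parameters, and paths are made up:

```python
import json
import os
import urllib.request


def load_schema(url, cache_path, fetch=urllib.request.urlopen):
    """Return a JSON schema, preferring a previously cached copy.

    If a cached file exists, it is read without touching the network, so
    compute nodes with no external connectivity still work as long as the
    cache was primed earlier (e.g. from a login node). Otherwise the schema
    is downloaded once and written to the cache for subsequent runs.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    # No cache yet: download (this is the step that fails on compute nodes).
    with fetch(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump(data, f)
    return data
```

Bundling a fallback copy of the schema inside the package (the option in issue #185) would remove the network dependency entirely, at the cost of shipping a new release when the schema changes.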
For `payu/1.1.4` or `payu/dev`, using `payu run -n N` results in the experiment running only once instead of N times. This issue exists with the following loaded modules. However, `payu/1.1.3` functions as expected.