payu-org / payu

A workflow management tool for numerical models on the NCI computing systems

payu/1.1.4 Successive runs fail #487

Closed minghangli-uni closed 3 months ago

minghangli-uni commented 3 months ago

For payu/1.1.4 or payu/dev, using payu run -n N results in the experiment running only once instead of N times. The issue occurs with either of the following sets of loaded modules:

module use /g/data/vk83/modules
module load payu/1.1.4

or

module use /g/data/vk83/prerelease/modules
module load payu/dev

However, payu/1.1.3 functions as expected.
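For reference, the failing invocation looks roughly like this (N=5 is just an illustrative count):

module use /g/data/vk83/modules
module load payu/1.1.4
payu run -n 5    # expected to perform 5 successive runs; only the first completes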

jo-basevi commented 3 months ago

Thanks for opening this issue @minghangli-uni!

I've tried to reproduce this issue by changing the runspersub setting in config.yaml, using this test MOM6 configuration with payu/1.1.4:

The above runs ok, so I am wondering if it's something to do with model-specific logic in payu. @minghangli-uni Do you have a copy of, or link to, the configuration you were testing?
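For anyone following along, the setting being varied lives in config.yaml; a minimal sketch (the value 2 is just an example):

# config.yaml excerpt
runspersub: 2    # maximum number of model runs bundled into each PBS submission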

minghangli-uni commented 3 months ago

Thanks @jo-basevi for looking into this! Could you please try this and see if it works for you?

~~/g/data/tm70/ml0072/COMMON/git_repos/candelte/expt2~~ /g/data/tm70/ml0072/COMMON/git_repos/candelte/expt1

aidanheerdegen commented 3 months ago

I'd be checking the model ran properly. It would help if you could change the PBS log files (expt2.o123041865, expt2.e123041865) to be group readable.
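For example, something like this from the experiment's control directory (assuming that's where the PBS logs landed):

chmod g+r expt2.o123041865 expt2.e123041865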

You can change this to be the default by adding

qsub_flags: -W umask=027

to your config.yaml.

minghangli-uni commented 3 months ago

Thanks @aidanheerdegen! expt2 was a test run from last week, which was only run once. I just quickly ran a new one called expt1 to reproduce the bug described above. Hopefully the log files are readable now.

/g/data/tm70/ml0072/COMMON/git_repos/candelte/expt1

aidanheerdegen commented 3 months ago

Your archive scripts are erroring because the network is unreachable from compute nodes:

Downloading data from 'https://raw.githubusercontent.com/ACCESS-NRI/schema/e9055da95093ec2faa555c090fc5af17923d1566/au.org.access-nri/model/output/file-metadata/1-0-1.json' to file '/home/563/ml0072/.cache/pooch/8e3c08344f0361af426ae185c86d446e-1-0-1.json'.
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

minghangli-uni commented 3 months ago

Thanks @aidanheerdegen. It's not a Payu bug. I will close it.

@anton-seaice It seems that the error is due to a user script failure:

RuntimeError: User defined script/command failed to run: /usr/bin/bash /g/data/vk83/apps/om3-scripts/payu_config/archive.sh

The run aborts at that point, so the successive runs cannot proceed.

anton-seaice commented 3 months ago

Ah yes! We have seen that before ... @dougiesquire and I have cached copies of that schema so we don't get the error:

Thoughts @dougiesquire? I guess we need an additional script which runs on the login node to cache this file? Do any of the payu user script steps run on the login node?

dougiesquire commented 3 months ago

Ah rats. I'd forgotten about this. The first time a user imports the access-nri-intake package, the schema is downloaded and cached. The workaround for now, @minghangli-uni, is to do the following from the login node:

$ module use /g/data/hh5/public/modules
$ module load conda/analysis3
$ python
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from access_nri_intake import source
>>> exit()

This will cache a copy of the schema so you'll only need to do it once (until the schema changes). But we need a better solution.
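If it helps, the same import can presumably be done non-interactively as a one-liner (same effect, just without the REPL):

$ module use /g/data/hh5/public/modules
$ module load conda/analysis3
$ python -c "from access_nri_intake import source"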

anton-seaice commented 3 months ago

The options I see are:

jo-basevi commented 3 months ago

Do any of the payu user script steps run on the login node?

During a payu run, the init, setup, run, archive and error user scripts run in the run PBS job. If syncing, the sync user script runs in the separate sync job. A setup user script could be used, but that would require users to run the payu setup command on the login node once, before running payu run.
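For context, these hooks are configured under userscripts in config.yaml; a rough sketch (the setup and error script paths here are purely illustrative, the archive one is the script from the error above):

userscripts:
    setup: ./scripts/pre-run.sh                                      # hypothetical
    archive: /g/data/vk83/apps/om3-scripts/payu_config/archive.sh    # from the error above
    error: ./scripts/on-error.sh                                     # hypothetical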

anton-seaice commented 3 months ago

I think that means the only way to do it with a user script would be to trigger a qsub job on the copyq?
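Roughly what that could look like, as a sketch (the resource limits are illustrative, and the project/storage directives a real job would need are omitted):

#!/bin/bash
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=2GB
#PBS -l walltime=00:10:00
#PBS -l wd
# copyq jobs land on data-mover nodes with external network access,
# so the schema download/cache step succeeds here
module use /g/data/hh5/public/modules
module load conda/analysis3
python -c "from access_nri_intake import source"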

aidanheerdegen commented 3 months ago

BTW I am a big fan of the intake catalogue and the idea of automatically generating a catalogue so users can create a notebook that works with a local catalogue and then works seamlessly when/if the data is included in another catalogue at a later date.

Translation: it is worth taking some time to get this right, as I think we want this functionality more broadly across ACCESS model configurations.

anton-seaice commented 3 months ago

We'll look into including the schema with the intake package, see https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/185
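One possible shape for that, purely as a hypothetical sketch (the module path and file name are made up, not the actual access-nri-intake layout):

# Hypothetical: ship the schema as package data so no download is needed at import time
from importlib import resources
import json

def load_bundled_schema():
    # Read a JSON schema bundled inside the package, avoiding any network
    # access on compute nodes.
    schema_path = resources.files("access_nri_intake").joinpath("data/schema.json")
    with schema_path.open("r") as f:
        return json.load(f)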