payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Inconsistent state when experiment_uuid deleted #436

Closed aidanheerdegen closed 2 months ago

aidanheerdegen commented 2 months ago

I managed to get a payu control directory into an inconsistent state by deleting the experiment_uuid from metadata.yaml and then doing payu setup.

The payu version is the 431-Recover-from-incomplete-checkout branch from PR #435.

Details

I cloned a standard release config:

```
$ ~/.local/bin/payu clone -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs.git 1deg_jra55_ryf
Cloned repository from https://github.com/ACCESS-NRI/access-om2-configs.git to directory: /tmp/lala/1deg_jra55_ryf
Checked out branch: release-1deg_jra55_ryf
laboratory path: /scratch/tm70/aph502/access-om2
binary path: /scratch/tm70/aph502/access-om2/bin
input path: /scratch/tm70/aph502/access-om2/input
work path: /scratch/tm70/aph502/access-om2/work
archive path: /scratch/tm70/aph502/access-om2/archive
Updated metadata. Experiment UUID: 357368bb-2c6f-4b33-a7d5-38a8a2d6ab69
Added archive symlink to /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-357368bb
To change directory to control directory run:
  cd 1deg_jra55_ryf
$ cd 1deg_jra55_ryf/
```

So it has generated a new UUID:

```
$ ~/.local/bin/payu branch
* Current Branch: release-1deg_jra55_ryf
    experiment_uuid: 357368bb-2c6f-4b33-a7d5-38a8a2d6ab69
$ ls -ld archive
lrwxrwxrwx 1 aph502 tm70 86 Apr 9 22:32 archive -> /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-357368bb
```

At this stage it looks fine. I then removed the `experiment_uuid` from `metadata.yaml` and tried to run `payu setup`:

```
[aph502@gadi-login-02 1deg_jra55_ryf]$ payu setup
laboratory path: /scratch/tm70/aph502/access-om2
binary path: /scratch/tm70/aph502/access-om2/bin
input path: /scratch/tm70/aph502/access-om2/input
work path: /scratch/tm70/aph502/access-om2/work
archive path: /scratch/tm70/aph502/access-om2/archive
payu: error: work path already exists: /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf.
             payu sweep and then payu run
```

So it now thinks it is a legacy experiment and finds an existing `work` directory.

Force it to sweep and create a new `work` directory:

```
$ payu setup -f
laboratory path: /scratch/tm70/aph502/access-om2
binary path: /scratch/tm70/aph502/access-om2/bin
input path: /scratch/tm70/aph502/access-om2/input
work path: /scratch/tm70/aph502/access-om2/work
archive path: /scratch/tm70/aph502/access-om2/archive
payu: work path already exists. Sweeping as --force option is True.
Removing work path /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up access-om2
Checking exe and input manifests
File no longer in input directory: work/ice/RESTART/i2o.nc removing from manifest
File no longer in input directory: work/ice/RESTART/monthly_sstsss.nc removing from manifest
File no longer in input directory: work/ice/RESTART/u_star.nc removing from manifest
File no longer in input directory: work/ice/RESTART/kmt.nc removing from manifest
File no longer in input directory: work/ice/RESTART/grid.nc removing from manifest
File no longer in input directory: work/ice/RESTART/o2i.nc removing from manifest
Creating restart manifest
Updating full hashes for 181 files in manifests/restart.yaml
Writing manifests/input.yaml
Writing manifests/restart.yaml
```

Now the `archive` is pointing at the same location from the beginning, but the `work` is pointing to the legacy location:

```
$ ls -l
total 48
-rw-r----- 1 aph502 tm70 861 Apr 9 22:32 accessom2.nml
lrwxrwxrwx 1 aph502 tm70 86 Apr 9 22:32 archive -> /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-357368bb
drwxr-x--- 2 aph502 tm70 80 Apr 9 22:32 atmosphere
-rw-r----- 1 aph502 tm70 5173 Apr 9 22:32 config.yaml
drwxr-x--- 2 aph502 tm70 60 Apr 9 22:32 doc
drwxr-x--- 2 aph502 tm70 120 Apr 9 22:32 ice
-rw-r----- 1 aph502 tm70 18657 Apr 9 22:32 LICENSE
drwxr-x--- 2 aph502 tm70 100 Apr 9 22:32 manifests
-rw-r----- 1 aph502 tm70 2219 Apr 9 22:35 metadata.yaml
-rw-r----- 1 aph502 tm70 7904 Apr 9 22:32 namcouple
drwxr-x--- 2 aph502 tm70 120 Apr 9 22:32 ocean
-rw-r----- 1 aph502 tm70 1367 Apr 9 22:32 README.md
drwxr-x--- 3 aph502 tm70 60 Apr 9 22:32 testing
drwxr-x--- 2 aph502 tm70 100 Apr 9 22:32 tools
lrwxrwxrwx 1 aph502 tm70 51 Apr 9 22:35 work -> /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf
```
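
For readers following along, the two directory naming schemes visible in the listing above can be sketched as follows. This is only inferred from the paths printed in this issue (control directory name, branch name, and the first 8 characters of the UUID, versus the bare control directory name for a legacy experiment); the function `experiment_name` is hypothetical and is not taken from payu's source, which may handle names differently.

```python
def experiment_name(control_dir, branch=None, uuid=None):
    """Build an experiment directory name as it appears in the logs above.

    Assumption: the name is "<control_dir>-<branch>-<first 8 chars of uuid>"
    when a UUID is known, and just "<control_dir>" for a legacy experiment.
    This mirrors the observed paths only; it is not payu's implementation.
    """
    if branch is None or uuid is None:
        # Legacy layout: no branch or UUID suffix
        return control_dir
    return f"{control_dir}-{branch}-{uuid[:8]}"


if __name__ == "__main__":
    print(experiment_name("1deg_jra55_ryf",
                          "release-1deg_jra55_ryf",
                          "357368bb-2c6f-4b33-a7d5-38a8a2d6ab69"))
    print(experiment_name("1deg_jra55_ryf"))
```

Under that assumption, the `archive` symlink above uses the UUID-based name while the `work` symlink uses the legacy name, which is the mismatch being reported.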

Now the answer might be "don't do that", but it is quite a confusing situation to get into with a relatively small change.

jo-basevi commented 2 months ago

Do you know if ~/.local/bin/payu and the payu command were pointing to the same location? There are a few lines that I would have expected to be logged if the payu version was 431-Recover-from-incomplete-checkout.

E.g. in recently merged changes, if it finds an archive (legacy or not) it will always log the archive path, e.g.

Found experiment archive: /scratch/tm70/jb4202/access-om2/archive/test_inconsistent_state-release-1deg_jra55_ryf-14667dab

or a warning if there's no UUID in metadata (below, the UUID has been removed from metadata.yaml and a legacy archive exists with no corresponding metadata):

jb4202@gadi-login-03 ~/test_models/incomplete_checkout/test_inconsistent_state (release-1deg_jra55_ryf)$ payu setup
laboratory path:  /scratch/tm70/jb4202/access-om2
binary path:  /scratch/tm70/jb4202/access-om2/bin
input path:  /scratch/tm70/jb4202/access-om2/input
work path:  /scratch/tm70/jb4202/access-om2/work
archive path:  /scratch/tm70/jb4202/access-om2/archive
/home/189/jb4202/payu_fork/payu/metadata.py:125: MetadataWarning: No experiment uuid found in metadata. Generating a new uuid
  warnings.warn("No experiment uuid found in metadata. "
Found experiment archive: /scratch/tm70/jb4202/access-om2/archive/test_inconsistent_state
Updated metadata. Experiment UUID: a68e2ad3-1547-4b87-a95d-09163c51dcfd
payu: error: work path already exists: /scratch/tm70/jb4202/access-om2/work/test_inconsistent_state.
             payu sweep and then payu run

or on subsequent setups:

$ payu setup
laboratory path:  /scratch/tm70/jb4202/access-om2
binary path:  /scratch/tm70/jb4202/access-om2/bin
input path:  /scratch/tm70/jb4202/access-om2/input
work path:  /scratch/tm70/jb4202/access-om2/work
archive path:  /scratch/tm70/jb4202/access-om2/archive
Found experiment archive: /scratch/tm70/jb4202/access-om2/archive/test_inconsistent_state
payu: error: work path already exists: /scratch/tm70/jb4202/access-om2/work/test_inconsistent_state.
             payu sweep and then payu run

aidanheerdegen commented 2 months ago

So sorry to have wasted your time. Yes you are correct, I was using inconsistent versions of payu when testing.
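
A quick way to catch this kind of mix-up is to compare the executable that `payu` resolves to on `PATH` with the one invoked by explicit path. The following is a minimal sketch using only the Python standard library; the helper name `same_executable` is hypothetical and not part of payu.

```python
import os
import shutil


def same_executable(command, explicit_path):
    """Return True if `command` on PATH is the same file as `explicit_path`.

    Returns False if the command is not on PATH or the explicit path
    does not exist.
    """
    found = shutil.which(command)  # first match on PATH, or None
    if found is None:
        return False
    explicit = os.path.expanduser(explicit_path)
    if not os.path.exists(explicit):
        return False
    # samefile follows symlinks, so a symlinked install still counts as equal
    return os.path.samefile(found, explicit)


if __name__ == "__main__":
    print("payu on PATH:", shutil.which("payu"))
    print("same as ~/.local/bin/payu:",
          same_executable("payu", "~/.local/bin/payu"))
```

If the second line prints `False`, the bare `payu` command and `~/.local/bin/payu` are different installs, which is the situation described in this issue.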

If I redo those commands but use ~/.local/bin/payu setup instead, it works as you say and as I'd expect: it generates a new UUID and relinks archive and work to point at the new experiment directories:

$ ~/.local/bin/payu setup
laboratory path:  /scratch/tm70/aph502/access-om2
binary path:  /scratch/tm70/aph502/access-om2/bin
input path:  /scratch/tm70/aph502/access-om2/input
work path:  /scratch/tm70/aph502/access-om2/work
archive path:  /scratch/tm70/aph502/access-om2/archive
/home/502/aph502/code/python/payu/payu/metadata.py:125: MetadataWarning: No experiment uuid found in metadata. Generating a new uuid
  warnings.warn("No experiment uuid found in metadata. "
Mismatch of UUIDs between metadata and an archive metadata found at: /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf/metadata.yaml
/home/502/aph502/code/python/payu/payu/metadata.py:180: MetadataWarning: No pre-existing archive found. Generating a new uuid
  warnings.warn(
Updated metadata. Experiment UUID: d05616b0-e65b-4d71-a3cc-d4672ccb5e17
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Making exe links
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up access-om2
Checking exe and input manifests
Creating restart manifest
Writing manifests/restart.yaml
$ ls -ld work archive
lrwxrwxrwx 1 aph502 tm70 86 Apr 10 16:05 archive -> /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-383acc4b
lrwxrwxrwx 1 aph502 tm70 83 Apr 10 16:05 work -> /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf-release-1deg_jra55_ryf-d05616b0

Thanks for looking into that and finding my mistake @jo-basevi. Glad it was a stuff up on my part and not a code issue.

jo-basevi commented 2 months ago

The archive symlink is still pointing to an old archive. payu will use the new archive and recreate the symlink when archive runs as part of payu run, or on payu checkout $SAME_BRANCH. Could we add a line to payu setup to set up the archive symlink?
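
Until then, a stale symlink like this can be spotted from the control directory. The sketch below is only an illustration: it assumes `experiment_uuid` is a top-level key in `metadata.yaml` and that the symlink target's directory name ends with the first 8 characters of the UUID, as in the listings in this issue. Both helper functions are hypothetical and not part of payu.

```python
import os
import re


def read_experiment_uuid(metadata_path):
    """Return the experiment_uuid value from a metadata.yaml-style file, or None.

    Assumption: the key is written at top level as "experiment_uuid: <uuid>".
    A simple line match is used here to avoid depending on a YAML library.
    """
    pattern = re.compile(r"^experiment_uuid:\s*['\"]?([0-9a-fA-F-]+)['\"]?\s*$")
    with open(metadata_path) as f:
        for line in f:
            match = pattern.match(line)
            if match:
                return match.group(1)
    return None


def symlink_matches_uuid(link_path, uuid):
    """Return True if the symlink target's name ends with the UUID's first 8 chars."""
    if uuid is None or not os.path.islink(link_path):
        return False
    target = os.readlink(link_path)
    name = os.path.basename(target.rstrip("/"))
    return name.endswith("-" + uuid[:8])


if __name__ == "__main__":
    uuid = read_experiment_uuid("metadata.yaml")
    for link in ("archive", "work"):
        print(link, "matches metadata UUID:", symlink_matches_uuid(link, uuid))
```

Run against the control directory in the comment above, this would be expected to report a mismatch for `archive` (target ends in `383acc4b`) and a match for `work` (target ends in `d05616b0`).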