payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Branching OM2 from existing restart cannot find `cice_in.nml` #499

Open blimlim opened 1 week ago

blimlim commented 1 week ago

Branching an OM2 experiment from a previous simulation's restart produces an error at the payu setup stage. Following the instructions here, i.e.

gh repo clone ACCESS-Community-Hub/access-om2-1deg_jra55_ryf-example
cd access-om2-1deg_jra55_ryf-example
payu checkout -r /g/data/nf33/public/training-day-2024/payu-training/experiments/20240827-release-preindustrial+concentrations-run-0225dcf2/restart020 -b perturb1 0f2e2bb

followed by

payu setup

leads to the following error:

...
Setting up atmosphere
Setting up ocean
Setting up ice
payu: error: Cannot find prior namelist cice_in.nml

This error is raised in the cice.py driver: https://github.com/payu-org/payu/blob/2826621a83bcb13315d3a003a946c5ca8069b1cd/payu/models/cice.py#L210-L219

cice.py first searches the restart directory and then the previous output directory for a cice_in.nml file. OM2 simulations don't make a copy of cice_in.nml in the restart directories, and hence when branching from a separate restart, payu won't find this file.
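The lookup order described above can be sketched roughly as follows. This is a simplified illustration, not the actual payu code: `find_prior_namelist` and its arguments are hypothetical names, and the real driver reports the failure through payu's own error handling rather than an exception like this one.

```python
import os
import tempfile


def find_prior_namelist(restart_dir, prior_output_dir, name="cice_in.nml"):
    """Return the first existing copy of `name`, checking the restart
    directory before the previous output directory (hypothetical helper)."""
    for directory in (restart_dir, prior_output_dir):
        if directory is None:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"Cannot find prior namelist {name}")


# Branching case: the restart directory exists but holds no cice_in.nml,
# and the new experiment has no previous output directory yet.
with tempfile.TemporaryDirectory() as restart_dir:
    try:
        find_prior_namelist(restart_dir, None)
        branch_lookup_failed = False
    except FileNotFoundError:
        branch_lookup_failed = True
```

In the branching case neither location has the file, which matches the error seen at the setup stage.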


A quick fix might be to edit the cice5.py driver so that cice_in.nml is always copied to the restart directory. The access.py driver already does this for ESM1.5, so we might be able to do this step for both OM2 and ESM1.5 directly from cice.py.

However, the output directory/restart directory version of cice_in.nml might not be necessary for OM2. Payu uses it to calculate a run length here, but I think this uses a different timestep to the one actually used by the model. E.g. I ran a 5 month simulation a while ago and the ice output directory contains:

cd output000/ice
grep npt *
cice_in.nml:    npt = 35040.0 (from the payu cice.py calculation)
ice_diag.d:  npt                       =     2416 (Equals 151 days assuming 1.5h timestep as specified in accessom2.nml)
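
As a quick check of the arithmetic in the ice_diag.d annotation, using the 1.5 h timestep quoted above (the helper name `n_steps` is only for illustration):

```python
def n_steps(days, dt_hours):
    """Number of model timesteps needed to cover `days` at `dt_hours` per step."""
    return days * 24 / dt_hours


# 151 days at a 1.5 h timestep, to compare with npt in ice_diag.d
steps_five_months = n_steps(151, 1.5)
print(steps_five_months)  # 2416.0
```

This reproduces the 2416 reported by the model, not the 35040 written to cice_in.nml.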

If the information from the output directory cice_in.nml isn't used, I'm wondering if we should stop reading it at all for OM2 simulations? We'd just need to think about the best way to handle ESM1.5 which does read it from the restart directory (but perhaps also might not need to?).

anton-seaice commented 1 week ago

It looks like the restart time is read from the restart file in CICE5:

https://github.com/ACCESS-NRI/cice5/blob/ca5a71cd89151be43ec9c7376dddb1efa347be5b/io_pio/ice_restart.F90#L84

Do we need this total_runtime variable for other uses?

aidanheerdegen commented 1 week ago

In the first instance, should anyone using OM2 who encounters this problem use an older version of payu?

This points out a lack of testing coverage. When trying to fix this we need to add test cases that fail and fix them with the PR.

blimlim commented 1 week ago

Do we need this total_runtime variable for other uses?

It looks like it's not being used for OM2. The CICE driver uses it to set istep0 in the work directory cice_in.nml namelist, but as you noted this is getting replaced using the restart file's date:

grep istep0 output001/*
cice_in.nml:    istep0 = 35040
ice_diag.d:  istep0                    =     2416

I think CICE is recalculating istep0 based on the accessom2.nml file here: https://github.com/ACCESS-NRI/cice5/blob/ca5a71cd89151be43ec9c7376dddb1efa347be5b/source/ice_init.F90#L466-L470

Because the runtime isn't set in the config.yaml, payu just increments the value of istep0 by the value of npt each run. This doesn't match up with the number of timesteps of the actual run anyway, at least for the 1deg_jra55 config (here npt = 35040 while the number of timesteps for the default 5 year run time would be around 29200), so I'm pretty sure the value payu calculates for OM2 is meaningless.

In the first instance, should anyone using OM2 who encounters this problem use an older version of payu?
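
To make the mismatch concrete, here is a rough sketch of the bookkeeping described above. The numbers are the ones quoted in this thread (npt = 35040 from payu, a 1.5 h timestep, a 5 year default run length); 365-day years are an assumption on my part, and the variable names are only illustrative.

```python
PAYU_NPT = 35040                     # npt written by payu to cice_in.nml
ACTUAL_STEPS = 5 * 365 * 24 / 1.5    # 5 years (assumed 365 days each) at 1.5 h

payu_istep0 = 0
actual_istep0 = 0
for run in range(1, 4):
    # payu adds npt after every run, regardless of the true run length
    payu_istep0 += PAYU_NPT
    actual_istep0 += ACTUAL_STEPS
    print(run, payu_istep0, actual_istep0)
```

The two counters drift further apart with every run, so the istep0 payu writes can't correspond to the model's own step count.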

I think unfortunately the same problem will happen with older versions of payu too. Since the cice_in.nml file is getting moved into the output instead of the restart directory after OM2 runs, this file would have been missing whenever branching from a different run's OM2 restart.

blimlim commented 1 week ago

Let me know if the following change sounds reasonable:

This would mean OM2 would just use the cice_in.nml file directly from the control directory, and still use the timing information from the accessom2.nml file. Meanwhile the istep0 and npt calculations would still be used for ESM1.5 – we could probably investigate further whether they are actually needed for ESM1.5, but I think that might make sense to cover in a separate issue.

anton-seaice commented 1 week ago

Can we move the npt and istep0 calculation to a new function within the cice class (e.g. _calc_runtime())? And then in the cice5 class just replace that function with an empty function? Is that neater?
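
Something like the following, as a very rough sketch. The class names, the `setup_nml` attribute and the body of `_calc_runtime` are placeholders here, not the real payu driver code:

```python
class Cice:
    """Stand-in for the general CICE driver (not the real payu class)."""

    def __init__(self):
        self.setup_nml = {}

    def _calc_runtime(self):
        # Placeholder for the existing npt / istep0 calculation
        self.setup_nml["istep0"] = 0
        self.setup_nml["npt"] = 0

    def setup(self):
        self._calc_runtime()


class Cice5(Cice):
    """Stand-in for the cice5 driver used by OM2."""

    def _calc_runtime(self):
        # OM2 takes its timing from accessom2.nml, so leave the namelist alone
        pass
```

With this layout, `Cice5().setup()` would leave the namelist untouched, while the base class keeps its current behaviour.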

blimlim commented 1 week ago

Can we move the npt and istep0 calculation to a new function within the cice class (e.g. _calc_runtime())? And then in the cice5 class just replace that function with an empty function? Is that neater?

I think this would work too. My main concern would be with quickly knowing what the code is doing when glancing over it. I.e. I think I'd easily make the mistake of only reading the cice.py driver and not realising that some of the methods were being overridden by the cice5 driver. I guess this could be avoided with comments in the code though.

It's overkill for this issue, but I wonder if down the line it would be cleanest to have a separate cice4 driver, so that anything needed by all versions of cice is done in the general cice.py driver, with version-specific additions done in the cice4/5 drivers.

I'll have a go at adding in the new function and the cice5 override of it, and see how it looks!

anton-seaice commented 1 week ago

Thanks @blimlim - I started drafting some tests but didn't finish them. I can share them if it's useful?

blimlim commented 1 week ago

That would be great, thank you!

anton-seaice commented 1 week ago

See https://github.com/payu-org/payu/compare/iss499

I think the test_clone test is OK, but we need a test_restart_clone test too, presumably cloning from an archive/restartXXX directory.