Closed (blimlim closed this 2 months ago)
An option that @aidanheerdegen brought up is to try to cut the current complicated timing calculations and replace them with a completely new setup that uses an "artificial" calendar file in the restart directory, say cice.res.yaml. This would mirror the behaviour of the UM time controls, which use the um.res.yaml file (usually; we might need to think about this more eventually).
The overall steps for setting cice's start date and run length could be:
<prior-restart>/ice/cice.res.yaml.
config.yaml.
caltype setting in <control-dir>/ice/input_ice.nml
<control-dir>/ice/input_ice.nml and <control-dir>/ice/cice_in.nml into <work-directory>/ice and write the newly calculated start date and run length into the copies
...cice.res.yaml file in the new restart directory.

This means that we would no longer need copies of the input_ice.nml and cice_in.nml files in the restart directory, and that all the information for setting the start date would live in the restart directory rather than being spread across the restart and control directories.
Does this sound like a reasonable overall approach? I'm happy to try and start implementing this, however there are a few implementation details/rough patches I'm anticipating will come up.
Implementation details/concerns:
1. Can we assert in access.py that the experiment must have a prior restart directory (either as a restartXYZ directory in the archive directory or a restart path in the config.yaml)? From what I can understand, ESM1.5 will crash if it can't find the restart files, however there are a couple of niche situations where payu will allow a restart directory to be absent.
Based on what I understand of the following:
payu will let ESM try to run without a prior_restart_dir in the following two cases:
repeat setting in the config.yaml, and B: no restart directory is specified in the config.yaml
self.counter is 0 and B: no restart path is specified in the config.yaml (and C: restart-01 does not exist in the archive/restart directory).

Trying to run in these situations, the model crashes because of the missing restarts. Are there any niche ESM situations where either of the above two situations needs to be supported/allowed? Or is it safe to assert in the access.py driver that a prior_restart_path must exist?
2. Currently two separate timing calculations occur in different parts of payu: one in the access.py driver and the second (mostly but not completely ignored) one in the cice.py driver. Details about the differences between the two are discussed here. I think it would be good to simplify this so that for ESM1.5 experiments, only one calculation occurs using the artificial restart date file.
Would we prefer this calculation to occur in the access.py driver (where the main calculation currently is), or is it better to move it into the cice.py driver?
I suspect we need to keep the second calculation in the cice.py driver in case of standalone cice simulations (I'm not 100% on this – is this something that the cice driver is set up to do?), and so we'll need to add in some logic about which calculation to use in the different situations.
3. In the standalone cice case, where timing calculations are completely controlled by the cice.py driver: do we want to refactor this to also use a new cice.res.yaml file for the start date?
I think this would be nice to do and would help make everything simpler. However, it may involve some work in understanding the different situations that the current calculation and logic paths are set up for, as it would be bad to accidentally break any of them...
4. Consistency checks. I think the easiest way to check consistency between the different model start dates would be to compare each of their restart date text/yaml files (um.res.yaml, <NEW>cice.res.yaml, ocean_solo.res). Eventually we might want to add internal consistency checks for cice. The binary iced.YYYYMMDD restart files contain a timestamp, and if this doesn't match the time set in the model namelists, cice will still run but with the time all messed up.
Similar internal consistency checks might also be useful for the UM, as the actual restart dump restart_dump.astart contains date information as well as the um.res.yaml calendar file. I haven't tested what happens if they don't match yet.
5. Caltype. If we swap to an artificial restart date text or yaml file, the ice model could still get the timing wrong if the caltype setting in input_ice.nml somehow got changed. I think this is unlikely, but we could guard against it by requiring in the access.py driver that it is Gregorian for ESM1.5 experiments. Or could that be too inflexible and possibly impact future configurations?
6. If we set the ice model start date using a text file in the restart directory, it looks like there are some specific situations where payu will overwrite the prior_restart_path, and hence the timing calculations could be affected.
In the cice.py driver, the cice model's prior_restart_path is overwritten if the restart_dir setting in <control-directory>/ice/cice_in.nml happens to be absolute, or if it's not absolute, the restart_dir setting joined with any of the cice input paths is a real directory (I think).
If this happened, the timing calculation might be impacted. Since the current calculations also use data in the restart directory, I don't think this would be a new issue though.
7. This change will be incompatible with the warm-start.sh scripts people use to modify experiment start dates. Are we hoping to release an NRI-supported version of these scripts eventually?
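The consistency check from point 4 could be sketched roughly as below. This is a minimal illustration and not payu code: the function name check_start_dates is hypothetical, and it assumes the start dates have already been parsed out of um.res.yaml, the new cice.res.yaml and ocean_solo.res into datetime.date objects (the parsing is left out because it depends on each file's format).

```python
from datetime import date


def check_start_dates(start_dates):
    """Raise ValueError if the submodel start dates are not all identical.

    start_dates: dict mapping a submodel name to its start date
    (hypothetical interface, dates assumed to be parsed already).
    """
    if len(set(start_dates.values())) > 1:
        details = ", ".join(
            f"{name}={start.isoformat()}"
            for name, start in sorted(start_dates.items())
        )
        raise ValueError(f"Submodel start dates disagree: {details}")


# Example: UM and MOM agree, but cice is 100 years behind.
try:
    check_start_dates({
        "um": date(101, 1, 1),
        "mom": date(101, 1, 1),
        "cice": date(1, 1, 1),
    })
except ValueError as err:
    print(err)
```

Failing before the model is launched would surface a mismatch at setup time rather than as a crash several hundred model years later.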
A much simpler suggestion from @anton-seaice: Rather than reading the init_date from <control-dir>/ice/input_ice.nml, make sure this field is saved to <restart-dir>/ice/input_ice.nml. This would prevent the mixing of information from the control and restart directories. I think it would also be easier to implement: it doesn't create any new artificial files and should (I think) be able to work with most of the existing timing calculations in the drivers, hopefully reducing the risk of messing up anything that's already there.
I'll have a go at implementing this and seeing how it goes.
I wondered if it will work ok with the existing warm-start scripts too?
I believe <restart-dir>/ice/input_ice.nml is only used by payu as a reference for changing a couple of fields in <control-dir>/ice/input_ice.nml? So adding a field here doesn't affect anything except how payu handles it. Ideally the models would just read this date from the restart files rather than needing an extra configuration item.
I've started putting this together in this feature branch (https://github.com/ACCESS-NRI/payu/tree/466-esm1p5-cice-startdate-fix) and will do pull requests to that in smaller steps.
The first step has been to rename the <restart_dir>/ice/input_ice.nml file to <restart_dir>/ice/restart_date.nml (happy to use other names), and move the experiment initialisation date init_date setting to it:
<restart_dir>/ice/restart_date.nml
----------------------------------
&coupling
init_date=10101
runtime0=3155673600
runtime=0
/
I've then modified payu to read init_date from this file rather than <work_dir>/ice/input_ice.nml, so that it gets all the information for setting the start date from the restart directory. All the calculations in payu otherwise remain exactly the same.
This method seems to work and has a couple of benefits:
init_date, runtime0 and runtime all mean to understand what the starting date is going to be. Also requires messing around with calendar calculations if you need to set up a new restart directory.

A rough version of this option is available here: https://github.com/ACCESS-NRI/payu/tree/466b-cice-move-init_date
An alternative is to modify the setup so that <restart_dir>/ice/restart_date.nml contains
&coupling
init_date=10101
inidate=1010101
/
payu can then just read the new start date from inidate. This makes it more transparent what the start date for the new simulation actually is, and makes it easier to set up new restart directories. The calculation of runtime0 for use with the next simulation (i.e. the time between the init_date and inidate) can then be handled completely internally by payu.
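As a rough sketch of that internal calculation, the snippet below derives runtime0 as the number of seconds between init_date and inidate. It is only an illustration: the function names are hypothetical, the dates are assumed to be integers in YYYYMMDD form with leading zeros dropped (as in the namelist above), and it uses Python's standard-library proleptic Gregorian calendar rather than payu's own calendar handling, so it only applies to a Gregorian caltype.

```python
from datetime import date

SECONDS_PER_DAY = 86400


def parse_yyyymmdd(value):
    """Convert an integer such as 1010101 (0101-01-01) into a date."""
    year, rest = divmod(int(value), 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


def runtime0_seconds(init_date, inidate):
    """Seconds elapsed between the experiment init_date and the run start."""
    elapsed = parse_yyyymmdd(inidate) - parse_yyyymmdd(init_date)
    if elapsed.days < 0:
        raise ValueError("inidate is earlier than init_date")
    return elapsed.days * SECONDS_PER_DAY


print(runtime0_seconds(10101, 1010101))  # 3155673600
```

With init_date=10101 and inidate=1010101 this gives the same runtime0=3155673600 that appears in the first version of restart_date.nml, so the two layouts carry the same information.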
I think a couple of downsides to this approach are:
A rough version of this option is available here: https://github.com/ACCESS-NRI/payu/tree/466a-cice-start-noruntime0
It would be great to hear what everyone thinks of these approaches, and whether you have a preference between the two.
In either case, I think the next step would be to pull the code which reads/calculates the start date into a separate method, which could then be used to check whether the start dates are consistent between the submodels.
Closing following #484, and further work to be done in #495
This relates to discussion in #457.
Background:
Researchers have had problems with long ESM1.5 paleo simulations crashing in calendar year 400. See here and here. In the examples, the coupler and sea ice model appear to think there are only 365 days in the year while the ocean and atmosphere use the correct 366 days, leading to the crash, and in the first example, the ice model thinks it's at year 300.
The disagreement between the ice/namcouple and the other components doesn't occur in every simulation that reaches year 400 though. E.g. attempting to reproduce the error by branching from the CSIRO pre-industrial run at year 400, and setting it to run for 3 months, payu gives the namcouple file the correct leap year run length of 91 days.

After working through some of Himadri's simulations, it looks like the calendar mismatch comes from payu's start date calculation for the cice submodel. It pulls in information from both the control directory and the restart directory, meaning that if you copy a restart directory across different experiments, you could end up with inconsistent start dates. I still find it a bit confusing, so I hope the following explanation makes some sense.
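To make the day-count disagreement concrete, here is a small sketch of how the run length for the same three-month run depends on which year a component believes it is in. It is not payu's calculation: run_length_days is a hypothetical helper, it uses Python's standard-library proleptic Gregorian calendar, and it assumes the run starts on a day of the month that also exists in the end month (e.g. the 1st).

```python
from datetime import date


def run_length_days(start, months):
    """Days covered by `months` whole months beginning at `start`."""
    total_months = start.month - 1 + months
    end = date(start.year + total_months // 12,
               total_months % 12 + 1,
               start.day)
    return (end - start).days


# Year 400 is a Gregorian leap year (divisible by 400); year 300 is not.
print(run_length_days(date(400, 1, 1), 3))   # 91
print(run_length_days(date(300, 1, 1), 3))   # 90
print(run_length_days(date(400, 1, 1), 12))  # 366
print(run_length_days(date(300, 1, 1), 12))  # 365
```

A component whose calendar is 100 years behind therefore expects one day fewer than the others as soon as the run crosses February of year 400.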
How payu sets the cice and namcouple run lengths/start dates:
The cice start date and run length are (mostly) calculated in the access.py driver. It uses the <CONTROL-DIRECTORY>/ice/input_ice.nml namelist, which for example in our pre-industrial configuration looks like
in addition to <RESTART-DIRECTORY>/ice/input_ice.nml, which for the pre-industrial configuration looks like
I've highlighted the variables that access.py uses for the calculation with the * symbols. The other variables are ignored.

To set the start date, it adds a total simulation length of runtime0+runtime seconds from <RESTART-DIRECTORY>/ice/input_ice.nml to the init_date from <CONTROL-DIRECTORY>/ice/input_ice.nml:
https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/models/access.py#L138-L142

To calculate the run duration for the next experiment, it then uses this start date, the caltype value from <CONTROL-DIRECTORY>/ice/input_ice.nml, and the runtime settings in the config.yaml file:
https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/models/access.py#L151-L157

The resulting runtime then gets used by cice and the coupler.
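The start date step described above can be sketched as follows. This is a simplified stand-in for the linked access.py lines, not a copy of them: the function names are hypothetical, init_date is assumed to be an integer in YYYYMMDD form with leading zeros dropped, and the standard-library proleptic Gregorian calendar is used in place of payu's own calendar handling (so only a Gregorian caltype is covered).

```python
from datetime import date, timedelta


def parse_init_date(init_date):
    """Convert an integer such as 10101 (0001-01-01) into a date."""
    year, rest = divmod(int(init_date), 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


def cice_start_date(init_date, runtime0, runtime):
    """Start date = init_date plus the seconds simulated so far.

    init_date comes from the control directory namelist, while
    runtime0 and runtime come from the restart directory namelist.
    """
    elapsed = timedelta(seconds=runtime0 + runtime)
    return parse_init_date(init_date) + elapsed


# 3155673600 s is 36524 days, i.e. 100 Gregorian years from 0001-01-01.
print(cice_start_date(10101, 3155673600, 0).isoformat())  # 0101-01-01
```

Because the two inputs live in different directories, changing either one on its own shifts the result, which is the weakness the rest of this issue is about.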
How problems can arise
Because the calculation uses the init_date from the control directory, copying a restart directory between different experiments can lead to different cice start dates (and hence run times) if the <CONTROL-DIRECTORY>/ice/input_ice.nml files don't match. This can come up when using the warm-start.sh scripts to create a new restart directory based on a CSIRO simulation, which was done for some of the linked examples.

The warm-start.sh scripts modify the cice start date by setting init_date in <CONTROL-DIRECTORY>/ice/input_ice.nml to the desired start date (for example 01010101), and runtime0=0, runtime=0 in <RESTART-DIRECTORY>/ice/input_ice.nml. Running the resulting configuration will then start the ice calendar at 0101-01-01.

If you then start a new experiment by cloning e.g. the pre-industrial simulation, and copying over the restart directory already created by the warm-start.sh scripts, the new control directory will still have the unmodified init_date=00010101, while the restart folder will have the modified runtime0=0, runtime=0, and cice will use a start date of 0001-01-01, 100 years off where it's meant to be.

Meanwhile, the UM and MOM have their start dates given completely in the restart directory via the um.res.yaml and ocean_solo.res files, and so they'll use the correct date of 0101-01-01 (there are some caveats for the UM in other situations). Once the simulation gets to year 400 the mismatch and crash can then occur.

Possible changes
A couple of ideas listed below:
init_date settings into the namelists in the restart directory and make corresponding changes to payu.

It would be great to get any other ideas/opinions on possible changes!