payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Year 400 crashes in ESM1.5 simulations #466

Closed blimlim closed 4 days ago

blimlim commented 1 month ago

This relates to discussion in #457.

Background:

Researchers have had problems with long ESM1.5 paleo simulations crashing in calendar year 400. See here and here. In the examples, the coupler and sea ice model appear to think there are only 365 days in the year while the ocean and atmosphere use the correct 366 days, leading to the crash. In the first example, the ice model also thinks it's at year 300.

The disagreement between the ice/namcouple and the other components doesn't occur in every simulation that reaches year 400 though. For example, when attempting to reproduce the error by branching from the CSIRO pre-industrial run at year 400 and setting it to run for 3 months, payu gives the namcouple file the correct leap-year run length of 91 days.

After working through some of Himadri's simulations, it looks like the calendar mismatch comes from payu's start date calculation for the cice submodel. It pulls in information from both the control directory and the restart directory, meaning that if you copy a restart directory across different experiments, you could end up with inconsistent start dates. I still find it a bit confusing, so I hope the following explanation makes some sense.

How payu sets the cice and namcouple run lengths/start dates:

The cice start date and run length are (mostly) calculated in the access.py driver. It uses the <CONTROL-DIRECTORY>/ice/input_ice.nml namelist, which for example in our pre-industrial configuration looks like

<CONTROL-DIRECTORY>/ice/input_ice.nml
---------------------------------------------
&coupling
 *** caltype=1 ***
 jobnum=2
 inidate=01010101
 *** init_date=00010101 ***
 runtime0=3155673600
 runtime=86400
...

in addition to <RESTART-DIRECTORY>/ice/input_ice.nml, which for the pre-industrial configuration looks like

<RESTART-DIRECTORY>/ice/input_ice.nml
---------------------------------------------
&coupling
 *** runtime0=3155673600 ***
 *** runtime=0 ***

I've highlighted the variables that access.py uses for the calculation with the * symbols. The other variables are ignored.

To set the start-date, it adds a total simulation length of runtime0+runtime seconds from <RESTART-DIRECTORY>/ice/input_ice.nml to the init_date from <CONTROL-DIRECTORY>/ice/input_ice.nml:

https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/models/access.py#L138-L142
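As a rough illustration of the calculation described above (not payu's actual code, which uses its own calendar handling), the start date can be sketched with Python's standard library. The function name `cice_start_date` is hypothetical, and the sketch assumes a proleptic Gregorian calendar and that `init_date` is read as an integer in YYYYMMDD form:

```python
from datetime import date, timedelta


def cice_start_date(init_date, runtime0, runtime):
    """Return init_date advanced by runtime0 + runtime seconds.

    init_date is an integer in YYYYMMDD form (e.g. 10101 for 0001-01-01).
    Hypothetical helper; assumes a proleptic Gregorian calendar.
    """
    year, month_day = divmod(init_date, 10000)
    month, day = divmod(month_day, 100)
    return date(year, month, day) + timedelta(seconds=runtime0 + runtime)


# Values from the pre-industrial configuration shown above:
# init_date from the control directory, runtime0/runtime from the restart.
start = cice_start_date(10101, 3155673600, 0)
print(start.isoformat())  # 0101-01-01
```

Here 3155673600 seconds is 36524 days, which is exactly 100 Gregorian years counted from 0001-01-01, so the result agrees with the inidate=01010101 entry in the control namelist.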

To calculate the run duration for the next experiment, it then uses this start date, the caltype value from <CONTROL-DIRECTORY>/ice/input_ice.nml, and the runtime settings in the config.yaml file:

https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/models/access.py#L151-L157
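To show why the start date matters for the run length, here is a minimal sketch of a calendar-aware duration calculation. It is not payu's implementation: the helper names are hypothetical, the caltype values (0 for no-leap, 1 for Gregorian) are an assumption based on the caltype=1 setting in the example namelist, and only whole-month run lengths are handled.

```python
import calendar
from datetime import date

NOLEAP, GREGORIAN = 0, 1  # assumed caltype values; the example namelist uses caltype=1


def add_months(start, months):
    """Shift a date forward by whole months (assumes the day exists in the target month)."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, start.day)


def leap_days_between(start, end):
    """Count the 29 Februaries falling in [start, end)."""
    return sum(
        1
        for year in range(start.year, end.year + 1)
        if calendar.isleap(year) and start <= date(year, 2, 29) < end
    )


def run_length_seconds(start, months, caltype=GREGORIAN):
    """Length in seconds of a run of `months` months beginning at `start`."""
    end = add_months(start, months)
    days = (end - start).days
    if caltype == NOLEAP:
        days -= leap_days_between(start, end)
    return days * 86400


# A 3-month run from 1 January: year 400 is a leap year, year 300 is not.
print(run_length_seconds(date(400, 1, 1), 3) // 86400)  # 91
print(run_length_seconds(date(300, 1, 1), 3) // 86400)  # 90
```

A component that believes it is in year 300 therefore computes a shorter run than one that is correctly in year 400, even though both use the Gregorian calendar.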

The resulting runtime then gets used by cice and the coupler.

How problems can arise

Because the calculation uses the init_date from the control directory, copying a restart directory between different experiments can lead to different cice start dates (and hence run times) if the <CONTROL-DIRECTORY>/ice/input_ice.nml files don't match. This can come up when using the warm-start.sh scripts to create a new restart directory based on a CSIRO simulation, which was done for some of the linked examples.

The warm-start.sh scripts modify the cice start date by setting init_date in <CONTROL-DIRECTORY>/ice/input_ice.nml to the desired start date (for example 01010101), and runtime0=0, runtime=0 in <RESTART-DIRECTORY>/ice/input_ice.nml. Running the resulting configuration will then start the ice calendar at 0101-01-01.

If you then start a new experiment by cloning e.g. the pre-industrial simulation, and copying over the restart directory already created by the warm-start.sh scripts, the new control directory will still have the unmodified init_date=00010101, while the restart folder will have the modified runtime0=0, runtime=0, and cice will use a start date of 0001-01-01, 100 years off where it's meant to be.

Meanwhile, the UM and MOM have their start dates given completely in the restart directory via the um.res.yaml and ocean_solo.res file, and so they'll use the correct date of 0101-01-01 (there are some caveats for the UM in other situations). Once the simulation gets to year 400 the mismatch and crash can then occur.
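The scenario above can be reproduced in a small sketch. The dictionaries stand in for the `&coupling` groups of the two input_ice.nml files, the function name `start_date` is hypothetical, and a proleptic Gregorian calendar is assumed:

```python
from datetime import date, timedelta


def start_date(control_nml, restart_nml):
    """Combine init_date (control directory) with runtime0 + runtime (restart directory)."""
    year, month_day = divmod(control_nml["init_date"], 10000)
    month, day = divmod(month_day, 100)
    elapsed = restart_nml["runtime0"] + restart_nml["runtime"]
    return date(year, month, day) + timedelta(seconds=elapsed)


# Restart directory produced by warm-start.sh for a 0101-01-01 start.
warm_restart = {"runtime0": 0, "runtime": 0}

# Control directory edited by warm-start.sh, versus a fresh clone of the
# pre-industrial configuration.
warm_control = {"init_date": 1010101}
cloned_control = {"init_date": 10101}

intended = start_date(warm_control, warm_restart)
actual = start_date(cloned_control, warm_restart)
print(intended.isoformat(), actual.isoformat())  # 0101-01-01 0001-01-01
print(intended.year - actual.year)  # 100
```

With the ice calendar 100 years behind, the ice model is at year 300 when the UM and MOM reach year 400, and 300 is not a leap year in the Gregorian calendar.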

Possible changes

A couple of ideas listed below:

  1. Consistency checks. During the setup, payu could compare the start dates and run lengths for each component and produce an error if they aren't the same.
  2. All CICE start date settings moved to the restart directory. Conceptually should a restart directory contain all the start date information, rather than half being in the control directory? E.g. we could move the init_date settings into the namelists in the restart directory and make corresponding changes to payu.
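The first idea could look something like the sketch below. It assumes each component's start date has already been read into a `datetime.date`; the function name `check_start_dates` and the component labels are hypothetical, not existing payu names.

```python
from datetime import date


def check_start_dates(start_dates):
    """Raise ValueError if the components do not all share one start date.

    start_dates maps a component name to its datetime.date start date.
    """
    if len(set(start_dates.values())) > 1:
        details = ", ".join(
            f"{name}={when.isoformat()}" for name, when in sorted(start_dates.items())
        )
        raise ValueError(f"Inconsistent model start dates: {details}")


# Dates as in the warm-start scenario described above.
try:
    check_start_dates(
        {"um": date(101, 1, 1), "mom": date(101, 1, 1), "cice": date(1, 1, 1)}
    )
except ValueError as err:
    print(err)
```

Failing at setup in this way would surface the mismatch immediately rather than several hundred model years later.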

It would be great to get any other ideas/opinions on possible changes!

blimlim commented 1 month ago

An option that @aidanheerdegen brought up is to try to cut the current complicated timing calculations, and replace them with a completely new setup that uses an "artificial" calendar file that goes in the restart directory, say cice.res.yaml. This would mirror the behaviour of the UM time controls, which use the um.res.yaml file (usually - we might need to think about this more eventually).

The overall steps for setting cice's start date and run length could be:

  1. Read the start date from <prior-restart>/ice/cice.res.yaml.
  2. Read the run length from config.yaml.
  3. Calculate the run length in seconds using: the above start date, the above run length, the caltype setting in <control-dir>/ice/input_ice.nml
  4. Copy <control-dir>/ice/input_ice.nml and <control-dir>/ice/cice_in.nml into <work-directory>/ice and write the newly calculated start date and run length into the copies ...
  5. At the end of the run, write the start date + run length into a new cice.res.yaml file in the new restart directory.
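A minimal sketch of steps 1, 2, 3 and 5 follows. Everything about the file layout is an assumption: the single `end_date` key is invented for illustration, and the file is read and written with plain string handling so the sketch has no YAML dependency. Step 4 (writing the namelists) and the caltype handling are left out, so the run length here is Gregorian only.

```python
import tempfile
from datetime import date
from pathlib import Path


def read_restart_date(path):
    """Read a single 'end_date: YYYY-MM-DD' line (hypothetical file layout)."""
    key, _, value = Path(path).read_text().strip().partition(":")
    if key.strip() != "end_date":
        raise ValueError(f"Unexpected key in {path}: {key!r}")
    year, month, day = (int(part) for part in value.strip().split("-"))
    return date(year, month, day)


def write_restart_date(path, when):
    """Write the date in the same single-line layout."""
    Path(path).write_text(f"end_date: {when.isoformat()}\n")


def add_months(start, months):
    """Shift a date forward by whole months (assumes the day exists in the target month)."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, start.day)


with tempfile.TemporaryDirectory() as tmp:
    prior = Path(tmp) / "restart000" / "ice" / "cice.res.yaml"
    new = Path(tmp) / "restart001" / "ice" / "cice.res.yaml"
    prior.parent.mkdir(parents=True)
    new.parent.mkdir(parents=True)
    write_restart_date(prior, date(400, 1, 1))

    start = read_restart_date(prior)          # step 1
    run_months = 3                            # step 2 (would come from config.yaml)
    end = add_months(start, run_months)
    run_seconds = (end - start).days * 86400  # step 3 (Gregorian only)
    write_restart_date(new, end)              # step 5
    print(run_seconds, read_restart_date(new))
```

The point of the design is that the only date input is the file in the prior restart directory, so copying a restart directory between experiments carries the full start date with it.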

This means that we would no longer need copies of the input_ice.nml and cice_in.nml files in the restart directory, and that all the information for setting the start date would live in the restart directory rather than being spread across the restart and control directories.

Does this sound like a reasonable overall approach? I'm happy to try and start implementing this, however there are a few implementation details/rough patches I'm anticipating will come up.


Implementation details/concerns:

1. Can we assert in access.py that the experiment must have a prior restart directory (either as a restartXYZ directory in the archive directory or a restart path in the config.yaml)? From what I can understand, ESM1.5 will crash if it can't find the restart files, however there are a couple of niche situations where payu will allow a restart directory to be absent.

Based on what I understand of the following:

https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/experiment.py#L360-L376

payu will let ESM try to run without a prior_restart_dir in the following two cases:

Trying to run in these situations, the model crashes because of the missing restarts. Are there any niche ESM situations where either of the above two situations need to be supported/allowed? Or is it safe to assert in the access.py driver that a prior_restart_path must exist?


2. Currently two separate timing calculations occur in different parts of payu: one in the access.py driver and the second (mostly but not completely ignored) one in the cice.py driver. Details about the differences between the two are discussed here. I think it would be good to simplify this so that for ESM1.5 experiments, only one calculation occurs using the artificial restart date file.

Would we prefer this calculation to occur in the access.py driver (where the main calculation currently is), or is it better to move it into the cice.py driver?

I suspect we need to keep the second calculation in the cice.py driver in case of standalone cice simulations (I'm not 100% on this – is this something that the cice driver is set up to do?), and so we'll need to add in some logic about which calculation to use in the different situations.


3. In the standalone cice case, where timing calculations are completely controlled by the cice.py driver: do we want to refactor this to also use a new cice.res.yaml file for the start date?

I think this would be nice to do and would help make everything simpler, however may involve some work in understanding the different situations that the current calculation and logic paths are set up for, as it would be bad to accidentally break any of them...


4. Consistency checks. I think the easiest way to check consistency between the different model start dates would be to compare each of their restart date text/yaml files (um.res.yaml, <NEW>cice.res.yaml, ocean_solo.res). Eventually we might want to add internal consistency checks for cice. The binary iced.YYYYMMDD restart files contain a timestamp, and if this doesn't match the time set in the model namelists, cice will still run but with the time all messed up.

Similar internal consistency checks might also be useful for the UM, as the actual restart dump restart_dump.astart contains date information as well as the um.res.yaml calendar file. I haven't tested what happens if they don't match yet.


5. Caltype. If we swap to an artificial restart date text or yaml file, the ice model could still get the timing wrong if the caltype setting in input_ice.nml somehow got changed. I think this is unlikely, but we could guard against it by requiring in the access.py driver that it is Gregorian for ESM1.5 experiments. Or would that be too inflexible, and possibly impact future configurations?
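Such a guard could be as small as the sketch below. The function name `require_gregorian` is hypothetical, the dictionary stands in for the `&coupling` namelist group, and treating caltype=1 as Gregorian is an assumption based on the example namelist above.

```python
GREGORIAN = 1  # assumed value; matches caltype=1 in the example namelist


def require_gregorian(coupling_nml):
    """Raise ValueError unless the &coupling group requests a Gregorian calendar."""
    caltype = coupling_nml.get("caltype")
    if caltype != GREGORIAN:
        raise ValueError(
            f"Expected caltype={GREGORIAN} (Gregorian), found caltype={caltype}"
        )


require_gregorian({"caltype": 1})  # passes silently
```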


6. If we set ice model start date using a text file in the restart directory, it looks like there are some specific situations where payu will overwrite the prior_restart_path, and hence the timing calculations could be affected.

In the cice.py driver, the cice model's prior_restart_path is overwritten if the restart_dir setting in <control-directory>/ice/cice_in.nml happens to be absolute, or if it's not absolute, the restart_dir setting joined with any of the cice input paths is a real directory (I think).

https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/cice.py#L132-L146

If this happened, the timing calculation might be impacted. Since the current calculations also use data in the restart directory, I don't think this would be a new issue though.
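My reading of that logic, written out as a sketch rather than copied from cice.py, is roughly the following. The function and argument names are hypothetical:

```python
import os


def resolve_prior_restart_path(restart_dir, input_paths, prior_restart_path):
    """Return the restart path implied by the description above.

    Hypothetical helper: an absolute restart_dir wins; otherwise the first
    input path under which restart_dir exists as a directory wins; otherwise
    the original prior_restart_path is kept.
    """
    if os.path.isabs(restart_dir):
        return restart_dir
    for input_path in input_paths:
        candidate = os.path.join(input_path, restart_dir)
        if os.path.isdir(candidate):
            return candidate
    return prior_restart_path
```

Any date file read from the restart directory would need to be read after this resolution step, so that the date and the binary restarts come from the same place.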


7. This change will be incompatible with the warm-start.sh scripts people use to modify experiment start dates. Are we hoping to release an NRI supported version of these scripts eventually?

blimlim commented 1 month ago

A much simpler suggestion from @anton-seaice: Rather than reading the init_date from <control-dir>/ice/input_ice.nml, make sure this field is saved to <restart-dir>/ice/input_ice.nml. This would prevent the mixing of information from the control and restart directories. I think it would also be easier to implement – it doesn't create any new artificial files and should (I think) be able to work with most of the existing timing calculations in the drivers... hopefully reducing the risk of messing up anything that's already there.

I'll have a go at implementing this and seeing how it goes.

anton-seaice commented 1 month ago

I wondered if it will work ok with the existing warm-start scripts too?

I believe <restart-dir>/ice/input_ice.nml is only used by payu as a reference for changing a couple of fields in <control-dir>/ice/input_ice.nml? So adding a field here doesn't affect anything except how payu handles it. Ideally the models would just read this date from the restart files rather than needing an extra configuration item.

blimlim commented 1 month ago

I've started putting this together in this feature branch (https://github.com/ACCESS-NRI/payu/tree/466-esm1p5-cice-startdate-fix) and will do pull requests to that in smaller steps.

The first step has been to rename the <restart_dir>/ice/input_ice.nml file to <restart_dir>/ice/restart_date.nml (happy to use other names), and move the experiment initialisation date init_date setting to it:

<restart_dir>/ice/restart_date.nml
----------------------------------
&coupling
init_date=10101
runtime0=3155673600
runtime=0
/

I've then modified payu to read init_date from this file rather than <work_dir>/ice/input_ice.nml, so that it gets all the information for setting the start date from the restart directory. All the calculations in payu otherwise remain exactly the same.

This method seems to work and has a couple of benefits:

A rough version of this option is available here: https://github.com/ACCESS-NRI/payu/tree/466b-cice-move-init_date


An alternative is to modify the setup so that <restart_dir>/ice/restart_date.nml contains

&coupling
init_date=10101
inidate=1010101
/

And payu can just read the new start date from inidate. This makes it more transparent what the start date for the new simulation actually is, and makes it easier to set up new restart directories. The calculation of runtime0 for use with the next simulation (i.e. the time between the init_date and inidate) can then be handled completely internally by payu.
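For illustration, the internal runtime0 calculation could be as simple as the sketch below. The helper names are hypothetical and a proleptic Gregorian calendar is assumed; a no-leap caltype would need different day counting.

```python
from datetime import date


def parse_yyyymmdd(value):
    """Convert an integer such as 1010101 into a date (0101-01-01)."""
    year, month_day = divmod(value, 10000)
    month, day = divmod(month_day, 100)
    return date(year, month, day)


def runtime0_seconds(init_date, inidate):
    """Seconds between the experiment initialisation date and the run start date."""
    return (parse_yyyymmdd(inidate) - parse_yyyymmdd(init_date)).days * 86400


print(runtime0_seconds(10101, 1010101))  # 3155673600
```

The result matches the runtime0=3155673600 value stored in the pre-industrial restart namelist.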

I think a couple of downsides to this approach are:

A rough version of this option is available here: https://github.com/ACCESS-NRI/payu/tree/466a-cice-start-noruntime0


It would be great to hear what everyone thinks of these approaches, and whether you have a preference between the two.

In either case, I think the next steps would be to pull the code which reads/calculates the start date into a separate method which could then be used to check whether the start dates are consistent between the submodels.

blimlim commented 4 days ago

Closing following #484, and further work to be done in #495