Closed blimlim closed 3 months ago
A few more details on the existing restart deletion:
The `iced.YYYYMMDD` file deletion under the `if not self.split_paths:` branch is being called during ESM1.5 simulations. What I didn't notice before is that it's collecting files to delete using `get_prior_restart_files()`, i.e. it's deleting restart files produced by the previous run. I think cice in ESM1.5 copies the previous run's restart files to `self.work_restart_path`, but also writes the new restart files during the simulation to the same directory, and so the previous run's restart files need to be deleted from this directory before archiving.
I suspect we might be able to delete both the previous run's restart files and the current run's excess monthly `iced...` restart files by replacing
https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/cice.py#L297-L301
with something like
    for f in os.listdir(self.restart_path):
        if f.startswith('iced.'):
            if f == res_name:
                continue
            os.remove(os.path.join(self.restart_path, f))
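As a rough, standalone illustration of that loop (not the payu implementation), the sketch below runs the same filter against a throwaway directory. The helper name `prune_iced_restarts` and the file names are made up for the demonstration; `res_name` stands in for the name of the restart file that should be kept.

```python
import os
import tempfile


def prune_iced_restarts(restart_path, res_name):
    """Hypothetical helper: delete every iced.* file except res_name."""
    removed = []
    for f in sorted(os.listdir(restart_path)):
        # Only touch CICE restart dumps; leave pointer files, mice.nc, etc.
        if f.startswith('iced.') and f != res_name:
            os.remove(os.path.join(restart_path, f))
            removed.append(f)
    return removed


# Demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('iced.01010101', 'iced.01010201', 'iced.01010301',
                 'ice.restart_file', 'mice.nc'):
        open(os.path.join(tmp, name), 'w').close()

    removed = prune_iced_restarts(tmp, res_name='iced.01010301')
    remaining = sorted(os.listdir(tmp))
```

Because the filter only matches the `iced.` prefix, the pointer file and `mice.nc` are left alone, and both the previous run's dump and the intermediate monthly dumps are removed in one pass.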
I believe the `if not self.split_paths` condition will always be true for ESM1.5 simulations, so if we modify this part of the code, it should always be called. This follows from how `self.split_paths` is set: the `cice_in.nml` namelist file in our configurations doesn't contain an `input_dir` field, meaning payu sets the `init_path` equal to the `res_path`. Following along the rest of the logic, this results in `self.split_paths` being false.
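A simplified sketch of that reasoning is below. It only paraphrases the description above and is not payu's actual code: the function name `resolve_split_paths`, the dictionary lookup, and the example paths are all assumptions made for illustration.

```python
import os


def resolve_split_paths(setup_nml, res_path):
    """Hypothetical paraphrase: with no input_dir, init_path falls back to res_path."""
    # Assumption: a missing input_dir means the init path equals the restart path.
    init_path = os.path.normpath(setup_nml.get('input_dir', res_path))
    res_path = os.path.normpath(res_path)
    # Paths are "split" only when input and restart directories differ.
    return init_path != res_path


# No input_dir in the namelist, as described for the ESM1.5 configurations.
esm_like = resolve_split_paths({}, './RESTART')

# A configuration that does set a separate input directory.
split_like = resolve_split_paths({'input_dir': './INPUT'}, './RESTART')
```

Under these assumptions `esm_like` is `False`, so an `if not self.split_paths:` branch would always be taken for such a configuration.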
I think it should be reasonably safe to add in this change. The reason I'm not 100% sure though is that it looks like something else is influencing the copying/deletion of files in the restart directory.
If we start a run with the following files in `restart000/ice`:
cice_in.nml (namelist always there for timing)
input_ice.nml (namelist always there for timing)
ice.restart_file (The real restart pointer text file. Contains text: iced.01010201)
iced.01010201 (The real binary restart file)
mice.nc (The real mice.nc file)
ice.restart_file-01001231 (The restart pointer text file from the initial run. Contains text: iced.01010101)
mice.nc-01001231 (The mice restart file from the initial run)
ice.restart_file-fakeabc (A fake pointer text file. Contains text: iced.fakeabc)
mice.nc-fakeabc (A fake mice.nc file - empty file)
iced.fakeabc (A fake ice restart file - empty file)
ice.restart_file-123 (A fake pointer text file which is empty)
mice.nc-123 (A fake mice.nc file - empty file)
ice.restart_file-01234567 (A fake pointer text file. Contains text: iced.01234567)
mice.nc-01234567 (A fake mice.nc file - empty file)
Then the `work/ice/RESTART` directory contains the following during the simulation:
cice_in.nml
input_ice.nml
ice.restart_file
mice.nc
iced.01010201
ice.restart_file-01001231
mice.nc-01001231
ice.restart_file-fakeabc
mice.nc-fakeabc
iced.fakeabc
ice.restart_file-123
mice.nc-123
ice.restart_file-01234567
mice.nc-01234567
o2i.nc
And at the end of the run, the new restart directory `restart001/ice` contains:
cice_in.nml
input_ice.nml
ice.restart_file
iced.01010301
mice.nc
ice.restart_file-01001231
mice.nc-01001231
ice.restart_file-fakeabc
ice.restart_file-01234567
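To make the difference between the two listings explicit, here is a small set comparison using exactly the file names listed above. It only restates the observations and does not explain what removes the files.

```python
# Files seen in work/ice/RESTART during the simulation (from the listing above).
work_restart = {
    'cice_in.nml', 'input_ice.nml', 'ice.restart_file', 'mice.nc',
    'iced.01010201', 'ice.restart_file-01001231', 'mice.nc-01001231',
    'ice.restart_file-fakeabc', 'mice.nc-fakeabc', 'iced.fakeabc',
    'ice.restart_file-123', 'mice.nc-123', 'ice.restart_file-01234567',
    'mice.nc-01234567', 'o2i.nc',
}

# Files present in restart001/ice at the end of the run (from the listing above).
archived = {
    'cice_in.nml', 'input_ice.nml', 'ice.restart_file', 'iced.01010301',
    'mice.nc', 'ice.restart_file-01001231', 'mice.nc-01001231',
    'ice.restart_file-fakeabc', 'ice.restart_file-01234567',
}

# Present during the run but missing from the new restart directory.
dropped = work_restart - archived

# Present in the new restart directory but not seen during the run.
added = archived - work_restart

print('dropped:', sorted(dropped))
print('added:', sorted(added))
```

The comparison shows that the fake `mice.nc-*` files and `ice.restart_file-123` disappear, while the `ice.restart_file-fakeabc` and `ice.restart_file-01234567` pointer files survive.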
I'm struggling to work out how this happens/what sort of logic can produce these results. I haven't been able to find anything in payu which would control the copying/deletion of these files. If I also add in a fake `iced` file named `iced.fakeabc`, the model tries to read it and crashes with the error:
forrtl: severe (24): end-of-file during read, unit 15, file /scratch/tm70/sw6175/access-esm/work/more-restart-tests-more-restart-tests-6142a282/ice/./RESTART/./iced.fakeabc
Could some of the file deletion be occurring in the cice model itself?
Did we consider configuring CICE to produce fewer restarts? There is a `dump_last` option which we could set to true, and then it would only write a restart at the end of the run. If we want to write these extra restarts so we can restart more easily in case of a crash, then what you suggest for deleting them looks good. It's pretty low risk ... there is a different CICE5 driver and OM3 (i.e. CICE6) is built differently so it uses the cesm_cmeps driver.
It would be great to produce just the single restart at the end of the run. I think people usually run ESM1.5 in one-year segments, so I don't think it would be a problem to just keep a single restart at the end of each run. I haven't been able to find a `dump_last` option in the CICE4 repo https://github.com/ACCESS-NRI/cice4/blob/access-esm1.5/source/ice_init.F90, so I'm wondering if that was another improvement added to CICE5?
I guess another option if we want to avoid payu changes would be to swap it from writing monthly restarts to yearly ones, though that would prevent running in monthly segments – I'm not sure how many people do that, but I sometimes do for testing things out.
Apologies - maybe leaving it on monthly is a good idea then, and let's go with the change in payu?
No worries, it would have been perfect if the dump_last option was available! I'll test out these changes to payu.
If it's a priority to do "something" for ESM1.5, then the fastest option is just to set the config to yearly restarts and document that the run length needs changing in two places. Let's do the payu change, but we don't want it to hold up the ESM1.5 release.
Would it be simpler to specify different `INPUT` and `RESTART` directories, i.e. use `split_paths`? If so, is that a use-case we want to support?
Update: Seems I added it to `cice5`, so it isn't available for ACCESS-ESM1.5, but might be used in OM2, so be mindful of that when making changes.
https://github.com/ACCESS-NRI/cice5/commit/465494bb551ec15ec3ec82308359e6c7d3ae28a5
Ah ok, I hadn't realised the `cice.py` driver was running with CICE5, though that makes sense. In that case would it be safest to run the extra deletion through the `access.py` driver, so that there's no risk of impacting CICE5/OM2?
> In that case would it be safest to run the extra deletion through the `access.py` driver, so that there's no risk of impacting CICE5/OM2?
I suppose so. Or we could make a CICE4 driver if this is a CICE4 issue with not being able to sanely reduce the number of restarts.
I have a prototype of this running from the `access.py` driver in this branch. I can add a pull request for it if this seems like a reasonable approach.
> I suppose so. Or we could make a CICE4 driver if this is a CICE4 issue with not being able to sanely reduce the number of restarts.
We can set `dumpfreq` in `cice_in.nml` to reduce the number of restarts produced.
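As a hedged sketch of what such a change could look like, the snippet below rewrites the `dumpfreq` value in a piece of namelist text. The example namelist contents, including the `dumpfreq_n` entry and the choice of `'m'` and `'y'` for monthly and yearly dumps, are assumptions for illustration and not copied from the ESM1.5 configuration; the helper `set_dumpfreq` is hypothetical, and a real workflow would more likely edit the file directly or use a namelist parser.

```python
import re


def set_dumpfreq(namelist_text, freq):
    """Hypothetical helper: replace the quoted dumpfreq value in namelist text."""
    # Match "dumpfreq = '<value>'" at the start of a line, but not dumpfreq_n.
    pattern = re.compile(r"(^\s*dumpfreq\s*=\s*)'[^']*'", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}'{freq}'", namelist_text
    )
    if count == 0:
        raise ValueError('dumpfreq not found in namelist text')
    return new_text


# Assumed example contents, not the real ESM1.5 cice_in.nml.
example = """&setup_nml
    dumpfreq = 'm'
    dumpfreq_n = 1
/
"""

patched = set_dumpfreq(example, 'y')
print(patched)
```

As noted earlier in the thread, switching to yearly dumps would prevent running in monthly segments, so the run length and the dump frequency would need to be kept consistent.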
Sorry - one more question @blimlim about changing payu - what happens if the user does want the extra restarts? How do we allow that?
> Sorry - one more question @blimlim about changing payu - what happens if the user does want the extra restarts? How do we allow that?
Good question. I'm wondering whether the extra ice restarts would ever be usable, because there won't be any corresponding atmosphere or ocean restarts?
> > Sorry - one more question @blimlim about changing payu - what happens if the user does want the extra restarts? How do we allow that?
>
> Good question. I'm wondering whether the extra ice restarts would ever be usable, because there won't be any corresponding atmosphere or ocean restarts?

Presumably they could be turned on at matching frequency?
I will make an issue to reduce the cice restart output frequency in https://github.com/ACCESS-NRI/access-esm1.5-configs
> I will make an issue to reduce the cice restart output frequency in https://github.com/ACCESS-NRI/access-esm1.5-configs
That sounds good! I'll close this issue as we'll be adjusting the frequency via the CICE namelists rather than changing payu.
In ESM1.5 simulations, cice currently produces an `iced` restart file every month, which then gets copied over to the archive. Only the latest one is usable, as the atmosphere and ocean only keep their restarts from the end of a run.
For the UM, the `um.py` driver currently culls its monthly restarts during the `archive` step:
https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/um.py#L92-L107
It looks like something similar is included in `cice.py`:
https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/cice.py#L289-L301
however this part mustn't be running for ESM1.5 simulations. It would be good to get this working for ESM1.5, but I'm finding the logic in setting up the `self.split_paths` condition a bit difficult to understand. I'm concerned about making changes but inadvertently breaking other configurations – just wondering if anyone has any knowledge/ideas about what the safest way to implement this would be?