payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Excess ice restart files in ESM1.5 simulations #471

Closed blimlim closed 2 months ago

blimlim commented 2 months ago

In ESM1.5 simulations, cice currently produces an iced restart file every month which then gets copied over to the archive.

restart000/ice/
iced.01010301  iced.01010501  iced.01010701  iced.01010901  iced.01011101  iced.01020101     ice.restart_file-01001231  mice.nc
iced.01010201  iced.01010401  iced.01010601  iced.01010801  iced.01011001  iced.01011201  ice.restart_file  mice.nc-01001231

Only the latest one is useable, as the atmosphere and ocean only keep their restarts from the end of a run.

For the UM, the um.py driver currently culls its monthly restarts during the archive step https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/um.py#L92-L107

It looks like something similar is included in cice.py:

https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/cice.py#L289-L301

however, this part evidently isn't running for ESM1.5 simulations. It would be good to get this working for ESM1.5, but I'm finding the logic that sets up the self.split_paths condition a bit difficult to understand. I'm concerned about making changes and inadvertently breaking other configurations. Does anyone have any knowledge or ideas about the safest way to implement this?

blimlim commented 2 months ago

A few more details on the existing restart deletion:

The iced.YYYYMMDD file deletion under the if not self.split_paths: is being called during ESM1.5 simulations. What I didn't notice before is that it's collecting files to delete using get_prior_restart_files(). I.e. it's deleting restart files produced by the previous run. I think cice in ESM1.5 copies the previous run's restart files to self.work_restart_path, but also writes the new restart files during the simulation to the same directory, and so the previous run's restart files need to be deleted from this directory before archiving.

I suspect we might be able to delete both the previous run's restart files plus the current run's excess monthly iced... restart files by replacing https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/cice.py#L297-L301

with something like

for f in os.listdir(self.restart_path):
    # Remove every iced.* file except the restart named by res_name.
    if f.startswith('iced.') and f != res_name:
        os.remove(os.path.join(self.restart_path, f))
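To try the same idea outside payu, a self-contained sketch might look like the following. The function name `prune_iced_restarts` and its `pointer_name` default are hypothetical, and it assumes the pointer file holds the name (or a path ending in the name) of the latest restart on its first line, as in the ice.restart_file examples in this thread.

```python
import os


def prune_iced_restarts(restart_dir, pointer_name="ice.restart_file"):
    """Delete every iced.* file in restart_dir except the one the pointer names.

    Hypothetical helper, not part of payu. Returns the sorted removed names.
    """
    pointer_path = os.path.join(restart_dir, pointer_name)
    with open(pointer_path) as pointer:
        # The pointer may hold a bare name or a path; keep only the file name.
        res_name = os.path.basename(pointer.readline().strip())

    if not res_name or not os.path.isfile(os.path.join(restart_dir, res_name)):
        # Refuse to delete anything if the pointer target is missing.
        raise FileNotFoundError(
            f"restart named by {pointer_path} not found: {res_name!r}"
        )

    removed = []
    for fname in sorted(os.listdir(restart_dir)):
        if fname.startswith("iced.") and fname != res_name:
            os.remove(os.path.join(restart_dir, fname))
            removed.append(fname)
    return removed
```

Checking that the pointer target exists before deleting anything is a deliberate choice here: a stale or empty pointer then raises an error instead of wiping every restart.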

I believe the if not self.split_paths condition will always be true for ESM1.5 simulations, and so if we modify this part of the code, it should always be called. The self.split_paths condition originates from

https://github.com/payu-org/payu/blob/89d70cf74655b6870505340b51124e7836284f9b/payu/models/cice.py#L65-L70

The cice_in.nml namelist file in our configurations doesn't contain an input_dir field, meaning payu sets init_path equal to res_path. Following along the rest of the logic, this results in self.split_paths being false.
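As a rough paraphrase of that logic (not payu's actual code), the decision can be sketched as below. The function name `resolve_cice_paths` and the plain-dict stand-in for the namelist group are hypothetical; the sketch only assumes that the restart directory comes from a `restart_dir` entry and that a missing `input_dir` falls back to it.

```python
import os


def resolve_cice_paths(work_path, setup_nml):
    """Return (init_path, res_path, split_paths) for a setup_nml-like dict.

    Hypothetical paraphrase of the logic described above, not payu's code.
    """
    res_path = os.path.normpath(os.path.join(work_path, setup_nml["restart_dir"]))
    input_dir = setup_nml.get("input_dir")
    if input_dir is None:
        # No input_dir in the namelist: initial conditions share the restart path.
        init_path = res_path
    else:
        init_path = os.path.normpath(os.path.join(work_path, input_dir))
    # The paths are "split" only when the two directories differ.
    split_paths = init_path != res_path
    return init_path, res_path, split_paths
```

With a namelist that only sets `restart_dir`, this returns identical paths and `split_paths` is false, which matches the behaviour described for the ESM1.5 configurations.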

I think it should be reasonably safe to add in this change. The reason I'm not 100% sure though is that it looks like something else is influencing the copying/deletion of files in the restart directory.

If we start a run with the following files in restart000/ice:

cice_in.nml  (namelist always there for timing)
input_ice.nml (namelist always there for timing)

ice.restart_file (The real restart pointer text file. Contains text: iced.01010201)
iced.01010201 (The real binary restart file)

mice.nc (The real mice.nc file)
ice.restart_file-01001231 (The restart pointer text file from the initial run. Contains text: iced.01010101)
mice.nc-01001231  (The mice restart file from the initial run)

ice.restart_file-fakeabc  (A fake pointer text file. Contains text:  iced.fakeabc)
mice.nc-fakeabc (A fake mice.nc file - empty file)
iced.fakeabc (A fake ice restart file - empty file)

ice.restart_file-123  (A fake pointer text file which is empty)   
mice.nc-123 (A fake mice.nc file - empty file)

ice.restart_file-01234567 (A fake pointer text file. Contains text: iced.01234567)
mice.nc-01234567 (A fake mice.nc file - empty file)

Then the work/ice/RESTART directory contains the following during the simulation:

cice_in.nml   
input_ice.nml

ice.restart_file 
mice.nc
iced.01010201 

ice.restart_file-01001231  
mice.nc-01001231  

ice.restart_file-fakeabc 
mice.nc-fakeabc
iced.fakeabc

ice.restart_file-123              
mice.nc-123       

ice.restart_file-01234567  
mice.nc-01234567  

o2i.nc

And at the end of the run, the new restart directory restart001/ice contains

cice_in.nml  
input_ice.nml  

ice.restart_file
iced.01010301  
mice.nc  

ice.restart_file-01001231  
mice.nc-01001231

ice.restart_file-fakeabc  

ice.restart_file-01234567  

I'm struggling to work out how this happens/what sort of logic can produce these results. I haven't been able to find anything in payu which would control the copying/deletion of these files. If I also add in a fake iced file named iced.fakeabc, the model tries to read it and crashes with the error

forrtl: severe (24): end-of-file during read, unit 15, file /scratch/tm70/sw6175/access-esm/work/more-restart-tests-more-restart-tests-6142a282/ice/./RESTART/./iced.fakeabc

Could some of the file deletion be occurring in the cice model itself?
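One way to make the pattern easier to see is to diff the listings mechanically. The sketch below copies the file names from the work/ice/RESTART and restart001/ice listings above, along with the files described as empty in the restart000/ice listing. It reports what was dropped and what was created; it does not identify which piece of code is responsible.

```python
# File names copied from the work/ice/RESTART listing above.
work_restart = {
    "cice_in.nml", "input_ice.nml",
    "ice.restart_file", "mice.nc", "iced.01010201",
    "ice.restart_file-01001231", "mice.nc-01001231",
    "ice.restart_file-fakeabc", "mice.nc-fakeabc", "iced.fakeabc",
    "ice.restart_file-123", "mice.nc-123",
    "ice.restart_file-01234567", "mice.nc-01234567",
    "o2i.nc",
}

# File names copied from the restart001/ice listing above.
archived = {
    "cice_in.nml", "input_ice.nml",
    "ice.restart_file", "iced.01010301", "mice.nc",
    "ice.restart_file-01001231", "mice.nc-01001231",
    "ice.restart_file-fakeabc",
    "ice.restart_file-01234567",
}

# Files the restart000/ice listing describes as empty.
empty_files = {
    "mice.nc-fakeabc", "iced.fakeabc",
    "ice.restart_file-123", "mice.nc-123", "mice.nc-01234567",
}

dropped = work_restart - archived          # present during the run, not archived
created = archived - work_restart          # written by the run itself
dropped_non_empty = dropped - empty_files  # dropped files not described as empty

print("dropped:", sorted(dropped))
print("created:", sorted(created))
print("dropped but not empty:", sorted(dropped_non_empty))
```

With these listings, every file described as empty is among the dropped ones, and the only other dropped files are iced.01010201 and o2i.nc. That is an observation from the listings, not a confirmed cause, but it may be worth checking whether something removes zero-length files during archiving.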

anton-seaice commented 2 months ago

Did we consider configuring CICE to produce fewer restarts? There is a dump_last option which we could set to true, and then it would only write a restart at the end of the run. If we want to write these extra restarts so we can restart more easily after a crash, then what you suggest for deleting them looks good. It's pretty low risk ... there is a different CICE5 driver, and OM3 (i.e. CICE6) is built differently so it uses the cesm_cmeps driver.

blimlim commented 2 months ago

It would be great to produce just the single restart at the end of the run. I think people usually run ESM1.5 in one year segments and so I don't think it would be a problem to just keep a single restart at the end of each run. I haven't been able to find a dump_last option in the CICE4 repo https://github.com/ACCESS-NRI/cice4/blob/access-esm1.5/source/ice_init.F90, I'm wondering if that was another improvement added to CICE5?

I guess another option if we want to avoid payu changes would be to swap it from writing monthly restarts to yearly ones, though that would prevent running in monthly segments – I'm not sure how many people do that, but I sometimes do for testing things out.

anton-seaice commented 2 months ago

Apologies - maybe leaving it on monthly is a good idea then, and let's go with the change in payu?

blimlim commented 2 months ago

No worries, it would have been perfect if the dump_last option was available! I'll test out these changes to payu.

anton-seaice commented 2 months ago

If it's a priority to do "something" for esm1.5, then the fastest option is just to set the config to yearly restarts and document that the run length needs changing in two places. Let's do the payu change, but we don't want it to hold up the esm1.5 release.

aidanheerdegen commented 2 months ago

Would it be simpler to specify different INPUT and RESTART directories, i.e. use split_paths? If so, is that a use case we want to support?

Update: Seems I added it to cice5 so it isn't available for ACCESS-ESM1.5, but might be used in OM2, so be mindful of that when making changes.

https://github.com/ACCESS-NRI/cice5/commit/465494bb551ec15ec3ec82308359e6c7d3ae28a5

blimlim commented 2 months ago

Ah ok, I hadn't realised the cice.py driver was running with CICE5, though that makes sense. In that case would it be safest to run the extra deletion through the access.py driver, so that there's no risk of impacting CICE5/OM2?

aidanheerdegen commented 2 months ago

> In that case would it be safest to run the extra deletion through the access.py driver, so that there's no risk of impacting CICE5/OM2?

I suppose so. Or we could make a CICE4 driver if this is a CICE4 issue with not being able to sanely reduce the number of restarts.

blimlim commented 2 months ago

I have a prototype of this running from the access.py driver in this branch. I can add a pull request for it if this seems like a reasonable approach.

anton-seaice commented 2 months ago

> I suppose so. Or we could make a CICE4 driver if this is a CICE4 issue with not being able to sanely reduce the number of restarts.

We can set dumpfreq in cice_in.nml to reduce the number of restarts produced.
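As an illustration only, the namelist change could be scripted along these lines. The helper name `set_dumpfreq` is hypothetical. The sketch assumes the `dumpfreq` entry sits on its own line with a quoted value, and that `'y'` selects yearly dumps in the CICE version in use, which should be checked against that version's documentation. A proper namelist parser such as f90nml would be more robust than a regular expression.

```python
import re

# Matches a line like `dumpfreq = 'm'` but not `dumpfreq_n = 1`.
_DUMPFREQ = re.compile(r"""(?im)^([ \t]*dumpfreq[ \t]*=[ \t]*)(['"])[^'"\n]*\2""")


def set_dumpfreq(namelist_text, freq):
    """Return namelist_text with the dumpfreq value replaced by freq.

    Hypothetical helper; raises unless exactly one dumpfreq entry is found.
    """
    def _swap(match):
        # Keep the original spacing and quote style, change only the value.
        return f"{match.group(1)}{match.group(2)}{freq}{match.group(2)}"

    new_text, count = _DUMPFREQ.subn(_swap, namelist_text)
    if count != 1:
        raise ValueError(f"expected one dumpfreq entry, found {count}")
    return new_text


example = (
    "&setup_nml\n"
    "    restart_dir = './RESTART/'\n"
    "    dumpfreq = 'm'\n"
    "    dumpfreq_n = 1\n"
    "/\n"
)
print(set_dumpfreq(example, "y"))
```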

Sorry - one more question @blimlim about changing payu - what happens if the user does want the extra restarts? How do we allow that?

blimlim commented 2 months ago

> Sorry - one more question @blimlim about changing payu - what happens if the user does want the extra restarts? How do we allow that?

Good question. I'm wondering whether the extra ice restarts would ever be usable, because there won't be any corresponding atmosphere or ocean restarts?

anton-seaice commented 2 months ago

> > Sorry - one more question @blimlim about changing payu - what happens if the user does want the extra restarts? How do we allow that?
>
> Good question. I'm wondering whether the extra ice restarts would ever be usable, because there won't be any corresponding atmosphere or ocean restarts?

Presumably they could be turned on at a matching frequency?

anton-seaice commented 2 months ago

I will make an issue to reduce the cice restart output frequency in https://github.com/ACCESS-NRI/access-esm1.5-configs

blimlim commented 2 months ago

> I will make an issue to reduce the cice restart output frequency in https://github.com/ACCESS-NRI/access-esm1.5-configs

That sounds good! I'll close this issue as we'll be adjusting the frequency via the CICE namelists rather than changing payu.