payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Automated Configuration of "skip_restart_read" Flag for ACCESS-OM3 #349

Closed ezhilsabareesh8 closed 11 months ago

ezhilsabareesh8 commented 11 months ago

Currently, the MOM6-CICE6 configuration in ACCESS-OM3 requires manually modifying the skip_restart_read flag in the datm_in and drof_in files before running the model. For the first run the flag must be set to skip_restart_read = .true., and for subsequent runs it must be changed to skip_restart_read = .false.. Payu should detect the first run automatically and set skip_restart_read = .true. in both datm_in and drof_in, then automatically switch the flag to skip_restart_read = .false. for all subsequent runs.
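A minimal sketch of the requested flag rewrite, for illustration only. It uses just the Python standard library and edits the namelist text directly. The function name set_skip_restart_read and the sample namelist contents are assumptions, not payu code, and a real driver would more likely go through a namelist parser.

```python
import re

# Matches an assignment such as "skip_restart_read = .false." at the start of
# a line. Only the ".true."/".false." spellings are handled; other Fortran
# logical spellings (e.g. "T", ".t.") are not covered by this sketch.
_FLAG_RE = re.compile(
    r"^(\s*skip_restart_read\s*=\s*)\.(?:true|false)\.",
    re.IGNORECASE | re.MULTILINE,
)


def set_skip_restart_read(namelist_text: str, first_run: bool) -> str:
    """Return namelist_text with skip_restart_read set for this run.

    first_run=True  -> skip_restart_read = .true.
    first_run=False -> skip_restart_read = .false.
    """
    value = ".true." if first_run else ".false."
    new_text, count = _FLAG_RE.subn(lambda m: m.group(1) + value, namelist_text)
    if count == 0:
        raise ValueError("skip_restart_read not found in namelist text")
    return new_text


# Illustrative namelist fragment (hypothetical contents, not a real datm_in).
sample = "&datm_nml\n  skip_restart_read = .false.\n/\n"
first = set_skip_restart_read(sample, first_run=True)
later = set_skip_restart_read(first, first_run=False)
```

The same function would be applied to the text of both datm_in and drof_in. How the first run is detected is left to the caller.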

aidanheerdegen commented 11 months ago

The cice model is potentially a useful model for how to do this.

This is where it is actually set:

https://github.com/payu-org/payu/blob/master/payu/models/cice.py#L190-L192

and it depends on self.prior_restart_path being defined, which is done in this block:

https://github.com/payu-org/payu/blob/aacfd92570d6f33487ec4de47f9ad8b7a7fa8f12/payu/experiment.py#L348-L364

Note that cice relies on the relevant namelist options (runtype and restart) being correctly set for a non-restart run at the beginning. There is nothing stopping the ACCESS-OM3 driver from enforcing this and setting these values correctly for a non-restart run, and I would probably recommend this.
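As a rough sketch of that pattern (not the actual cice or ACCESS-OM3 driver code), the decision can be keyed on whether a prior restart path is available. The helper name data_model_overrides and the returned dictionary layout are hypothetical. Only the idea of testing prior_restart_path comes from the discussion above.

```python
from pathlib import Path
from typing import Optional


def data_model_overrides(prior_restart_path: Optional[str]) -> dict:
    """Choose namelist overrides from the presence of a prior restart.

    Hypothetical helper: treats the run as a first (non-restart) run when
    no prior restart path is given or the path is not an existing directory.
    """
    first_run = (
        prior_restart_path is None
        or not Path(prior_restart_path).is_dir()
    )
    # On a first run there is nothing to read, so the read is skipped.
    return {"skip_restart_read": first_run}


# With no prior restart, the sketch enforces the non-restart setting.
overrides = data_model_overrides(None)
```

Enforcing the value in both cases, rather than relying on whatever the user left in the file, is what makes the configuration safe to start from any state.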

dougiesquire commented 11 months ago

Sorry @ezhilsabareesh8, I missed this issue. Are the skip_restart_read flags ever actually used for the CDEPS data modes we use in ACCESS-OM3? I'm confused about why restarts would be read when all our data components do is read data from a file and pass it to the mediator. Am I being dense? Is it needed for the time interpolation?

ezhilsabareesh8 commented 11 months ago

@dougiesquire skip_restart_read is used in the CDEPS datm NUOPC cap (atm_comp_nuopc.f90) as part of time initialisation.

dougiesquire commented 11 months ago

Yes, looks like it's used for the time interpolation. I'm trying to understand why things are currently working. It looks like the restart file is only read by CDEPS if a restart file exists, which maybe explains it, but it's not immediately clear to me why it doesn't fail on shr_sys_abort calls here and here. Will be easier to debug from the logs once Gadi is back online.

Regardless, I agree that Payu should explicitly set skip_restart_read. There are possibly also some other restart flags that should be set explicitly.

dougiesquire commented 11 months ago

Aha, a little clarity. From https://escomp.github.io/CDEPS/versions/master/html/design_details.html#restart-files:

In most cases, no restart file is required for the data models to restart exactly. This is because there is no memory between timesteps in many of the data model science modes. If a restart file is required, it will be written automatically and then must be used to continue the previous run.

There are separate stream restart files that only exist for performance reasons. A stream restart file contains information about the time axis of the input streams. This information helps reduce the startup costs associated with reading the input dataset time axis information. If a stream restart file is missing, the code will restart without it but may need to reread data from the input data files that would have been stored in the stream restart file. This will take extra time but will not impact the results.

dougiesquire commented 11 months ago

Okay, so I've changed my mind. I actually don't think we need to explicitly set skip_restart_read and it is correct to leave it as skip_restart_read=.false..

@ezhilsabareesh8 can you check that this makes sense and that you agree?

ezhilsabareesh8 commented 11 months ago

which maybe explains it, but it's not immediately clear to me why it doesn't fail on shr_sys_abort calls here and here.

Thanks @dougiesquire. When I tried to run with skip_restart_read = .false. for the first run, the code looked for the restart files and aborted via a shr_sys_abort call with the message "ERROR reading in nfiles".

Indeed, it is true that if restart_read = .true. and skip_restart_read = .false., the restart files are not required, as described here. However, the error might be related to payu automatically resetting restart_read based on this condition, while skip_restart_read is explicitly defined as .false. in datm_in and drof_in. To gain more insight into the issue, we can re-run the configuration once Gadi is back and observe how the payu driver sets start_type="continue".

dougiesquire commented 11 months ago

@ezhilsabareesh8 I am unable to reproduce your issue, even using your configuration on Gadi. Would you be able to point me somewhere that produces the issue?

ezhilsabareesh8 commented 11 months ago

Thanks @dougiesquire. My apologies, but I couldn't reproduce the error I encountered earlier when I set skip_restart_read = .false. for the first run. The code executed successfully without any issues this time. It seems that the problem might have been caused by a different factor. If I face a similar situation again in the future, it might be worth checking this issue again.

aidanheerdegen commented 11 months ago

Ok. We'll close this, but re-open if you find the same thing happening again @ezhilsabareesh8.