payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
18 stars 25 forks source link

Add support for date-based restart frequency #363

Closed jo-basevi closed 8 months ago

jo-basevi commented 9 months ago

Add support for date-based restart pruning in Payu - should close #358.

As suggested by @aidanheerdegen here, an option is to support a subset of pandas date offset frequency aliases: Unit Description
MS Month start frequency
YS Year start frequency
M Month frequency
Y Year frequency
W Weekly frequency
D Day frequency
H Hour frequency
T Minute frequency
S Second freqeuncy

So for example, if in config.yaml,

restart_freq: 5YS

Then the earliest restart of the year, every 5 years from the first restart will be retained.

During archive, the function Experiment.prune_restarts() would inspect each restart directory and return a list of restart directories that can removed using integer or date-based restart frequency. For date-based frequency, individual restart directory paths are passed to the model specific driver that can parse the restart files and return a cftime (or standard) datetime object.

In this branch, access-om2 has the model.get_restart_datetime() function implemented, which calls the mom's get_restart_datetime() which parses the ocean_solo.res file for the final datetime of the restart. I used this file as it is what is used in COSIMA's tidy_restarts script and it also contains information on what calendar it is using.

Currently the month (M) and year (Y) frequencies are defined similar to adding relativedelta's month and years. So for example if a datetime was 15/01/2000, after adding an offset of '6M' it would be 15/07/2000. I have only just noticed that 'M' in pandas documentation is actually the end of the month.. Would the end of the month/year be more useful? Also, with the different cftime calendars M and Y could give unexpected results i.e with different day months. For example, I've got some logic for cftime's360_day, noleap and all_leap calendars but nothing yet for the julian calendar. Also would the start of month/year (MS/YS) frequencies be sufficient- Is M and Y frequencies even needed?

Week, day, hour, minute frequencies may not be super useful or even necessary but they were easy to add as adding timedeltas is supported with both cftime and standard datetime calendars.

The first restart datetime is used as a point of reference for adding the first frequency interval. Then the next restart with a datetime at or after this checkpoint is kept and then becomes the reference when adding the next time interval. I think using the most recent "kept" datetime as a reference for the next checkpoint could be better than strict regular intervals from the earliest datetime. As the first restart could eventually be lost due to scratch timeouts and using "YS"/"MS" would still keep the earliest restarts in each year/month.

As one of the motivations for date-based restart frequencies was to make syncing restarts to a remote archive easier as payu would know what restarts could be stored permanently. A way to check if a restart would eventually be removed would be to see if the restart directory was in the list returned by prune_restarts(from_n_restart=restart_history - 1, to_n_restart:self.counter) where restart_history is the config.yaml or default value of how many latest restarts are kept.

Any feedback, questions or suggestions is much appreciated!

coveralls commented 9 months ago

Coverage Status

coverage: 45.573% (+2.8%) from 42.772% when pulling bc0aa459665ac5ce8aa79132a5766fb25796b276 on jo-basevi:358-date-based-frequency into cfd694aa4744b1c7578e18a8993650df9bd88bfe on payu-org:master.

pep8speaks commented 9 months ago

Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 209:20: W291 trailing whitespace

Comment last updated at 2023-10-11 00:08:23 UTC
jo-basevi commented 9 months ago

Main changes were removing M and Y offset options which simplified the code quite a bit as all cftime calendars are supported with YS - Year start and MS - Month start (i.e don't have to worry about the end of month logic..).

Other changes were rewriting the calendar tests and added integration-like tests for parsing MOM restart files and the prune_restarts() method in the Experiment class.

Other Notes I removed this from the documentation:
" Intermediate restarts are not deleted until a permanently archived restart has been produced. For example, if we have just completed run 11, then we keep restart004, restart009, restart010, and restart011. Restarts 10 through 13 are not deleted until restart014 has been saved."

Because the logic previously was to not delete the last restart_history number of restarts. Intermediate restarts were not deleted until a permanently saved restart was only automatically the case when restart_freq was 5 or less as the default restart_history was also 5. Or if restart_history was also configured to be greater than or equal to restart_freq. I could change it so restart_history is set to be minimum of restart_freq to guarantee immediate restarts aren't pruned too early for integer restart frequency and add it back into the documentation?

Other things I got caught out on was the difference between 12MS and 1YS, as the logic uses the runtime of earliest restart as a reference (not the model start time). Say for example, there were runs with runtime of 1 month, with a mom model start time of 1900/01/01: (run_date, restart_directory) (1900/02/01, restart000) (1900/03/01, restart001) (1900/04/01, restart002) ...

If using frequency 12MS, the restarts that would be saved are: (1900/02/01, restart000) (1901/02/01, restart012) (1901/02/01, restart024) ...

While 1YS would keep (1900/02/01, restart000) (1901/01/01, restart011) (1901/01/01, restart023) ...

The other thing to keep in mind is that the code currently uses the next kept restart as a reference for the next time interval. This means that restarts saved may not be uniformly a fixed frequency from the first restart if there's irregular time steps.

For restart_freq: 6MS, theres a minimum 6 months between each saved restart. So with using the restart's list from above, the restarts saved would be: (1900/02/01, restart000) (1900/09/01, restart002) (earliest restart after 6 months) (1901/03/01, restart004) (6 months after previous kept restart)

If instead there was a fixed frequency from the first restart, the restarts kept would be: (1900/02/01, restart000) (1900/09/01, restart002) (1901/02/01, restart003) If restart000 was lost somehow then the logic would break as restart002 would be the new reference and restart003 would be purged in a subsequent run through..

aidanheerdegen commented 9 months ago

Other Notes I removed this from the documentation: " Intermediate restarts are not deleted until a permanently archived restart has been produced. For example, if we have just completed run 11, then we keep restart004, restart009, restart010, and restart011. Restarts 10 through 13 are not deleted until restart014 has been saved."

Because the logic previously was to not delete the last restart_history number of restarts. Intermediate restarts were not deleted until a permanently saved restart was only automatically the case when restart_freq was 5 or less as the default restart_history was also 5. Or if restart_history was also configured to be greater than or equal to restart_freq. I could change it so restart_history is set to be minimum of restart_freq to guarantee immediate restarts aren't pruned too early for integer restart frequency and add it back into the documentation?

Good catch. Yes i think restart_history used should be min(restart_history, restart_freq). It doesn't make sense to keep five previous restarts if the restart_freq is 2. So that would mean reinstating that bit in the docs, though it does need editing to ay it uses restart_history and keeps min(restart_history, restart_freq) of the most recent history files.

How does this affect date based pruning though? We can keep restart_history as an integer (just keep the n most recent restarts) but then what doesmin(restart_history, restart_freq)mean? Would you translaterestart_freq` into an integer number of runs? Not sure how they could be determined.

Other things I got caught out on was the difference between 12MS and 1YS

So this is just something that needs to be kept in mind by the user?

The other thing to keep in mind is that the code currently uses the next kept restart as a reference for the next time interval. This means that restarts saved may not be uniformly a fixed frequency from the first restart if there's irregular time steps.

For restart_freq: 6MS, theres a minimum 6 months between each saved restart. So with using the restart's list from above, the restarts saved would be: (1900/02/01, restart000) (1900/09/01, restart002) (earliest restart after 6 months) (1901/03/01, restart004) (6 months after previous kept restart)

The restarts from above were monthly, so I'm confused by (1900/02/01, restart000) (1900/09/01, restart002) shouldn't it be (1900/02/01, restart000) (1900/09/01, restart007)?

If instead there was a fixed frequency from the first restart, the restarts kept would be: (1900/02/01, restart000) (1900/09/01, restart002) (1901/02/01, restart003)

Which is the same as above .. no?

If restart000 was lost somehow then the logic would break as restart002 would be the new reference and restart003 would be purged in a subsequent run through..

I'm still not sure those restart numbers are correct, or if they are I understand why, so it's a bit hard to make an informed comment on this bit.

Just some general questions:

  1. If a user would like to always save restarts at the beginning of a year (or as close as possible) is that possible?
  2. Do we need to save some state about previous inferred restart frequencies to be able to warn users if they make a change that could be deleterious?
jo-basevi commented 9 months ago

I'm still not sure those restart numbers are correct, or if they are I understand why, so it's a bit hard to make an informed comment on this bit.

Sorry that's my bad!! I deleted the list of restarts I was talking about. So here is the list of restarts I was referring to:

(1900/02/01, restart000)
(1900/07/01, restart001)
(1900/09/01, restart002)
(1901/02/01, restart003)
(1901/03/01, restart004)
...
jo-basevi commented 9 months ago

If a user would like to always save restarts at the beginning of a year (or as close as possible) is that possible?

Yes, using 1YS would save the restart where current model run time is the earliest of each year.

Do we need to save some state about previous inferred restart frequencies to be able to warn users if they make a change that could be deleterious?

Would that require saving the previous restart frequency as an enviroment variable? Changing to a longer date-based period or a shorter integer frequency would not result in deletions. Though it would be tricky comparing an integer frequency with a datebased frequency..

As archive() runs after each experiment run, if there is greater than 1 restart to prune, that will indicate that either:

Instead of saving previous state, during setup and if archive is enabled, I could inspect what restarts it would prune (but not actually delete them..). If restart_freq and restart_history are the same, it should be an empty list at that stage. If not, emit a warning that the config has changed, and more restarts will be deleted at the next time archive is run.

How does that sound? Running that check in setup would also have the benefit for any warnings regarding invalid restart_freq values.

jo-basevi commented 9 months ago

Good catch. Yes i think restart_history used should be min(restart_history, restart_freq). It doesn't make sense to keep five previous restarts if the restart_freq is 2. So that would mean reinstating that bit in the docs, though it does need editing to ay it uses restart_history and keeps min(restart_history, restart_freq) of the most recent history files. How does this affect date based pruning though? We can keep restart_history as an integer (just keep the n most recent restarts) but then what does min(restart_history, restart_freq)mean? Would you translaterestart_freq` into an integer number of runs? Not sure how they could be determined.

Ok, an idea for an simple algorithm pattern to keep intermediate restarts is:

restarts_to_prune = []
intermediate_restarts = []
for restart in restarts:
    if restart_should_be_pruned():
        # keep track of intermediate restarts 
        intermediate_restarts.append(intermediate_restarts)

    else:  # restart should be saved
        # add prior intermediate restarts to the list of restarts_to_prune
        restarts_to_prune.extend(intermediate_restarts)
        intermediate_restarts = []

This will work for both integer and date_based frequency and wouldn't require a restart_history config variable at all.. If restart_history is useful to keep, it could act as an override option to ensure that the last n restarts are always kept?

aidanheerdegen commented 9 months ago

How does that sound? Running that check in setup would also have the benefit for any warnings regarding invalid restart_freq values.

That sounds like a great idea, thanks. I definitely think there is potential for Bad ThingsTM to happen, and checks and emitting information to the user is a good idea.

I definitely can see a use-case for someone to decide to change their restart_frequency when their experiment progresses. This could be to reduce disk usage or just because they realise they don't need so many.

But I think we want to make this sort of mass-deletion an interactive only option. So warn that lots of restarts would be the result of the next archive, so make them use a -f/--force flag to make it happen, otherwise no restarts are pruned?

Specifying flags isn't something you can do as part of an automated run, so by default it has to be done interactively.

jo-basevi commented 9 months ago

New Changes

Support for both mom5 and mom6 models

Intermediate restarts and restart_hisory

Force restart pruning

aidanheerdegen commented 8 months ago

Move mom driver restart parsing method to fms driver so it can work with mom6 and mom models (as both have the ocean_solo.res file). The other fms model subclass is GOLD. I’m not too sure whether it has similar file or not.. So the method checks first that the file exists and raises a descriptive error if not.

I was wrong when I said this was FMS based, I mean, they're all FMS based, and so end up using the same convention, but the naming of the file is in the model driver code, not FMS. Regardless I think it is the right decision to put the code in the FMS driver for the moment.

I guess longer term the best option might be to put it in a mixin class and inherit that from the models that use ocean_solo.res.

jo-basevi commented 8 months ago

Thanks for the review @aidanheerdegen! I've added in the suggested changes and the small fix to counter > 999. Should close #372

aidanheerdegen commented 8 months ago

I've replied where appropriate and resolved all the conversations, so good to merge.