Option to save intermediate simulation states not working

ohagen / gen3sis_rf

Repository of the R-packageGen3sis

https://project-gen3sis.github.io/R-package/

GNU General Public License v3.0

1 stars 0 forks source link

Option to save intermediate simulation states not working #1

Open castroinsua opened 4 months ago

castroinsua commented 4 months ago

The save_state parameter in run_simulation() is currently not working as intended. Calling this function with, for example, save_state = "last" will throw the following error:

Error in list.files(val$config$directories$output_val, full.names = TRUE) : 
  invalid 'path' argument

The reason for this is that val$config$directories$output_val is never initialised. I see that there is some code commented out in prepare_directories(), and my guess is that this code was removed either to avoid creating an empty directory that is never used or because some of the documented options for the save_state parameter (namely, passing a numeric vector of specific time steps to save) are not implemented yet.

Saving the state of the simulation can take a lot of time if the resulting .rds files are large (and they usually are, I think). This is more of a feature request, but when running simulations with thousands of time steps I would like to be able to save the state at specified times (for example, each 200 steps) to restart the simulation from there, but only keeping the latest state to save disk space.

Currently, the available options to save_state are: NA to never save, "all" to save all time steps (excessive in most cases, since writing the state files can take a lot of time), "last" which does the same but removing any previous saves, or a numeric vector to save at specified time steps (not implemented yet), keeping them all. I propose using two different options in run_simulation() instead:

save_state: Which time steps to save, NULL by default. Could be a numeric vector or "all". If not specified, do not create any saves.
last_state_only: A logical, TRUE by default. Whether to keep all saved states or only the last one.

Adding another option might not be the best approach, but I think it is the right thing to do if this feature is to be implemented.

castroinsua commented 4 months ago

I have created a new branch with my first attempt to restore this functionality. I did a few tests and things seem to be working properly, and it is possible to restart the simulation from the saved states reproducibly. I think the documentation of the new options I added can be a little bit confusing, so I might need to rephrase things. I will also wait for some opinions.

benj919 commented 3 months ago

Hi @castroinsua Thank you for your interest in this issue and taking the time to work on it. I'll reply here for the technical discussion.

The logic for saving and restoring is/was mostly intact, but the feature got left behind a little bit because of one big change that we didn't integrate fully: gen3sis can be run with a config file or a config object. Any attempt to restore the save and restart functionality must work consistently to be merged back into the main repository: Either with both options for a config file and object or by removing the config object functionality. It's possible that running a simulation from a config object is currently defunct and can be removed easily anyway. But I'll insist on either of these two solutions being implemented before merging back into the main package.
The starting point for that investigation is probably here: https://github.com/project-gen3sis/R-package/blob/master/R/gen3sis_main.R lines 102-110.
save_state: I strongly recommend sticking to a single parameter and just extending the accepted options: NA, "all", "last", numeric_vector. That gives a single point of reference. Already your default values leave some room for confusion: Does that mean the last state is saved or not?:
```
save_state = FALSE,
last_state_only = TRUE,
```
Does this save all or not?:
```
save_state = "all",
last_state_only = TRUE,
```
Have you benchmarked the save file creation in comparison to the overall runtime? First lesson of R and performance is that those two don't go together in the first place ;) large save states usually also imply large numbers of species, so running a simulation step will take a lot of time anyway.

Yes sticking to one parameter leaves a blind spot for saving every few steps and only keeping the last step. Experience tells me that this is okay as there's usually two options when running gen3sis with very few simulations in between: either the simulation runs very fast, and saving will be fast as well. Or it will run painfully slow and you'd want to save every timestep anyway.

val$config$directories$outputval: we decided at some point to just carry the main output directory in the config value, and have individual functions handle the subfolders directly. See the save* functions here https://github.com/project-gen3sis/R-package/blob/master/R/observations.R . That is not the most elegant way of handling it, but it was a bit cleaner at the time. Especially with the observer functionality allow users to create their own subfolders.

cheers :wave:

castroinsua commented 3 months ago

The logic for saving and restoring is/was mostly intact, but the feature got left behind a little bit because of one big change that we didn't integrate fully: gen3sis can be run with a config file or a config object.

I see. However, the saved state already contains the config object, so I am not sure why it should matter if the simulation was started using a config file or an object. I will have a look at the parts of the code you pointed out to make sure that I understand what you mean. Should not the config parameter of run_simulation() be ignored when a simulation is restarted from an intermediate state?

The reason I decided to add an option to only keep the last state is because I am running some simulations with thousands of time steps. Even when saving each 200 steps or so, I do not want to keep all of them, as each step takes hundreds of MB and I would need dozens of GB for each simulation when I just want one checkpoint from time to time. And creating a save each time step and deleting the older files is not really an alternative here, as it takes a long time to save.

I agree that the argument names make everything very unclear, and your examples are exactly what I was thinking about when I wrote that things are still confusing. I guess it is better to do as you say and keep only the original parameter, but then I will have to do some manual housekeeping and delete the files that I do not need in my particular case.

Have you benchmarked the save file creation in comparison to the overall runtime?

Yes, that is why I decided to save only each few hundred time steps. I think this does not have anything to do with R but with the speed at which you can write to disk. How slow it is ultimately depends on the simulation (that is, how large the state file is) and in any case the user is the one that has to decide if creating the saves is worth the time. In my case, it is large not because there is a large number of species in the simulation but because of the landscape.

As I said, I will take a look at the code to try to understand how the package handles config files and objects, because there was probably a very good reason why the code to save the state was commented out, but I still do not really know why.

Thank you very much for taking your time to review and discuss this.

benj919 commented 3 months ago

I see. However, the saved state already contains the config object, so I am not sure why it should matter if the simulation was started using a config file or an object. I will have a look at the parts of the code you pointed out to make sure that I understand what you mean. Should not the config parameter of run_simulation() be ignored when a simulation is restarted from an intermediate state?

The config parameter defines the "final" output directory; to restart a simulation you need to be able to find and access the existing simulation directory. The output directory in the run_simulation call is the parent directory, not the one for each individual simulation. The goal with the save and restart was to restart simulations on clusters with runtime limits. "Just" rerun the batch file with no(!) changes, and the simulations continue.

The reason I decided to add an option to only keep the last state is because I am running some simulations with thousands of time steps. Even when saving each 200 steps or so, I do not want to keep all of them, as each step takes hundreds of MB and I would need dozens of GB for each simulation when I just want one checkpoint from time to time. And creating a save each time step and deleting the older files is not really an alternative here, as it takes a long time to save.

Please don't, you're inviting a world of hurt onto yourself. The restart functionality is intended for restarting simulations in case of a crash, nothing else. Trying to have checkpoints etc is pointless; You shouldn't change the config for any given running simulation at all. If you want intermediate results for downstream analyses use the observer function and save what you need there.

What use case are you trying to cover by keeping the internal save states around?

As I said, I will take a look at the code to try to understand how the package handles config files and objects, because there was probably a very good reason why the code to save the state was commented out, but I still do not really know why.

a "very good" reason is relative, tbh it wasn't: We were working on refactoring a bunch of things, including directory handling. Handling directories changed, the config object instead of files was introduced, and we didn't fully match those changes in the restart function (see your error with output_val), plus the code snipped below. And then we just didn't follow up on it due to time constraints.

If you look into R/config_handling.R lines 62-66 you see this:

#output folders
  if(is.na(config_file)[1]|is(config_file, "gen3sis_config")) {
    config_name <- file.path("default_config", paste0(format(Sys.time(), "%Y%m%d%H%m"), "-", formatC(sample(1:9999,1), digits=4, flag="0")))
  } else {
    config_name <- tools::file_path_sans_ext(basename(config_file))
  }

That is a very questionable piece of code; running a default config is probably the worst idea on what to do, followed closely by just naming the output directory something random in case of a config object. That means if you try to restart a existing simulation from a config object it will not pick the existing/correct directory and start the simulation from scratch somewhere else.

A possible fix might be to add a variable into the internal state with the simulation name; either the config name, or a unique name for a config object. Or again, drop the config object from run_simulation. Both options are straight forward enough, but take a little of time and care to implement correctly.

cheers

castroinsua commented 3 months ago

I have reverted the change that added a new parameter and simply implemented the saving functionality as it was originally documented in run_simulation(), since I agree with you that one particular use case does not justify the added complexity and the change of interface. But I really wanted to have that functionality somehow, so I tried to write it. I have also taken care of the output_val directory in the corresponding functions instead of creating a field in the config object. At first, I just followed the pattern that I saw with the config$directories$output_plots directory, which is defined in prepare_directories(), but I did not know that you only wanted the config$directories$output field there. I guess the output_plots field remained there to be used in the observer function defined by the users.

Thank you for having the patience to explain all of this.

However, I have not still addressed the most important part, which is taking care of the possibility of using config objects instead of a config file. I would gladly help with this, but since I have never passed config objects to run_simulation(), I am not sure how to test it.

I think I now understand the problem you described, and it seems that it stems from how the directories field of the config object is set, because that is from where restore_state() tries to load the state object. The prepare_directories() function will set it to a random name with the date when a config object is passed to it (or more specifically, the directories field will be set from the value returned by this function). For what I could read, this function seems fragile when either the input_directory or the output_directory are not set as well, and also, in the code you cite:

config_name <- file.path("default_config", paste0(format(Sys.time(), "%Y%m%d%H%m"), "-", formatC(sample(1:9999,1), digits=4, flag="0")))

The format string seems to be wrong, as the second %m should be %M for minutes (not month); and in formatC(), I think that the width parameter should be used instead of digits. But as you mentioned, maybe this is not the best thing to do in this case anyway.

It seems that there is some work to do in prepare_directories(), but I am not sure how to proceed. What should we expect from the user? I am thinking that if a config object is used instead of a file, it would be better that the user explicitly sets in this object the input and output directories (in config$directories). And, in general, it seems that it would be a good idea to require the user to always specify the input and output directories. But since the code is already there, there are probably some people that do not specify them, and I do not want to break things if that is the case.

So I guess that one specific question that I have right now is: how would you normally create a config object from scratch, instead of generating it from a config file?

Sorry for not being more succinct, at the moment I am still trying to figure out things. Thank you again for your help.

castroinsua commented 3 months ago

What use case are you trying to cover by keeping the internal save states around?

I do not have access to a cluster at the moment, and these are tests anyway, so I am running them locally. I have to turn the computer off when I leave, so in practice it is like a crash. I am not changing the config at all. So what I need is to restart the simulation from the last saved state the next day. However, the "last" option did not really work for me, since it would take a prohibitive amount of time to use it, and what I really needed was the option to save only in some specified time steps, but preferably not keeping all of the saves, since they are rather large in this particular case. As I said, this is because of the landscape, so the simulation is not running very slowly necessarily, but there are a lot of time steps.

But we do not have to worry about this any more, since I have already removed the problematic additional parameter. I will take care of this particular case by myself, without complicating the code of the package. I thought that it was worth considering, that is why I wanted to discuss this with the people maintaining it.

benj919 commented 3 months ago

What use case are you trying to cover by keeping the internal save states around?

I do not have access to a cluster at the moment, and these are tests anyway, so I am running them locally. I have to turn the computer off when I leave, so in practice it is like a crash. I am not changing the config at all. So what I need is to restart the simulation from the last saved state the next day. However, the "last" option did not really work for me, since it would take a prohibitive amount of time to use it, and what I really needed was the option to save only in some specified time steps, but preferably not keeping all of the saves, since they are rather large in this particular case. As I said, this is because of the landscape, so the simulation is not running very slowly necessarily, but there are a lot of time steps.

But we do not have to worry about this any more, since I have already removed the problematic additional parameter. I will take care of this particular case by myself, without complicating the code of the package. I thought that it was worth considering, that is why I wanted to discuss this with the people maintaining it.

Answers in reverse order. Makes sense. Tbh I could also live with changing the behaviour of save_state=numeric_vector to "keep last entry only" :thinking: Besides some rare instances of debugging I don't see much use for the "all" variant, so I'd be happy to make the numeric_vector option more useful in general.

benj919 commented 3 months ago

I have reverted the change that added a new parameter and simply implemented the saving functionality as it was originally documented in run_simulation(), since I agree with you that one particular use case does not justify the added complexity and the change of interface. But I really wanted to have that functionality somehow, so I tried to write it. I have also taken care of the output_val directory in the corresponding functions instead of creating a field in the config object. At first, I just followed the pattern that I saw with the config$directories$output_plots directory, which is defined in prepare_directories(), but I did not know that you only wanted the config$directories$output field there. I guess the output_plots field remained there to be used in the observer function defined by the users.

In R/plotting_functions.R lines 314+ conditional_plot we explicitly set the plots directory again according to the rest of the saving functions. I think we/you could also remove the output_plots variable completely :shrug:

Thank you for having the patience to explain all of this.

You're welcome, I'm quite happy to see you using gen3sis and putting this effort into it to :)

However, I have not still addressed the most important part, which is taking care of the possibility of using config objects instead of a config file. I would gladly help with this, but since I have never passed config objects to run_simulation(), I am not sure how to test it.

R/config_handle.R create_input_config(...) (line 93+) allows you to get a config object, either empty or initialized from a config file.

I think I now understand the problem you described, and it seems that it stems from how the directories field of the config object is set, because that is from where restore_state() tries to load the state object. The prepare_directories() function will set it to a random name with the date when a config object is passed to it (or more specifically, the directories field will be set from the value returned by this function).

Yes.

For what I could read, this function seems fragile when either the input_directory or the output_directory are not set as well, and also, in the code you cite:
config_name <- file.path("default_config", paste0(format(Sys.time(), "%Y%m%d%H%m"), "-", formatC(sample(1:9999,1), digits=4, flag="0")))
The format string seems to be wrong, as the second %m should be %M for minutes (not month); and in formatC(), I think that the width parameter should be used instead of digits. But as you mentioned, maybe this is not the best thing to do in this case anyway.

Yes, we just needed a unique string. Technically correct, the best correct, on both counts for %M and width. Though again, yes, we don't really care what ends up in that string as long as it's unique :see_no_evil:

It seems that there is some work to do in prepare_directories(), but I am not sure how to proceed. What should we expect from the user? I am thinking that if a config object is used instead of a file, it would be better that the user explicitly sets in this object the input and output directories (in config$directories). And, in general, it seems that it would be a good idea to require the user to always specify the input and output directories. But since the code is already there, there are probably some people that do not specify them, and I do not want to break things if that is the case.

I think the elegant way to proceed would be to set a config_name variable in the config object. When parsing a config file you can take the file name, which corresponds to the current behaviour. When using a config object the user can set the name to whatever, and the create_directories function can just retrieve the name from the object. That would not fully solve the need to parse the config file name twice, once in create directories, and once in parse_config, but it would result in a consistent state between the config file parsing and the config object. (Or again, remove the config object option)

The implementation would probably follow something like this:

add a variable config_name in R/config_handling.R create_empty_config (line 207+). Possibly initialize it to NULL so it gets flagged when not set and verifying a config. Additionally populate it in create_input_config from the config file name.
adjust prepare_directories to either go by config file name, or the config_name value in case of a config object.
probably adjust the restore_state function to use this new config_name variable

So I guess that one specific question that I have right now is: how would you normally create a config object from scratch, instead of generating it from a config file?

See above, R/config_handle.R create_input_config(...)

Sorry for not being more succinct, at the moment I am still trying to figure out things. Thank you again for your help.

You're welcome and no worries, the code base is not the most straight forward one :see_no_evil: :eyes:

castroinsua commented 3 months ago

Though again, yes, we don't really care what ends up in that string as long as it's unique 🙈

If you want a more consistent method of assigning a unique name, you could maybe use a hash of the config object. Then, if you pass the same config object you should always get the same hash. However, you would probably need functions from some other package, so in the end it is introducing another dependency. That could be a possible solution if you want to do something similar, but I think there are better solutions here (such as what we were discussing).

Which approach would you prefer: setting a config_name variable in the config object or simply removing the config object option? I do not know how many people are using config objects instead of config files, so I do not have an opinion. I just do not want to break things that are working.

benj919 commented 3 months ago

:shrug: Tbh I don't know either how much the config object is used, if at all... I'm also not sure if it's in the vignettes. In general I'm not too much of a fan of deprecating functions out of development convenience though... The more I think about it, the more I prefer the config_name setting for that reason alone.

Tbh I don't like the hash naming too much either as it doesn't give any hint on where the corresponding output folder comes from. Admittedly it's rather weak argument when currently only the timestamp gives you any hint :see_no_evil: so back to setting an explicit name.

How motivated (and interested) are you in working on this issue? :innocent:

castroinsua commented 3 months ago

How motivated (and interested) are you in working on this issue? 😇

I believe that I can at least give it a try, we will see how it goes. I will get back to you when I have something written, or if I have any questions. I think it would be nice to have this functionality available again.

castroinsua commented 3 months ago

I have attempted to implement what you suggested. The few tests that I made with config objects seem to work as intended.

I was not completely sure where to add the new variable. By looking at verify_config() I decided to put it under the gen3sis$general. I had a problem when adding the config_name variable to the object returned by create_empty_config(). As it is right now, the order in which variables are defined matters because in plot_summary() these parameters from the config are indexed by position (see the sumss variable there, around line 128). I worked around this issue by placing the new variable at the end of the general category (I put it at the start at first). I found out the problem by chance in plot_summary() when running the tests, but it could potentially happen in other parts of the code. And I might have not noticed it if the types of the displaced elements in the list were the same. Although things are working for now, do you think I should still modify plot_summary()? I do not know if there are other functions that do the same thing.

Another thing I would like to address is defining the default input and output directories in prepare_directories() when they are not specified. I would like to change this code because there are some substitutions in the path that could fail. What is the assumed directory structure if these arguments are not provided? Another problem that exists is that the code will try to use strsplit() in a gen3sis_config object if you pass a config object and do not specify an input_directory.

Somewhat related to this, I am tempted to move the code that creates and verifies a config object before calling prepare_directories(), and then modify this function to assume that a config object with a valid structure is being passed to it. This could avoid some repetition and potentially make things more readable. However, it would lose the information of the original config file path, which is (as far as I know) only used in this function for determining default input and output directories (and the simulation subdirectory, but now it is possible to get this from the config_name in the object). I could put the config file path somewhere in the config object if it is needed.

castroinsua commented 3 months ago

Related to my previous comment, I just realised that there is an open issue in project-gen3sis/R-package#10.

benj919 commented 3 months ago

Sorry for the late answer, it sat in the reply box as a draft :man_facepalming: sending it off usually helps

I have attempted to implement what you suggested. The few tests that I made with config objects seem to work as intended.

I was not completely sure where to add the new variable. By looking at verify_config() I decided to put it under the gen3sis$general. I had a problem when adding the config_name variable to the object returned by create_empty_config(). As it is right now, the order in which variables are defined matters because in plot_summary() these parameters from the config are indexed by position (see the sumss variable there, around line 128). I worked around this issue by placing the new variable at the end of the general category (I put it at the start at first). I found out the problem by chance in plot_summary() when running the tests, but it could potentially happen in other parts of the code. And I might have not noticed it if the types of the displaced elements in the list were the same. Although things are working for now, do you think I should still modify plot_summary()? I do not know if there are other functions that do the same thing.

Thank you for the feedback. Those are all valuable reports. Feel free to also report them as bugs on the main repo, specifically the indexed element access. In general adding new variables at the end of the existing stuff is a good idea for exactly this reason

Another thing I would like to address is defining the default input and output directories in prepare_directories() when they are not specified. I would like to change this code because there are some substitutions in the path that could fail. What is the assumed directory structure if these arguments are not provided? Another problem that exists is that the code will try to use strsplit() in a gen3sis_config object if you pass a config object and do not specify an input_directory.

I'm open to remove all those substitutions and mandate providing paths :+1: So the check would be for the config file if the paths are provided, and for the config object if the paths are set in the object itself.

Somewhat related to this, I am tempted to move the code that creates and verifies a config object before calling prepare_directories(), and then modify this function to assume that a config object with a valid structure is being passed to it. This could avoid some repetition and potentially make things more readable. However, it would lose the information of the original config file path, which is (as far as I know) only used in this function for determining default input and output directories (and the simulation subdirectory, but now it is possible to get this from the config_name in the object). I could put the config file path somewhere in the config object if it is needed.

Possibly yes, we do copy the config file to the output directory. Though it's trivial to keep track of the file name and do a conditional copy there with no need to keep the path in the config object itself In R/gen3sis_main.R lines 106/107:

  } else if (is(config, "character")){
    file.copy(config, directories$output)

castroinsua commented 3 months ago

About the problems with plot_summary() indexing by position, I will open a separate issue in the main repository as you suggested after I am done reimplementing this save_state option.

I'm open to remove all those substitutions and mandate providing paths

I will ask Oskar for an opinion about this when I have the opportunity (possibly this week).

castroinsua commented 2 months ago

Since the modifications I made change the previous behaviour of the package, and we might also change the interface so that the input and output directories have no defaults, I think that for now I will try to incorporate these only to the new version of gen3sis (the one that is being developed in this repository). Apparently, the old version and this one will be developed separately, and changes to the old one should be mostly bug fixes.

I still have to get the new version running to be able to test things, but that is a different matter.

However, I do not think this is set in stone, and maybe the functionality to restart the simulation will be eventually added to the old version as well.