respec / HSPsquared

Hydrologic Simulation Program Python (HSPsquared)
GNU Affero General Public License v3.0
43 stars 17 forks source link

IO Abstraction #68

Closed ptomasula closed 2 years ago

ptomasula commented 2 years ago

Related Issue

59

Problem

As alluded to in issue #59, the model code is tightly coupled with HDF5. Specifically, IO operations occur directly within the model code and are handled through the use of a Pandas HDFStore instance. This tight coupling decreases code maintainability and interoperability (i.e. support for different data formats) because any changes to IO behavior involves altering numerous lines of code in many different modules.

Solution

This pull request implements IO abstraction through the use of a IOManager class. This pull request also establishes a set of IO Protocols to define the interface between a class implementing IO for a specific data format and the rest of the application. Additionally this pull request refactors the model code to route all IO operation through an IOManager class instance. This increases code maintainable because nearly all changes to IO behavior can now be addressed through modifications to the class implementing the various IO protocols and/or the IOManager class. Additionally the use of Protocols provide for increased interoperability by defining the expected interface between a data source and the application. Additional classes can now be developed that implement these protocols to provide support for data sources beyond HDF5. Lastly, this pull request includes a simple form of memory caching for timeseries by storing previously loaded timeseries inside of a dictionary in the IOManager class.

Testing

Automated testing was conducted using test10 and the RegressTest class. The HTML report was compared with that of one generated running test10 on the develop branch to ensure the results matched.

Next Steps

This pull request implements IO Abstraction and implementation of IO Protocols for HDF5 through the HDF5 Class. Further work is need to support more data formats through the development of additional classes implementing the IO Protocols. Additionally, a simply memory caching approach was implemented, which stores all previously loaded or written timeseries in a dictionary. A subsequent request for the same timeseries will first try to return a copy from the in memory dictionary before loading from the data source. This approach works well for smaller and short timespan models, but may grow to have a large memory footprint for the execution of larger models spending longer timespans. Further refinement to memory caching could be accomplished through some type of inventory management approach such as First-if First-out (FIFO) and the use of an OrderedDict. I purpose the development teams discuss the most appropriate inventory management system for cached timeseries.

ptomasula commented 2 years ago

@PaulDudaRESPEC @aufdenkampe @tredder75

With this pull request, I believe the IO abstracted version of the model is ready to merge in. As mentioned in the PR, all IO is now routed through the IOManager instance. To create a IOManager instance, it needs to be initialized with one or more class instance(s) for a data source implement various IO Protocols. In it's simplest conception though, you can just create a single HDF5 class instance and pass that to the IOManager.

  from HSP2IO.hdf import HDF5
  from HSP2IO.io import IOManager
  hdf_path = 'some_string_pathway_to_your_hdf5_file'
  hdf5_instance = HDF5(hdf_path)
  io_manager = IOManager(hdf5_instance)
  main(io_manager)

This is very similar to how executing the model currently works, but still slightly different with a few extra up front steps. This means we will also need to update the example notebooks to reflect this new approach. With the workshop in January quickly approaching, let me know your thoughts on pulling this in soon versus waiting until afterwards. I'm leaning towards pulling this is sooner so that we just teach folks the new way IO works for the model but I can see a case for waiting as well.

aufdenkampe commented 2 years ago

@ptomasula, thank you for developing this critical refactoring of the code! This work will unlock many additional opportunities to enhance performance and interoperability!

Regarding your question in https://github.com/respec/HSPsquared/pull/68#issuecomment-998252873 on whether or not we want to merge before the January workshops for ERDC, I'm strongly leaning toward moving forward with the merge now.

As @sjordan29 and I started updating the tutorial notebooks in early December (here: https://github.com/LimnoTech/HSPsquared/), I was already looking toward an immediate need to read the UCI and WDM files from our test directories yet write the output files to a temporary tutorials directory. I think this will help with that, or at least provide opportunities for us to demo the I/O approaches of our future in the workshop.

If you say it works with Test10, then I would like to merge into develop today.

PaulDudaRESPEC commented 2 years ago

@ptomasula and @aufdenkampe , I began reviewing these enhancements today -- so far they all look great! It appears that I need to start using python 3.8 (instead of 3.7) however to use Protocol from typing -- I'll make that upgrade and continue testing....

aufdenkampe commented 2 years ago

@PaulDudaRESPEC, if you have Conda working, you can create a separate Python 3.8 Conda environment for testing, while still keeping your 3.7 environment.

Try using my installation instructions and the environment_dev.yml environment file.

PaulDudaRESPEC commented 2 years ago

This all looks great, nice work. Note issue #72 was discovered while testing this enhancement, but it clearly is not a problem created by these code changes.