respec / HSPsquared

Hydrologic Simulation Program Python (HSPsquared)
GNU Affero General Public License v3.0

Abstract I/O & storage beyond HDF5 for flexibility, performance, & cloud #59

Closed aufdenkampe closed 2 years ago

aufdenkampe commented 2 years ago

This high-level issue pulls together several past, current, and near-future efforts (and more granular issues).

The tight coupling of model input/output (I/O) with the Hierarchical Data Format v5 (HDF5) during the HSP2 runtime limits both performance (see #36) and interoperability with other data storage formats, such as the cloud-optimized Parquet and Zarr formats (see Pangeo's Data in the Cloud article), which integrate tightly with high-performance data structures from the foundational PyData libraries Pandas, Dask DataFrames, and Xarray.

Abstracting I/O using a class-based approach would also unlock capabilities for within-timestep coupling of HSP2 with other models. Specifically, HSP2 could provide upstream, time-varying boundary conditions for higher-resolution models of reaches, reservoirs, and the groundwater-surface water interface.
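To make the idea concrete, here is a minimal sketch of what a class-based I/O abstraction could look like. All class and method names here (`IOManager`, `read_ts`, `write_ts`, etc.) are hypothetical illustrations, not the actual HSP2 API; the point is that the runtime depends only on the interface, so HDF5 becomes one backend among several rather than being hard-wired.

```python
# Sketch of a class-based I/O abstraction (hypothetical names, not the
# actual HSP2 implementation). The runtime talks only to IOManager;
# each storage format supplies its own concrete backend.
from abc import ABC, abstractmethod

import pandas as pd


class IOManager(ABC):
    """Abstract interface between the model runtime and storage."""

    @abstractmethod
    def read_ts(self, path: str) -> pd.DataFrame:
        """Read a timeseries DataFrame from storage."""

    @abstractmethod
    def write_ts(self, path: str, df: pd.DataFrame) -> None:
        """Write a timeseries DataFrame to storage."""


class MemoryIO(IOManager):
    """In-memory backend, useful for tests and within-timestep coupling."""

    def __init__(self):
        self._store = {}

    def read_ts(self, path):
        return self._store[path]

    def write_ts(self, path, df):
        self._store[path] = df


class ParquetIO(IOManager):
    """Parquet backend (requires pyarrow or fastparquet)."""

    def read_ts(self, path):
        return pd.read_parquet(path)

    def write_ts(self, path, df):
        df.to_parquet(path)
```

With this shape, an HDF5 or Zarr backend would be another subclass, and a coupled downstream model could consume results through a `MemoryIO`-style backend without any intermediate file.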

Our overall plan was first outlined and discussed in https://github.com/LimnoTech/HSPsquared/issues/27 (Refactor I/O to rely on DataFrames & provide storage options). In brief, we would refactor to:

cc: @PaulDudaRESPEC, @ptomasula

aufdenkampe commented 2 years ago

This is a great demo of performance profiling and optimization approaches, including I/O. https://anaconda.org/TomAugspurger/pandas-performance/notebook

Here's a great blog post by the same author that discusses benchmarking in more detail: https://tomaugspurger.github.io/maintaing-performance.html

aufdenkampe commented 2 years ago

On Nov. 18, @PaulDudaRESPEC committed 0ed2302f43efbb98e8ecc2f19177ef0b609b617f, which replaced multiple reads of data from storage with a single read into memory and subsequent access to those in-memory objects. He shared this comment via email:

I just implemented a very significant performance improvement. Looking back at get_flows, I realized the old design was going back to the h5 file to read timeseries computed by upstream operations – I hadn’t really focused on that before – but of course very inefficient to do that reading from the file. Instead of doing that, I’ve saved those timeseries in-memory for later use. I’ve checked it into the repo if you’d like to take a closer look.

One example project used to run in 3.5 minutes, now runs in less than 1.5 minutes!
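The pattern described above can be sketched as a simple read-through cache. This is an illustration of the technique, not the actual committed code; the class and method names (`FlowCache`, `get_flows`, `save`) are hypothetical, and the `file_reads` counter is included only to show that cached keys never touch the HDF5 file.

```python
# Sketch of the optimization described above (hypothetical names):
# timeseries computed by upstream operations are saved in memory, so
# downstream get_flows() calls avoid going back to the h5 file.
import pandas as pd


class FlowCache:
    """Read-through cache for computed timeseries."""

    def __init__(self):
        self._cache = {}
        self.file_reads = 0  # for illustration: counts slow-path reads

    def save(self, key, ts):
        """Store a freshly computed timeseries for later reuse."""
        self._cache[key] = ts

    def get_flows(self, key, h5_path):
        if key in self._cache:
            return self._cache[key]          # fast path: in-memory
        self.file_reads += 1                 # slow path: read from file
        ts = pd.read_hdf(h5_path, key)
        self._cache[key] = ts
        return ts
```

Avoiding the repeated file reads is what produced the reported speedup (3.5 minutes down to under 1.5 minutes on one example project).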

aufdenkampe commented 2 years ago

The foundation of this work was completed and tested with:

So we'll close this issue.

We'll expand on the I/O Abstraction capabilities, including implementing additional storage formats, with new issues.