respec / HSPsquared

Hydrologic Simulation Program Python (HSPsquared)
GNU Affero General Public License v3.0

hdf5 file sizes are enormous #94

Open rburghol opened 2 years ago

rburghol commented 2 years ago

Testing system space requirements when running a roughly 300-1,000 reach simulation ported from an older version of HSPF. When using hsp2 import_uci, all UCI files and supporting WDM data are imported into an h5 database. Before even running the model, the resulting h5 file is nearly 200x larger than the component files. See Tests 1-4 below for file details and commands.

This is not insurmountable, but I think the only way this simulation would be doable on our current system is to be very aggressive with disk management: essentially deleting the h5 file after running the model, post-processing, and extracting the output data of interest into text files (a rough sketch of that workflow is below). Interested in your thoughts on how one might optimize this (if there is a way) @aufdenkampe
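
For illustration, a minimal sketch of that run-extract-delete workflow, assuming pandas is available and using a hypothetical HDF5 key (actual keys depend on the model and on HSP2's internal layout):

import os
import pandas as pd

h5file = 'OR1_7700_7980.h5'
# pull the timeseries of interest out to a text file
# ('/RESULTS/RCHRES_R001/HYDR' is a placeholder key, not a confirmed path)
df = pd.read_hdf(h5file, '/RESULTS/RCHRES_R001/HYDR')
df.to_csv('OR1_7700_7980_HYDR.csv')
# then reclaim the disk space
os.remove(h5file)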


Test 1: Importing test10.uci with hdf5 version 1.10.4

hsp2 import_uci test10.uci test10.h5
ls -lhrt
total 29M
-rw-rw-r-- 1 rob rob 160K Jun 17 13:00 test10.wdm
-rw-rw-r-- 1 rob rob  36K Jun 17 13:00 test10.uci
-rw-rw-r-- 1 rob rob  27M Jun 17 13:02 test10.h5

Test 2: Importing UCI from Chesapeake Bay Model 5.3.2 single RCHRES only (no PERLND). Download files to test from: https://github.com/HARPgroup/HSPsquared/tree/master/tests/test_cbp_river

rm OR1_7700_7980.h5
hsp2 import_uci OR1_7700_7980.uci
ls -lhrt
-rwxrwxr-x 1 rob rob 480K Jun 13 20:40 prad_A51037.wdm
-rwxrwxr-x 1 rob rob 4.3M Jun 13 20:40 met_A51037.wdm
-rwxr-xr-x 1 rob rob  80K Jun 13 20:41 ps_sep_div_ams_p532sova_2021_OR1_7700_7980.wdm
-rw-rw-r-- 1 rob rob  11K Jun 15 16:19 OR1_7700_7980.uci
-rwxr-xr-x 1 rob rob 1.7M Jun 15 20:24 OR1_7700_7980.wdm
-rw-rw-r-- 1 rob rob 999M Jun 17 11:52 OR1_7700_7980.h5

Test 3: Importing UCI from Chesapeake Bay Model 5.3.2 single PERLND only (no RCHRES). Download files to test from: https://github.com/HARPgroup/HSPsquared/tree/master/tests/test_cbp_land

rm forA51800.h5
hsp2 import_uci forA51800.uci
ls -lhrt 
-rwxrwxr-x 1 rob rob 4.3M Jan 14 15:56 met_A51800.wdm
-rwxrwxr-x 1 rob rob 520K Jan 14 15:56 prad_A51800.wdm
-rwxr-xr-x 1 rob rob  14M Jan 14 15:57 forA51800.wdm
-rw-rw-r-- 1 rob rob 7.0K Jan 18 14:43 forA51800.uci
-rw-rw-r-- 1 rob rob 441M Jan 21 17:59 forA51800.h5

Test 4: Importing test10.uci with hdf5 version 1.13.1 - same file size as with version 1.10.4.

rm test10.h5
hsp2 import_uci test10.uci test10.h5
ls -lhrt 
-rw-rw-r-- 1 rob rob 160K Jun 17 13:00 test10.wdm
-rw-rw-r-- 1 rob rob  36K Jun 17 13:00 test10.uci
-rw-rw-r-- 1 rob rob 2.0M Jun 17 13:00 TEST10_hsp2_compare.ipynb
-rw-rw-r-- 1 rob rob  27M Jun 17 13:02 test10.h5
PaulDudaRESPEC commented 2 years ago

@rburghol, I concur that the h5 file size deserves more attention.

At this stage of development I've personally focused on making sure the numbers produced by HSP2 are equivalent to those produced by HSPF (within certain thresholds, etc.). To that end, for my test simulations I've written all possible output timeseries to h5 (using saveall) for HSP2 and all possible output timeseries to hbn for HSPF. When testing that way, the file sizes from HSP2 and HSPF are within the same order of magnitude.

FWIW, I also note that importing the UCI by itself, with no timeseries data, results in a relatively large h5 file.

All that's to say, focusing on the h5 file size issue would be a worthy pursuit.

rburghol commented 2 years ago

Thanks for the follow-up @PaulDudaRESPEC -- we are taking a deep dive into this as we learn the system, and there appear to be some promising avenues. I will update here if we have anything of note.

I concur with your assessment about just importing the UCI; the majority of the size expansion happens there in my tests as well. A quick way to see where the space goes is sketched below.
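
For example, a minimal sketch (assuming PyTables, which HSP2 already uses under the hood) that tallies on-disk size per top-level HDF5 group, to show whether the bloat sits in the imported tables, the timeseries, or elsewhere:

import tables  # PyTables

sizes = {}
with tables.open_file('OR1_7700_7980.h5') as h5:
    # walk every leaf (table/array) and accumulate its on-disk size
    # under its top-level group name
    for leaf in h5.walk_nodes('/', classname='Leaf'):
        top = leaf._v_pathname.split('/')[1]
        sizes[top] = sizes.get(top, 0) + leaf.size_on_disk
for group, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f'/{group}: {nbytes / 1e6:.1f} MB')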

One thing we are going to look into is whether hdf5 has options to enable or disable indexing when creating new datasets. In my experience with other databases, sometimes having numerous indexes can multiply table storage greatly, and perhaps there are defaults that do this in the case of hsp2 storage.
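
One quick, non-invasive experiment along those lines: PyTables ships a ptrepack utility that can rewrite an existing h5 file with compression enabled, so the effect on file size can be measured without changing any HSP2 code (the complevel/complib choices here are just examples):

ptrepack --complevel=9 --complib=blosc OR1_7700_7980.h5 OR1_7700_7980_packed.h5
ls -lhrt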

Or maybe it's something else :)

aufdenkampe commented 1 year ago

@rburghol & @PaulDudaRESPEC, the enormous size of the HDF5 files is due to the fact that @rheaphy turned off compression back in early 2020, in large part because the HDFView desktop software can't view compressed data values, but also because at that time there were a number of real issues with HDF5 v1.10.x and with our PyTables library's compatibility with newer HDF5 versions. For details on that history, see:

The good news is that since January 2022 we now have solutions to all those issues.

Unfortunately, our pause in funding means that we never had an opportunity to fully implement the HDF5 improvements that became available in January 2022.

If you are interested in exploring the benefits of the newer PyTables, I encourage you to install from our development environment, environment_dev.yml, which updates PyTables and other key libraries, and which would open the door to turning on HDF5 compression.
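
For anyone following along, that would look something like this (the environment name to activate is whatever environment_dev.yml defines):

conda env create -f environment_dev.yml
conda activate <env-name-from-environment_dev.yml>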

ptomasula commented 1 year ago

You should be able to leverage compression with the code in its current state. You just need to set the jupyterlab argument to True in your call to hsp2.main. That's probably poor naming and should be refactored to something more logical like compression, but the jupyterlab argument of main is what is being passed to the compress argument of save_timeseries.
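
A minimal sketch of that, assuming the entry point is importable as HSP2.main.main and takes the saveall and jupyterlab keywords described in this thread:

from HSP2.main import main

# jupyterlab=True is what currently enables HDF5 compression, since it is
# forwarded to the compress argument of save_timeseries (per the note above)
main('OR1_7700_7980.h5', saveall=True, jupyterlab=True)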

aufdenkampe commented 1 year ago

As an interesting aside, HDF5 1.12.2, released in April 2022, added substantial improvements to parallel compression performance and memory usage. I'm hoping we can leverage that one way or another.

rburghol commented 1 year ago

Thanks to all -- this is a really great bit of info/progress. I will keep you posted as I do testing on the file size/compression issue.

PaulDudaRESPEC commented 7 months ago

In addition to the compression issues discussed on this thread, we have now implemented an enhancement to control the output time step: the BINARY-INFO table can be used to specify aggregation of the output time series to daily, monthly, or annual values. The first cut of this enhancement is available in the develop branch, here: https://github.com/respec/HSPsquared/commit/4431180ebf8f61b8a24e7572f8f76d8ffb13f55d