Hydrologic Simulation Program Python (HSPsquared)

Improve performance of HDF5 read/write #36

Closed aufdenkampe closed 2 years ago

aufdenkampe commented 4 years ago

@rheaphy emailed his 2020-05-24 "HSP2 status update":

Most of the last 2 weeks was spent investigating the much slower run times of HSP2 compared with HSPF. Prior to the "new" HSP2, the old HSP2 was 1.6 times slower than HSPF. I had expected this difference to be much less with the new design. Instead, it started out almost 4 times slower! Since Python, Pandas, Numpy and Numba had all changed significantly, it is very hard to understand where the slowdown occurred. With yesterday's update, I had cut this to a bit above 2 times slower (depending on whether the HDF5 file was already created or not). Using faster write methods in HDF5 seemed to really speed things up, but caused warning messages. I never found any problem in either the HDF5 file or the run results when the warnings were visible. Since warning messages would bother our users, I rejected using the faster write methods to improve the run time. (I still keep the option of using the faster write methods while disabling the warning messages as a last resort.) I believe the only difference between the fast and slow writing methods is whether they flush() after every write or not.
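
The email doesn't spell out the exact calls, but the flush-per-write difference Bob describes can be illustrated with pandas' HDFStore. This is only a sketch under that assumption; the real HSP2 write path and function names may differ:

```python
import pandas as pd

def write_slow(store_path, results):
    """Flush to disk after every dataset write (safer, slower)."""
    with pd.HDFStore(store_path) as store:
        for path, df in results.items():
            store.put(path, df)
            store.flush(fsync=True)   # force each write to disk immediately

def write_fast(store_path, results):
    """Defer flushing until the store is closed (faster)."""
    with pd.HDFStore(store_path) as store:
        for path, df in results.items():
            store.put(path, df)
        # single implicit flush when the context manager closes the store
```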

Basically, I started using BLOSC compression on all timeseries and computed results when storing into the HDF5 file. This cut the HDF5 file size almost in half as well. Since the newer HDF5 library keeps the HDF5 file from growing significantly with additional runs, this is great. (The old HDF5 library would really let the HDF5 file grow to huge sizes!) And no warnings. I did not compress the UCI-like tables so that other tools like HDFView would display them properly. While I could compress everything and still use HDFView if I registered the compression library with the HDF library, I don't want to make our users do this themselves. So this is a compromise for now.
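
For reference, compressing a computed-results DataFrame with BLOSC through pandas looks roughly like this. The file name and key layout below are illustrative, not the actual HSP2 paths:

```python
import numpy as np
import pandas as pd

# Hypothetical computed-results DataFrame
df = pd.DataFrame(np.random.rand(100_000, 8),
                  columns=[f"var{i}" for i in range(8)])

# Compressed write: small and fast, but HDFView cannot decode the dataset
# unless the BLOSC filter is registered with it.
df.to_hdf("results.h5", key="RESULTS/PERLND_P001/PWATER",
          complib="blosc", complevel=9)

# Uncompressed table-format write: larger and slower, but browsable in HDFView.
df.to_hdf("results.h5", key="RESULTS/PERLND_P001/PWATER_table", format="table")
```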

I suspect that the changes to Numba are primarily responsible since I now need to copy information from the regular Python dictionary containing UCI information to the Numba typed dictionary in order to transfer the data into the hydrology codes. I spent time reviewing the Numba team meeting notes and issues and found a related issue concerning the new Numba typed lists. The contributors to the discussion indicated this could impact the typed dictionary as well. The Numba team is investigating the issue, so I will wait for more information before I address this improvement direction. I will do other profiling tests to look for other possible places for the slow execution.
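
The dictionary copy described above looks roughly like this. It is a sketch with made-up keys; the real UCI structure is more involved:

```python
from numba import njit, types
from numba.typed import Dict

# Plain Python dict holding (a slice of) UCI information
uci = {"LZSN": 8.0, "INFILT": 0.06, "UZSN": 1.2}

# Copy into a Numba typed dict so it can cross into @njit-compiled hydrology
# code. This per-item copy is the overhead being discussed.
typed_uci = Dict.empty(key_type=types.unicode_type, value_type=types.float64)
for key, value in uci.items():
    typed_uci[key] = value

@njit(cache=True)
def use_params(params):
    return params["LZSN"] + params["UZSN"]

print(use_params(typed_uci))
```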

I still think I can make HSP2 nearly as fast as HSPF, but it will take more time. At least, it is still fast enough to use again. I remember the early days when calleg took over 40 minutes to run instead of a little over a minute. (HSPF takes 32.2 seconds on my machine; worst-case HSP2 runs now take 1 minute 23 seconds (clean start after restarting the Python kernel and creating a new HDF5 file) to 1 minute 19 seconds if the kernel had previously run HSP2.) Without Numba, HSP2 takes 13 minutes 25 seconds, so Numba does help a lot!

I see a lot of profiling in my future.

Some of the recent commits are https://github.com/respec/HSPsquared/pull/32/commits/cca2b0cc6240e60204a5f3baf06938cc4aa3e816, https://github.com/respec/HSPsquared/pull/32/commits/d154e55ac3f2c6140443262a08d9f1d3abe53470, and https://github.com/respec/HSPsquared/pull/32/commits/e92c035536c124a1794d2458e0b9c7ccc0bb8244.

aufdenkampe commented 3 years ago

In Bob's June 12 "HSP2 Status" email, he writes in response to PaulD's issues with viewing results tables in HDFView:

In order to improve performance, I switched to a faster HDF5 read/write method for computed results and added blosc compression (which significantly speeds things up). I believe that the slowdown is due to the extra data movement in the new Numba dictionaries, which I hope is improved soon. Part of the slowdown is also due to https://github.com/numba/numba/issues/5713, which is being worked on: "List() and Dict() always recompile _getitem, _setitem, _length, etc. every time; maybe should cache? #5713"

  1. I can easily add an option to the main program which writes the old way, but leaves the faster I/O as the default (see the sketch after this list).
  2. You can modify the line doing the results writing (main(), about line 166) as a temporary patch: change `df.to_hdf(store, path, complib='blosc', complevel=9)` to `df.to_hdf(store, path, format='t', data_columns=True)`.
  3. I can revert to the old behavior, but at a very significant slowdown at this time.
  4. I can (and will) see if I can't combine the old HDF5 format with compression and perhaps gain some performance back. But HDFView will not work on the computed results (see below).
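
A minimal sketch of what option 1 might look like, assuming a hypothetical `compress` flag around the results write in main(); the function and variable names are illustrative, not the actual HSP2 code:

```python
def save_results(df, store, path, compress=True):
    """Write one computed-results DataFrame, with the faster compressed
    write as the default and the old HDFView-friendly write as an option."""
    if compress:
        # fast default: BLOSC compression (not readable in HDFView)
        df.to_hdf(store, path, complib="blosc", complevel=9)
    else:
        # old behavior: uncompressed table format, browsable in HDFView
        df.to_hdf(store, path, format="t", data_columns=True)
```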

When compression is used, the HDFView program will not show the data unless the compression algorithm is registered with the HDFView code. I thought this would be too complicated for ordinary users. I had hoped that utilities like ViewPERLND, ViewIMPLND, and ViewRCHRES would be the way ordinary users could view and analyze their results. I am working on a convenience function to extract data (with optional aggregation to common intervals such as day, month, year, etc.) without the user needing more than the name of the data and the segment ID. The View utilities would be updated to use this new tool.
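
A sketch of what such a convenience function might look like; the function name, key layout, and arguments are hypothetical, not the actual HSP2 API:

```python
import pandas as pd

def get_results(hdf_path, operation, segment, table, freq=None):
    """Hypothetical helper: fetch computed results for one segment and
    optionally aggregate in time (e.g. freq='D', 'M', or 'Y')."""
    key = f"RESULTS/{operation}_{segment}/{table}"   # assumed key layout
    df = pd.read_hdf(hdf_path, key)                  # decompression is transparent
    if freq is not None:
        # assumes the results carry a DatetimeIndex
        df = df.resample(freq).mean()                # or .sum(), depending on the variable
    return df

# e.g. daily-mean PWATER results for PERLND P001 (illustrative names)
# daily = get_results("test10.h5", "PERLND", "P001", "PWATER", freq="D")
```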

Note: When using Pandas to read and write HDF5 data, the compression is essentially invisible to the user since Pandas (via Pytables internally) looks at the HDF5 metadata it wrote to the associated dataset to determine the format and optional compression algorithm used so it can extract the data appropriately.

I am also looking at the feasibility of continuing the work that stopped on this JupyterLab project when the author's employer hit him with critical deadlines on his day job: https://github.com/jupyterlab/jupyterlab-hdf5. It doesn't work with the Pandas DataFrame-like display and needs to be updated to JupyterLab 2+. I had put this at lower priority than adding the water quality modules. I was considering a second possibility: to provide a routine that displays Pandas DataFrames (and Pandas Series) in JupyterLab with a grid tool to allow easy editing of data (like a vastly improved Qgrid). It would be based on the Phosphor/Lumino data grid.

Please provide feedback for the best option. Thanks!

aufdenkampe commented 3 years ago

From Bob's July 13 "HSP2 Status" email:

Status Update:

  • HSP2 Performance problem understood,
  • HSP2 punch list nearly complete - hoping for feedback,
  • I plan to start work on all water quality modules by mid-week.

Details:

PERFORMANCE PROBLEMS

When I completed the rewrite of HSP2 to Python 3, I was shocked by the performance degradation. Previously, HSP2 was only slightly slower than HSPF after the first run in a session forced the Numba compilation. The first run in a new session was significantly worse since only some of the Numba-compiled code was cached for reuse and needed some recompilation. Numba code changes after the first HSP2 version included much better caching, but this required making changes to HSP2 code. I expected significant performance improvements in the new version, but it was 2 - 3x slower. I now understand the reason, which was not clear from the profiling tools.

HDF5 released a significant new version, 1.10.x, which added welcome features like reusing freed space (within a session). Unfortunately, it came with a very significant slowdown (up to a factor of 3x)! I finally found HDF5 presentations last week which explained this problem, and I am attaching a slide from one of these presentations ("HDF5 Performance Issue"). Note: for some reason, the HDFGroup major releases are bumped by 2, so the versions go from (say) 1.6.x to 1.8.x to 1.10.x to 1.12.x, which they simply call 1.8 etc. in their presentations.

HDF5 is automatically installed (with a Python wrapper) when installing pytables or h5py. The user is not given a choice of which HDF5 DLL version is used since the HDF5 API does change between versions (even minor versions prior to 1.10). The original HSP2 used HDF5 version 1.8, which generally was only slightly slower than earlier versions like 1.6. Apparently, some projects needing the many new features found in the 1.10 release - and providing grants for this work - accepted the slowdown, temporarily. Since HDF5 is critical in supercomputers and projects with very large data, it had always been taken for granted that performance would be only slightly different between versions, since data storage and retrieval can never be fast enough.

When I rebuilt HSP2, pytables had been updated and now used HDF5 1.10.5, which suffers from the terrible performance hit. The Python profilers can't look into the actual C-language HDF5 DLL, so the slowdown showed up only in the routines in pytables which made HDF5 calls. But pytables is actually installed and called by Pandas, so it wasn't clear where the problem was located. Pandas has gone through very significant code development between the first and second HSP2 code, so I had assumed that the problem was due to changes in Pandas, since they have frequent, but temporary, performance problems, and also due to Numba, which made many changes as well.

Although I was able to improve HSP2 performance by BLOSC compression and by changing from storage as pytables tables to raw storage, there was still a significant gap between the original HSP2 and the new version. (But at least it is acceptable for now.)

I just installed a new version of pytables (and h5py) using HDF5 1.10.6, which claims some performance improvement. See https://www.hdfgroup.org/2019/12/release-of-hdf5-1-10-6-newsletter-170/. Of course, it took time for pytables and h5py to update to this new version. By their standards, all the data saved by HSP2 is probably "small", so we might benefit. But this needs testing. I plan to spend the first part of this coming week checking if this will make any significant improvement.

The original HSP2 used both h5py and pytables (pytables via Pandas). I stopped this during the development of the new version because the number of user warnings was alarming if both were used in the same program. It had not started out this way. I now understand it was because pytables and h5py didn't always install the same HDF5 DLL version. This was such a problem that h5py stopped using the globally accessible HDF5 DLL and instead installed the DLL privately inside its own code package. So I can (possibly, I hope) use both in HSP2 again. The reading of the HDF5 file table of contents (the "keys") was much slower in pytables, so I can possibly gain more performance by getting this information using h5py. I had also used h5py to write arbitrary text data into the HSP2 HDF5 file, so I might be able to do this again. (Python stopped using ASCII characters in Python 3 and uses UTF-8 encoded Unicode. HDF5 doesn't handle Unicode well, so both h5py and pytables do some workarounds to make this possible. This might make the data storable and retrievable, but not visible to HDFView, which is not ideal.)
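
Listing the HDF5 file's keys with h5py rather than through pandas/pytables could look like this. A sketch only; the file name is illustrative and the timing difference would need to be measured on a real HSP2 file:

```python
import h5py
import pandas as pd

h5file = "test10.h5"   # example file name

# Via pandas/pytables: walks the whole node tree, which was found to be slow
with pd.HDFStore(h5file, mode="r") as store:
    pandas_keys = store.keys()

# Via h5py: visit the group hierarchy directly
with h5py.File(h5file, "r") as f:
    h5py_keys = []
    f.visit(h5py_keys.append)   # collects every group/dataset path name
```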

HDFGroup had originally announced that the 1.12 HDF5 version would use Unicode (UTF-8 encoded), but they were not able to make this work in the original 1.12 release. This would make everyone's life much easier if/when they get it done. But again, they got a big grant for some other important features, so they delivered those and deferred the Unicode issue, which they say is difficult. They seem to be working on HDF5 1.10 and 1.12 in parallel now.

I have researched how to determine which version of HDF5 is used, and I will now print this out with the other software version numbers after each HSP2 run. This will make future tracking of this issue easier. It is not possible to determine which version of HDF5 was used to write the data, which is a "feature" since the HDF Group advertises the forward and backward compatibility between versions (with asterisks in their presentation slides).
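
Reporting the HDF5 library version alongside the other package versions can be done like this; a sketch of the idea, not necessarily the exact HSP2 reporting code:

```python
import numba
import numpy as np
import pandas as pd
import h5py
import tables

# Report which HDF5 library each wrapper was built against, plus the other
# package versions, at the end of a run.
print("pandas  ", pd.__version__)
print("numpy   ", np.__version__)
print("numba   ", numba.__version__)
print("pytables", tables.__version__, "(HDF5", tables.hdf5_version + ")")
print("h5py    ", h5py.__version__, "(HDF5", h5py.version.hdf5_version + ")")
```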

The HDFGroup has released new versions of 1.10 and 1.12, so their stated short-term goals should now focus on performance improvement.

PUNCH LIST

I have only a few items left on my punch list for tagging an HSP2 version 1.0 alpha. I was waiting to make a check-in until I received feedback about anything that would make HSP2 better. I will defer the tagging until I get the HDF5 1.10.6 work complete, so it is NOT TOO LATE to provide feedback.

WATER QUALITY MODULES

I plan to start work on all modules next week (depending on the time it takes for the HDF5 1.10.6 work). The first phase is to modify each file to get its data from the HDF5 file. It is possible that I might discover the need for some slight modifications to the HSP2 HDF5 file format (and to the readUCI module that creates it), but I hope I got most of this done correctly already. I plan for this pass to take one day per module. The second pass will focus on removing any Python errors or warnings and bringing each module to reasonable readability and maintainability. This pass will also take about 1 day per module. (There are about 18 modules in the development branch.)

I will need the new test cases to check this work in the third pass. The sooner the better.

Any priorities with these modules? Can some be deferred initially?

aufdenkampe commented 3 years ago

@rheaphy, thanks for all your deep sleuthing and hard work to figure out these performance issues.

I just noticed that HDF5 1.10.7 was released on Sep. 15, boasting additional performance improvements and full backward compatibility back to v1.10.3. See:

The last release of H5py (v2.10) was Sep. 6, 2019, but it looks like they're about to release v3.0 any day now based on https://github.com/h5py/h5py/issues/1673.

aufdenkampe commented 3 years ago

@rheaphy & @steveskrip

h5py 3.0 was just released, and it has a number of very nice performance features. See https://docs.h5py.org/en/latest/whatsnew/3.0.html.

Performance improvements include:

So HDF5 1.10.7 could get us back closer to v1.8 performance, but v1.12 may have other advantages that make it worthwhile.

NOTE: The new h5py 3.0 might require some recoding, given this feature change:

aufdenkampe commented 3 years ago

@rheaphy, I've recently learned that there are now several newer high-performance data storage formats that have equal or better performance than HDF5 and are much better suited to cloud applications. See HDF in the Cloud: challenges and solutions for scientific data.

The most established of these is Parquet, which beats HDF5 in most metrics in this blog: The Best Format to Save Pandas Data. Parquet is integrated nicely with Pandas. However, I don't think it handles multi-dimensional data very well, although I'm not sure we strictly need that. It's even possible that breaking up the input/output into multiple parquet files might have an advantage, including for storage on GitHub.
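
Switching a pandas DataFrame from HDF5 to Parquet is essentially a one-line change on the I/O side. A sketch, assuming the pyarrow (or fastparquet) engine is installed and using an illustrative file name:

```python
import numpy as np
import pandas as pd

# Hypothetical results table
df = pd.DataFrame(np.random.rand(100_000, 8),
                  columns=[f"var{i}" for i in range(8)])

# Parquet round trip; compression is built in ('snappy' is the default).
df.to_parquet("results.parquet", compression="snappy")
df2 = pd.read_parquet("results.parquet")
```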

The Pangeo geoscience big-data initiative has moved toward converting netCDF files to the Zarr format. See Pangeo's Data in the Cloud page. Pangeo's Xarray library is designed as a multi-dimensional equivalent to Pandas, and it seamlessly reads/writes netCDF & Zarr formats, in addition to pulling data directly from NOAA THREDDS data servers, such as those used for distributing climate data and National Water Model outputs.
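
For multi-dimensional results, the equivalent Xarray round trip between netCDF and Zarr looks roughly like this. A sketch with made-up variable and segment names; it assumes the netcdf4 and zarr packages are installed:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small illustrative dataset: one variable over time and segment
ds = xr.Dataset(
    {"flow": (("time", "segment"), np.random.rand(365, 3))},
    coords={"time": pd.date_range("2020-01-01", periods=365),
            "segment": ["R001", "R002", "R003"]},
)

ds.to_netcdf("results.nc")         # classic netCDF file
ds.to_zarr("results.zarr")         # cloud-friendly Zarr store
ds2 = xr.open_zarr("results.zarr") # lazy, chunked read back
```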

Last, we are starting to explore the use of Dask to parallelize our data engine systems. Dask is another core library of Pangeo, and it works well with Numba. I don't think this is necessarily the next step for HSP2, but I think it is where we might want to head, and it's worth mentioning sooner rather than later so that we can start moving in the right direction.

aufdenkampe commented 2 years ago

With the following, we believe that we've addressed most of the problems described in this issue.