In Bob's June 12 "HSP2 Status" email, he writes in response to PaulD's issues with viewing results tables in HDFView:
In order to improve the performance, I switched to a faster HDF5 read/write method for computed results and added blosc compression (which significantly speeds things up). I believe that the slowdown is due to the extra data movement in the new Numba dictionaries, which I hope is improved soon. Part of the slowdown is also due to https://github.com/numba/numba/issues/5713, which is being worked on: "List() and Dict() always recompile _getitem,_setitem, _length, etc. every time; maybe should cache? #5713"
- I can easily add an option to the main program which writes the old way, but leaves the faster I/O as the default
- You can modify the line doing the results writing (main(), about line 166) as a temporary patch (see the sketch after this list):
df.to_hdf(store, path, complib='blosc', complevel=9)
to: df.to_hdf(store, path, format='t', data_columns=True)
- I can revert to the old behavior, but at a very significant slow down at this time.
- I can (and will) see if I can't combine the old HDF5 format with compression and perhaps gain some performance back. But HDFView will not work on the computed results (see below).
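For concreteness, here is a minimal sketch of the two write paths described above; the store filename, dataset path, and column names are hypothetical stand-ins, not HSP2's actual ones:

```python
import numpy as np
import pandas as pd

store = 'results.h5'                   # hypothetical HDF5 store
path = '/RESULTS/PERLND_P001/PWATER'   # hypothetical dataset key

# An hourly results table for one year.
df = pd.DataFrame(np.random.rand(8760, 3), columns=['PERO', 'AGWO', 'IFWO'])

# Fast default: fixed-format storage compressed with blosc.
# HDFView cannot display this unless the blosc filter is registered.
df.to_hdf(store, path, complib='blosc', complevel=9)

# Temporary patch -- swap in this line instead to restore the old,
# HDFView-readable (but slower) PyTables table format:
# df.to_hdf(store, path, format='t', data_columns=True)
```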
When compression is used, the HDFView program will not show the data unless the compression algorithm is registered with the HDFView code. I thought this would be too complicated for ordinary users. I had hoped that utilities like ViewPERLND, ViewIMPLND, and ViewRCHRES would be the way ordinary users could view and analyze their results. I am working on a convenience function to extract data (with optional aggregation to common intervals such as day, month, year, etc.) without the user needing more than the name of the data and the segment ID. The View utilities would be updated to use this new tool.
Note: When using Pandas to read and write HDF5 data, the compression is essentially invisible to the user since Pandas (via Pytables internally) looks at the HDF5 metadata it wrote to the associated dataset to determine the format and optional compression algorithm used so it can extract the data appropriately.
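In other words, the read side needs no compression arguments at all. A minimal sketch (file and key names hypothetical):

```python
import pandas as pd

# PyTables recorded the complib/complevel it used in the dataset's HDF5
# metadata, so the same call reads compressed and uncompressed data alike.
df = pd.read_hdf('results.h5', '/RESULTS/PERLND_P001/PWATER')
```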
I am also looking at the feasibility of continuing the work that stopped on this JupyterLab project when the author's employer hit him with critical deadlines on his day job: https://github.com/jupyterlab/jupyterlab-hdf5. It doesn't work with the Pandas DataFrame-like display and needs to be updated to JupyterLab 2+. I had put this at lower priority than adding the water quality modules. I was considering a second possibility: provide a routine that displays Pandas DataFrames (and Pandas Series) in JupyterLab with a grid tool to allow easy editing of data (like a vastly improved Qgrid). It would be based on the Phosphor/Lumino data grid.
Please provide feedback on the best option. Thanks!
From Bob's July 13 "HSP2 Status" email:
Status Update:
- HSP2 Performance problem understood,
- HSP2 punch list nearly complete - hoping for feedback,
- I plan to start work on all water quality modules by mid-week.
Details:
PERFORMANCE PROBLEMS
When I completed the rewrite of HSP2 to Python 3, I was shocked by the performance degradation. Previously, HSP2 was only slightly slower than HSPF after the first run in a session forced the Numba compilation. The first run in a new session was significantly worse since only some of the Numba-compiled code was cached for reuse and needed some recompilation. Numba changes after the first HSP2 version included much better caching, but this required making changes to HSP2 code. I expected significant performance improvements in the new version, but it was 2-3x slower. I now understand the reason, which was not clear from the profiling tools.
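As an aside, opting into Numba's on-disk caching is typically a one-line change per jitted function; an illustrative sketch (not HSP2's actual code):

```python
from numba import njit

# cache=True persists the compiled machine code to disk, so a new
# session can reuse it instead of paying the JIT compile cost again.
@njit(cache=True)
def accumulate(ts):
    total = 0.0
    for v in ts:
        total += v
    return total
```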
HDF5 released a significant new version, 1.10.x, which added welcome features like reusing freed space (within a session). Unfortunately, it came with a very significant slowdown (up to a factor of 3x)! I finally found HDF5 presentations last week which explained this problem, and I am attaching a slide from one of these presentations. Note: for some reason, the HDF Group bumps major releases by 2, so the versions go from (say) 1.6.x to 1.8.x to 1.10.x to 1.12.x, which they simply call 1.8, etc. in their presentations.
HDF5 is automatically installed (with a Python wrapper) when installing pytables or h5py. The user is not given a choice of which HDF5 DLL version is used, since the HDF5 API does change between versions (even minor versions prior to 1.10). The original HSP2 used HDF5 version 1.8, which generally was only slightly slower than earlier versions like 1.6. Apparently, some projects needing the many new features in the 1.10 release - and providing grants for this work - accepted the slowdown, temporarily. Since HDF5 is critical to supercomputers and projects with very large data, it had always been taken for granted that performance would differ only slightly between versions, since data storage and retrieval can never be fast enough.
When I rebuilt HSP2, pytables had been updated and now used 1.10.5, which suffers from the terrible performance hit. The Python profilers can't look into the actual C-language HDF5 DLL, so the slowdown showed up only in the pytables routines that made HDF5 calls. But pytables is actually installed and called by Pandas, so it wasn't clear where the problem was located. Pandas went through very significant code development between the first and second HSP2 versions, so I had assumed that the problem was due to changes in Pandas, since they have frequent (but temporary) performance problems, and also due to Numba, which made many changes as well.
Although I was able to improve HSP2 performance with BLOSC compression and by changing from storage in PyTables tables to raw storage, there was still a significant gap between the original HSP2 and the new version. (But at least it is acceptable for now.)
I just installed a new version of pytables (and h5py) using HDF5 1.10.6 which claims some performance improvement. See https://www.hdfgroup.org/2019/12/release-of-hdf5-1-10-6-newsletter-170/ Of course, it took time for pytables and h5py to update to this new version. By their standards all the data saved by HSP2 is probably "small", so we might benefit. But this needs testing. I plan to spend the first part of this coming week checking if this will make any significant improvement.
The original HSP2 used both h5py and pytables (pytables via Pandas). I stopped this during the development of the new version because the number of user warnings was alarming if both were used in the same program. It had not started out this way. I now understand it was because pytables and h5py didn't always install the same HDF5 DLL version. This was such a problem that h5py stopped using the globally accessible HDF5 DLL and instead installed the DLL privately inside its own code package. So I can (possibly, I hope) use both in HSP2 again. Reading the HDF5 file's table of contents (the "keys") was much slower in pytables, so I can possibly gain more performance by getting this information using h5py. I had also used h5py to write arbitrary text data into the HSP2 HDF5 file, so I might be able to do this again. (Python 3 stopped using ASCII characters and uses utf8-encoded Unicode. HDF5 doesn't do well with Unicode, so both h5py and pytables use some workarounds to make this possible. This might make the data storable and retrievable, but not visible to HDFView, which is not ideal.)
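A rough sketch of the two ways to walk a file's table of contents (filename hypothetical); the h5py path is the one suggested above as potentially faster:

```python
import h5py
import pandas as pd

# Fast, read-only walk of every group/dataset path with h5py.
with h5py.File('results.h5', 'r') as f:
    keys = []
    f.visit(keys.append)   # collects the name of every object in the file

# The same information via Pandas/PyTables, which proved much slower.
with pd.HDFStore('results.h5', mode='r') as store:
    keys_pt = store.keys()
```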
HDFGroup had originally announced that the 1.12 HDF5 version would use Unicode (utf8 encoded), but they were not able to make this work in the original 1.12 release. This would make everyone's life much easier if/when they get this done. But again, they got a big grant for some other important features, so they delivered those and deferred the Unicode issue which they say is difficult. They seem to be working on HDF5 1.10 and 1.12 in parallel now.
I have researched how to determine which version of HDF5 is in use, and I will now print it with the other software version numbers after each HSP2 run. This will make future tracking of this issue easier. It is not possible to determine which version of HDF5 was used to write the data, which is a "feature" since the HDF Group advertises forward and backward compatibility between versions (with asterisks in their presentation slides).
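One way to query the linked HDF5 library version from Python (assuming both wrappers are installed):

```python
import tables
import h5py

# Each wrapper reports the HDF5 C library it was built against; the two
# can differ, which is exactly the mismatch described above.
print('PyTables linked HDF5:', tables.hdf5_version)
print('h5py linked HDF5:    ', h5py.version.hdf5_version)
```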
The HDF Group has released new versions of 1.10 and 1.12, so their stated short-term goals should now focus on performance improvement.
PUNCH LIST
I have only a few items left on my punch list before tagging an HSP2 version 1.0 alpha. I was waiting to check these in along with any feedback about anything that would make HSP2 better. I will defer the tagging until I complete the HDF5 1.10.6 work - so it is NOT TOO LATE to provide feedback.
WATER QUALITY MODULES
I plan to start work on all modules next week (depending on the time the HDF5 1.10.6 work takes). The first phase is to modify each file to get its data from the HDF5 file. It is possible that I might discover the need for some slight modifications to the HSP2 HDF5 file format (and to the readUCI module that creates it), but I hope I got most of this right already. I plan for this pass to take one day per module. The second pass will focus on removing any Python errors or warnings and bringing each module to reasonable readability and maintainability. This pass will also take about one day per module. (There are about 18 modules in the development branch.)
I will need the new test cases to check this work in the third pass. The sooner the better.
Any priorities with these modules? Can some be deferred initially?
@rheaphy, thanks for all your deep sleuthing and hard work to figure out these performance issues.
I just noticed that HDF5 1.10.7 was released on Sep. 15, boasting additional performance improvements and full backward compatibility back to v1.10.3.
The last release of H5py (v2.10) was Sep. 6, 2019, but it looks like they're about to release v3.0 any day now based on https://github.com/h5py/h5py/issues/1673.
@rheaphy & @steveskrip
h5py 3.0 was just released, and it has a number of very nice performance features. See https://docs.h5py.org/en/latest/whatsnew/3.0.html.
So HDF5 1.10.7 could get us back closer to v1.8 performance, but v1.12 may have other advantages that make it worthwhile.
NOTE: The new h5py 3.0 might require some recoding, given the feature changes described in its release notes.
@rheaphy, I've recently learned that there are now several newer high-performance data storage formats that offer equal or better performance than HDF5 and are much better suited to cloud applications. See HDF in the Cloud: challenges and solutions for scientific data.
The most established of these is Parquet, which beats HDF5 in most metrics in this blog: The Best Format to Save Pandas Data. Parquet is integrated nicely with Pandas. However, I don't think it handles multi-dimensional data very well, although I'm not sure we strictly need that. It's even possible that breaking up the input/output into multiple parquet files might have an advantage, including for storage on GitHub.
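As a quick illustration of how little code the Pandas side would need (filenames and columns are hypothetical):

```python
import numpy as np
import pandas as pd

# A results-table-sized DataFrame, one row per hour for a year.
df = pd.DataFrame(np.random.rand(8760, 3), columns=['PERO', 'AGWO', 'IFWO'])

# Pandas reads/writes Parquet natively (pyarrow or fastparquet underneath),
# with columnar compression applied by default.
df.to_parquet('pwater.parquet')
df2 = pd.read_parquet('pwater.parquet')
```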
The Pangeo geoscience big data initiative has moved toward converting netCDF files to the Zarr format. See Pangeo's Data in the Cloud page. Pangeo's Xarray library is designed as a multi-dimensional equivalent to Pandas, and it seamlessly reads/writes the netCDF & Zarr formats, in addition to pulling data directly from NOAA THREDDS data servers, such as those used for distributing climate data and National Water Model outputs.
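A minimal round-trip sketch with Xarray and Zarr (all names illustrative, assuming the zarr package is installed):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A labeled hourly time series wrapped in an xarray Dataset.
ds = xr.Dataset(
    {'PERO': ('time', np.random.rand(8760))},
    coords={'time': pd.date_range('2001-01-01', periods=8760, freq='H')},
)

# Write to a Zarr store on disk, then lazily reopen it.
ds.to_zarr('pwater.zarr', mode='w')
ds2 = xr.open_zarr('pwater.zarr')
```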
Last, we are starting to explore the use of Dask to parallelize our data engine systems. Dask is another core library of Pangeo, and it works well with Numba. I don't think this is necessarily the next step for HSP2, but I think it is where we might want to head, and it's worth mentioning sooner rather than later so that we can start moving in the right direction.
With the following, we believe that we've addressed most of the problems described in this issue:
- environment_dev.yml file introduced in commit 1445a7618ce72d1bb873b7ec6d18a6cd82fb0599
@rheaphy emailed with his 2020-05-24 "HSP2 status update":
Some of these recent commits are https://github.com/respec/HSPsquared/pull/32/commits/cca2b0cc6240e60204a5f3baf06938cc4aa3e816, https://github.com/respec/HSPsquared/pull/32/commits/d154e55ac3f2c6140443262a08d9f1d3abe53470, and https://github.com/respec/HSPsquared/pull/32/commits/e92c035536c124a1794d2458e0b9c7ccc0bb8244.