Investigate performance of flat vs deep HDF5 hierarchies

jrieke commented 10 years ago

In the meeting @tarelli and I had with the group from the Wójcik lab who work on NSDF (Neuroscience Simulation Data Format), they pointed out that HDF5 is less performant when storing datasets in an arbitrary group hierarchy (as in Geppetto's format) vs a flat hierarchy (as in NSDF).

jrieke commented 10 years ago

Pfeiffer et al. 2012 ran performance tests of storing variable information and metadata (not the simulation data itself) in a deep hierarchy (the variable hierarchy maps 1:1 to the HDF5 hierarchy) vs. a flat hierarchy (all information is stored in one or a few datasets). For 7000 variables, their results were:

- file size: deep is up to twice as large as flat
- reading all variables in MATLAB: deep takes 5.5 to 75 s (depending on the algorithm), flat only 0.1 s

tarelli commented 10 years ago

@jrieke really interesting, thanks for doing that! What was the outcome of that paper? Did the effort lead to any standard for time series we are not aware of?

jrieke commented 10 years ago

I have now run several performance tests myself. The underlying model was the (fictional) leisure center with several pool tables and balls that @tarelli sketched for demonstration purposes here.

Setup: 200 tables, 200 balls per table, 1000 time steps, only x position. The time series were created with numpy.random.rand before measuring the runtime.

HDF5 file structure of the deep hierarchy:

    /leisurecenter
        /poolroom
            /table0
                /x (1D array of time)
            /table1
                /x (1D array of time)
            ...

HDF5 file structure of the flat hierarchy:

    /data
        /leisurecenter.poolroom.table0.balls.x (2D array of balls and time)
        /leisurecenter.poolroom.table1.balls.x (2D array of balls and time)
        ...

Note: All balls on one table make up one population (as it is called in NSDF), assuming they were simulated with the same time step.
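A minimal sketch of how such a comparison could be set up with h5py and NumPy. It is not the script used for the measurements in this thread: the sizes are scaled down so it runs quickly, the per-ball `ballN` group level in the deep layout is an assumption, and the helper names (`write_deep`, `write_flat`, `read_all`, `timed`) are made up for illustration.

```python
import os
import tempfile
import time

import h5py
import numpy as np

# Scaled-down sizes; the thread used 200 tables, 200 balls, 1000 time steps.
N_TABLES, N_BALLS, N_STEPS = 5, 10, 100


def write_deep(path, data):
    """One 1D dataset per ball (the ball level is an assumption of this sketch)."""
    with h5py.File(path, "w") as f:
        for t in range(data.shape[0]):
            for b in range(data.shape[1]):
                name = f"/leisurecenter/poolroom/table{t}/ball{b}/x"
                # h5py creates missing intermediate groups automatically.
                f.create_dataset(name, data=data[t, b])


def write_flat(path, data):
    """One 2D (balls, time) dataset per table, all directly under /data."""
    with h5py.File(path, "w") as f:
        grp = f.create_group("data")
        for t in range(data.shape[0]):
            grp.create_dataset(f"leisurecenter.poolroom.table{t}.balls.x", data=data[t])


def read_all(path):
    """Visit every dataset in the file and load it fully into memory."""
    loaded = {}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            loaded[name] = obj[()]

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return loaded


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Create the time series before measuring, as described in the setup.
data = np.random.rand(N_TABLES, N_BALLS, N_STEPS)

with tempfile.TemporaryDirectory() as tmp:
    deep_path = os.path.join(tmp, "deep.h5")
    flat_path = os.path.join(tmp, "flat.h5")

    _, deep_write = timed(write_deep, deep_path, data)
    _, flat_write = timed(write_flat, flat_path, data)
    deep, deep_read = timed(read_all, deep_path)
    flat, flat_read = timed(read_all, flat_path)
    deep_size = os.path.getsize(deep_path)
    flat_size = os.path.getsize(flat_path)

print(f"write: deep {deep_write:.3f} s, flat {flat_write:.3f} s")
print(f"read:  deep {deep_read:.3f} s, flat {flat_read:.3f} s")
print(f"size:  deep {deep_size} B, flat {flat_size} B")
```

Timings from a run this small say little in themselves; raising the three constants to the values from the setup is what would make the numbers comparable.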

Results

| | deep hierarchy | flat hierarchy |
| --- | --- | --- |
| Writing the entire file | ca. 11 s | ca. 0.2 to 0.5 s* |
| File size | 359 MB | 305 MB |
| Reading the entire file | ca. 13 s | ca. 0.4 s |

* Depends on whether the time series are already present in a 2D array of (balls, time) or whether this 2D array needs to be constructed from 1D arrays of (time) for each ball.

Conclusion

The flat hierarchy is much faster than the deep hierarchy: roughly 20 to 55 times faster when writing and about 30 times faster when reading. Some additional time will be required to store the model hierarchy in the flat case, but that should not add much. Concerning file size, there is not much difference yet, but Pfeiffer et al. 2012 (see above) state in section 4.2 that compressing the file does not really work with a deep hierarchy.
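As a back-of-the-envelope check of the numbers in the results table, the arithmetic can be written out as below. Two things are assumed here and not stated in the thread: that "MB" means MiB (1024² bytes), and that the deep file holds one dataset per ball. `numpy.random.rand` does return 8-byte float64 values.

```python
# Setup from the thread: 200 tables, 200 balls per table, 1000 time steps.
N_TABLES, N_BALLS, N_STEPS = 200, 200, 1000
BYTES_PER_VALUE = 8  # float64, the dtype returned by numpy.random.rand
MIB = 1024 ** 2      # assumption: "MB" in the table means MiB

# Raw payload without any HDF5 metadata.
raw_bytes = N_TABLES * N_BALLS * N_STEPS * BYTES_PER_VALUE
raw_mib = raw_bytes / MIB

# Reported file sizes.
deep_mib, flat_mib = 359, 305

# Assumption: the deep layout stores one dataset per ball.
n_deep_datasets = N_TABLES * N_BALLS
overhead_per_dataset = (deep_mib - flat_mib) * MIB / n_deep_datasets

# Speedups of flat over deep, from the reported timings in seconds.
write_speedup_min = 11 / 0.5
write_speedup_max = 11 / 0.2
read_speedup = 13 / 0.4

print(f"raw payload: {raw_mib:.1f} MiB")
print(f"deep overhead per dataset: {overhead_per_dataset:.0f} bytes")
print(f"write speedup: {write_speedup_min:.0f}x to {write_speedup_max:.0f}x")
print(f"read speedup: {read_speedup:.1f}x")
```

Under these assumptions the flat file is almost pure payload, and the extra size of the deep file works out to roughly 1.4 KB of bookkeeping per dataset.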

@tarelli I think this really points us in the direction of a flat hierarchy, either by incorporating NSDF directly or at least by making a similar file structure (probably for non-neuroscience data). What do you think?

jrieke commented 10 years ago

@tarelli Well, they have defined a standard, but I think it's not really applicable for us:

seems to be designed for engineering stuff (Modelica, Simulink and such things)
quite a big overhead in terms of storing data types, variable information, ...
they have only been cited 3 times since 2012, so I would not really consider it a 'standard' ;)
~~as far as I have seen, they have no API available whatsoever~~ I found something here, will look at it later
edit: they do not seem to support variable time steps

slarson commented 10 years ago

On the speed point: do we know what the performance requirement is, so we can evaluate whether the (significant) performance improvement is worth it? Fast is great, but if it is fast versus correct / generic, I think it is worth asking how badly we need fast. Otherwise I fear premature optimization.

jrieke commented 10 years ago

@slarson Although I cannot fully assess it from the Geppetto side, I imagine it would be annoying for me as a Geppetto user to need probably minutes to store a large file (I mean, 1000 time steps as above isn't that much yet, right?) vs. seconds with the flat hierarchy. @tarelli, can you shed some more light on this?

tarelli commented 10 years ago

@slarson the improvements to the NSDF format should give both speed and genericity through arbitrary hierarchies that map to flat datasets.

slarson commented 10 years ago

@tarelli I think I'm missing something. My understanding is that this discussion is comparing the current flat format of NSDF to the hierarchies provided by HDF5 and looking at performance in either case. Are you suggesting some way of doing both flat and hierarchies, and if so, how?

tarelli commented 10 years ago

@slarson so, both the original format and NSDF are based on HDF5. In the original format the data was stored within the hierarchies themselves, while in the revised NSDF format the hierarchy leaves hold a sort of pointer to the flat dataset. You can still find things by exploring a hierarchy, but that is just a hierarchical facade over a flat structure. @jrieke is in the process of evaluating this revised format, for which we now have a first sample.
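To make the "facade over a flat structure" idea concrete, here is a small pure-Python sketch. It does not reproduce the actual NSDF layout: plain dicts and lists stand in for HDF5 groups and datasets, and `Pointer`, `flat_data`, `hierarchy` and `resolve` are hypothetical names.

```python
from typing import NamedTuple


class Pointer(NamedTuple):
    """Hypothetical leaf entry: where a variable lives in the flat storage."""
    dataset: str  # name of the flat 2D (balls, time) dataset
    row: int      # row (ball index) inside that dataset


FLAT_NAME = "leisurecenter.poolroom.table0.balls.x"

# Flat storage: one (balls, time) table per population.
flat_data = {
    FLAT_NAME: [
        [0.1, 0.2, 0.3],  # ball0, three time steps
        [0.4, 0.5, 0.6],  # ball1, three time steps
    ],
}

# Hierarchical facade: nested groups whose leaves only point into flat_data.
hierarchy = {
    "leisurecenter": {
        "poolroom": {
            "table0": {
                "ball0": {"x": Pointer(FLAT_NAME, 0)},
                "ball1": {"x": Pointer(FLAT_NAME, 1)},
            }
        }
    }
}


def resolve(path):
    """Walk the hierarchy along `path` and return the time series it points to."""
    node = hierarchy
    for part in path.strip("/").split("/"):
        node = node[part]
    if not isinstance(node, Pointer):
        raise KeyError(f"{path} is a group, not a leaf")
    return flat_data[node.dataset][node.row]


print(resolve("/leisurecenter/poolroom/table0/ball1/x"))
```

Browsing stays hierarchical, while bulk reads and writes only ever touch the few large flat datasets.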

openworm / org.geppetto.recording

Investigate performance of flat vs deep HDF5 hierarchies #6