In
https://github.com/slaclab/lc2-hdf5-110/tree/vds_problems/questions/vds_fixed
I have the code that I sent to the hdf5 group, in this message:
Question
If you check out the vds_problems tag at the above link, cd into vds_fixed, and do
python driver.py 10000
the driver will make two h5 files with h5py,
srcA.h5
srcB.h5
each has one dataset of 10000 shorts (since we passed 10000 to driver.py)
Then it compiles and runs vds_fixed.cpp. This program creates a file,
master.h5, which has one dataset, a virtual dataset. It was created
to have 20000 elements: all of srcA maps to the even-indexed elements, and all
of srcB to the odd-indexed elements.
1) While each of the files srcA.h5 and srcB.h5 is 22k in size,
master.h5 is 157k. This seems very large, and master.h5 keeps growing in
size as I bump up the 10000 value passed to the driver.
2) More concerning, it takes 20 seconds to read the master.h5 virtual
dataset of 20,000 shorts using h5py. This seems quite long; maybe an
h5py issue?
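For context, an interleaved mapping like the one described in the question above is
typically built with explicit start/stride/count/block hyperslabs, roughly as in the
sketch below. This is not the actual vds_fixed.cpp; the source dataset name "/data",
the element count N, and the use of the HDF5 C API from C++ are assumptions for
illustration.

    #include <hdf5.h>

    int main()
    {
        const hsize_t N = 10000;             // elements in each source file (assumed)
        hsize_t src_dims[1] = {N};
        hsize_t vds_dims[1] = {2 * N};

        hid_t src_space = H5Screate_simple(1, src_dims, nullptr);  // default selection: all of srcX
        hid_t vds_space = H5Screate_simple(1, vds_dims, nullptr);
        hid_t dcpl      = H5Pcreate(H5P_DATASET_CREATE);

        // Every other element of the virtual dataset: stride 2, N blocks of size 1.
        hsize_t start[1], stride[1] = {2}, count[1] = {N}, block[1] = {1};

        start[0] = 0;                        // srcA -> even-indexed elements
        H5Sselect_hyperslab(vds_space, H5S_SELECT_SET, start, stride, count, block);
        H5Pset_virtual(dcpl, vds_space, "srcA.h5", "/data", src_space);

        start[0] = 1;                        // srcB -> odd-indexed elements
        H5Sselect_hyperslab(vds_space, H5S_SELECT_SET, start, stride, count, block);
        H5Pset_virtual(dcpl, vds_space, "srcB.h5", "/data", src_space);

        hid_t file = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_SHORT, vds_space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset); H5Fclose(file); H5Pclose(dcpl);
        H5Sclose(vds_space); H5Sclose(src_space);
        return 0;
    }

Note that each mapping selection here consists of N single-element blocks, which is
exactly the kind of selection the reply below says the legacy encoding stores one
entry per block.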
Reply
I received word back from the developer. It looks like you may be running into
a known "inefficiency" in the code. He says the following:
"I suspect what is going on here is we are running into the inefficient 'version
1/legacy' encoding for regular hyperslabs (used originally for region
references). Instead of the obvious method of storing just the
start/stride/count/block, it stores an entry for each block, leading it to use
lots of space when the blocks are small. The extra time to read is probably due
to having to read the large selection, rebuild it, and check if it is regular.
If he were to use an unlimited selection, these issues should go away, since
those are implemented the first way I described. Let me know if he tries this
and still has problems."
"Changing the file format to use the efficient storage method in the general
case was one of the 'wishlist' items from the original VDS implementation that
was not implemented due to time/money constraints. Other things included using
64 bit values for selection encoding (the 32 bit values have since been causing
problems), robust source file name resolution, and a cache for holding open
some (but not all) of the source datasets."
I asked what "using an unlimited selection" meant ...
When defining the selection, you can use H5S_UNLIMITED in the count or block of
a hyperslab (this is a new feature introduced for VDS). He believes the virtual
space should have H5S_UNLIMITED in the count field, and the source space should
have H5S_UNLIMITED in the block field. This indicates that the selection should
grow to span the entire source dataset, while following the same pattern. The extent
of the VDS probably needs to be unlimited in the same dimension as well.
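Following that advice, here is a hedged sketch of what the unlimited-selection version
of the mapping might look like. Again, this is not the actual code: the dataset name
"/data", the initial extents, and the choice of the HDF5 C API from C++ are assumptions.

    #include <hdf5.h>

    int main()
    {
        const hsize_t N = 10000;                     // current size of each source dataset (assumed)

        // Source space: current extent N, unlimited maximum, and a selection with
        // H5S_UNLIMITED in the block field ("everything, however large the source grows").
        hsize_t src_dims[1] = {N}, src_max[1] = {H5S_UNLIMITED};
        hid_t src_space = H5Screate_simple(1, src_dims, src_max);
        hsize_t s_start[1] = {0}, s_stride[1] = {1}, s_count[1] = {1},
                s_block[1] = {H5S_UNLIMITED};
        H5Sselect_hyperslab(src_space, H5S_SELECT_SET, s_start, s_stride, s_count, s_block);

        // Virtual space: extent unlimited in the same dimension, and a selection with
        // H5S_UNLIMITED in the count field (stride 2, block 1: every other element,
        // repeating without bound).
        hsize_t vds_dims[1] = {2 * N}, vds_max[1] = {H5S_UNLIMITED};
        hid_t vds_space = H5Screate_simple(1, vds_dims, vds_max);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        hsize_t start[1], stride[1] = {2}, count[1] = {H5S_UNLIMITED}, block[1] = {1};

        start[0] = 0;                                // srcA -> even-indexed elements
        H5Sselect_hyperslab(vds_space, H5S_SELECT_SET, start, stride, count, block);
        H5Pset_virtual(dcpl, vds_space, "srcA.h5", "/data", src_space);

        start[0] = 1;                                // srcB -> odd-indexed elements
        H5Sselect_hyperslab(vds_space, H5S_SELECT_SET, start, stride, count, block);
        H5Pset_virtual(dcpl, vds_space, "srcB.h5", "/data", src_space);

        hid_t file = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_SHORT, vds_space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset); H5Fclose(file); H5Pclose(dcpl);
        H5Sclose(vds_space); H5Sclose(src_space);
        return 0;
    }

If the reply is right, each mapping is then stored as a single start/stride/count/block
record rather than one entry per block, so master.h5 should stay small and reads should
no longer have to rebuild and check a huge block-by-block selection.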