slaclab / lc2-hdf5-110

Investigate hdf5 1.10 features like SWMR and virtual dataset for LCLS II
Apache License 2.0

How to properly handle vlen data with SWMR #2

Open davidslac opened 7 years ago

davidslac commented 7 years ago

While HDF5 has a vlen type, you cannot use it with SWMR. An alternative is to use two datasets for vlen data: one aligned with the shots, holding two values per shot (the start/stop positions), and a second 'blob' dataset that those positions index into. I posted to the HDF5 forum here:

http://hdf-forum.184993.n3.nabble.com/vlen-data-and-SWMR-td4029470.html

Text below:

SWMR doesn't support vlen, and we want to make vlen data available while writing hdf5. Right now, I see a decent way to encode the vlen data in typical datasets, which I'll explain below. My question is, what is the best way to get vlen in SWMR? Which way will be easiest for users to work with the data? Working with or funding the hdf5 group to develop vlen for SWMR might be the answer (or maybe this feature is already in development?). However, I think users find the vlen data types difficult to work with through h5py and Matlab.

The real advantage of vlen, though, is that we could write a row-based schema where each row corresponds to a shot. If our data acquisition system records data for three shots and reduces each shot to a list of features, we could, if we had the vlen type in SWMR, write one dataset like:

DATA 
0 [a0, a1, a2] 
1 [b0, b1, b2, b3, b4] 
2 [c0, c1] 

that is, three rows (which I'm labeling 0, 1, 2), each corresponding to one of the three events, with all of that event's features in the row (a for the first event, b for the second, etc).

To simulate vlen for SWMR, I'm thinking of two datasets. One is aligned with the shots and stores, for each shot, the range where its features sit in a 'blob' dataset, as a start index and an exclusive stop index. That is:

RANGE 
0 [0,3] 
1 [3,8] 
2 [8,10]

BLOBDATA
0 a0 
1 a1 
2 a2 
3 b0 
4 b1 
5 b2 
6 b3 
7 b4 
8 c0 
9 c1 
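The layout above can be sketched in plain Python, independent of HDF5. This is only an illustration of the encoding: the function names `encode_ragged` and `decode_row` are hypothetical, and lists stand in for the two datasets.

```python
def encode_ragged(rows):
    """Flatten variable-length rows into (ranges, blob).

    Each range is (start, stop) with 0-up counting and an exclusive stop,
    matching the RANGE / BLOBDATA layout described in the post.
    """
    ranges = []
    blob = []
    for row in rows:
        start = len(blob)
        blob.extend(row)
        ranges.append((start, len(blob)))
    return ranges, blob


def decode_row(ranges, blob, i):
    """Recover row i from the flattened representation."""
    start, stop = ranges[i]
    return blob[start:stop]


rows = [
    ["a0", "a1", "a2"],
    ["b0", "b1", "b2", "b3", "b4"],
    ["c0", "c1"],
]
ranges, blob = encode_ragged(rows)
print(ranges)                       # [(0, 3), (3, 8), (8, 10)]
print(decode_row(ranges, blob, 1))  # ['b0', 'b1', 'b2', 'b3', 'b4']
```

A shot with no features encodes as a range whose start equals its stop, so no special case is needed.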

Then an h5py user does

r0, r1 = range_ds[1, :]
features_event_1 = blobdata_ds[r0:r1]
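For context, here is a minimal sketch of how the two datasets might be written and read with h5py in SWMR mode. The file name, the dataset names `range` and `blobdata`, the chunk sizes, the float features, and the helpers `append_shot` and `read_shot` are all assumptions made for illustration; only the RANGE/BLOBDATA idea comes from the post. The sketch flushes the blob before the range so that a reader should not see a range pointing past the data it can read, which is one possible ordering rather than an established recipe. In real use the reader would be a separate process polling while the writer is still running; here the read happens after the writer closes so the example runs in a single process.

```python
import os
import tempfile

import h5py
import numpy as np


def append_shot(range_ds, blob_ds, features):
    """Append one shot: grow the blob first, then publish its range."""
    features = np.asarray(features, dtype=blob_ds.dtype)
    start = blob_ds.shape[0]
    stop = start + features.shape[0]
    if stop > start:
        blob_ds.resize((stop,))
        blob_ds[start:stop] = features
    blob_ds.flush()
    row = range_ds.shape[0]
    range_ds.resize((row + 1, 2))
    range_ds[row, :] = (start, stop)
    range_ds.flush()


def read_shot(range_ds, blob_ds, shot):
    """Read the features of one shot as a plain numpy array."""
    # Refresh the range before the blob, mirroring the writer's flush order.
    range_ds.refresh()
    blob_ds.refresh()
    start, stop = (int(v) for v in range_ds[shot, :])
    return blob_ds[start:stop]


# Hypothetical file location and example data.
path = os.path.join(tempfile.mkdtemp(), "vlen_swmr_sketch.h5")
shots = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0, 8.0], [9.0, 10.0]]

# SWMR needs the latest file format, and datasets must exist
# before SWMR mode is switched on.
with h5py.File(path, "w", libver="latest") as writer:
    range_ds = writer.create_dataset(
        "range", shape=(0, 2), maxshape=(None, 2), dtype="i8", chunks=(1024, 2)
    )
    blob_ds = writer.create_dataset(
        "blobdata", shape=(0,), maxshape=(None,), dtype="f8", chunks=(4096,)
    )
    writer.swmr_mode = True
    for features in shots:
        append_shot(range_ds, blob_ds, features)

with h5py.File(path, "r", libver="latest", swmr=True) as reader:
    shot1 = read_shot(reader["range"], reader["blobdata"], 1)

print(shot1)
```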

On the h5py side, the user is just dealing with numpy arrays of basic types. With the hdf5 vlen type, they have to work with an object-based type introduced to handle vlen data, and it gets messy depending on what you are doing.

Similarly, on the Matlab side, users, I think, have to deal with cell arrays, which I don't think they have to do otherwise (I don't use Matlab much).
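To illustrate the h5py half of that comparison, here is a small sketch of the native vlen type without SWMR. The file name and the dataset name `data` are hypothetical, and `h5py.vlen_dtype` assumes a reasonably recent h5py (older versions spell this `h5py.special_dtype(vlen=...)`). Reading the whole dataset gives an object array whose elements are themselves arrays, rather than one array of a basic type.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location and example data.
path = os.path.join(tempfile.mkdtemp(), "vlen_sketch.h5")
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0, 8.0], [9.0, 10.0]]

with h5py.File(path, "w") as f:
    vlen_f8 = h5py.vlen_dtype(np.dtype("float64"))
    data = f.create_dataset("data", shape=(len(rows),), dtype=vlen_f8)
    for i, row in enumerate(rows):
        data[i] = np.asarray(row, dtype="float64")

with h5py.File(path, "r") as f:
    everything = f["data"][...]  # object array, one inner array per row
    second_row = f["data"][1]    # a single float64 array

print(everything.dtype.kind)  # 'O' marks a numpy object array
print(second_row)
```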

One disadvantage of the two datasets, RANGE and BLOBDATA, is that we have to choose between 0-up and 1-up counting. We'll do 0-up, but then the Matlab/Fortran/Julia users who use 1-up indexing have to adjust.
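The adjustment is small if the stored ranges are 0-up with an exclusive stop, as in the example above: a 1-up language with inclusive ranges adds one to the start and keeps the stop. The sketch below shows the arithmetic in Python; `to_one_up_inclusive` is a hypothetical helper name, not part of any library.

```python
def to_one_up_inclusive(start, stop):
    """Convert a 0-up [start, stop) range to a 1-up inclusive (first, last).

    Only the start shifts: 0-up indices start..stop-1 become
    1-up indices start+1..stop.
    """
    return start + 1, stop


stored = [(0, 3), (3, 8), (8, 10)]
for first, last in (to_one_up_inclusive(s, e) for s, e in stored):
    print(first, last)
# 1 3
# 4 8
# 9 10
```

An empty shot, stored with start equal to stop, converts to a pair whose first index is one greater than its last, which languages such as Matlab treat as an empty range.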