slaclab / lc2-hdf5-110

Investigate hdf5 1.10 features like SWMR and virtual dataset for LCLS II
Apache License 2.0

VDS and "messy" live data - do we need to create virtual mapping during SWMR writing? #1

Open davidslac opened 7 years ago

davidslac commented 7 years ago

VDS SWMR - Ok for Round Robin

It looks like the new HDF5 1.10 virtual dataset feature works like this:

  1. Define the mapping between the source datasets and the view dataset in the dataset creation property list
  2. Create the view dataset
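A minimal sketch of those two steps with the HDF5 C API; the file name raw_0.h5, the dataset path /detectorA/time, the int64 type, and the fixed extent are all illustrative assumptions:

```c
#include "hdf5.h"

int main(void)
{
    /* Step 1: record the source -> view mapping in the dataset
     * creation property list. */
    hsize_t dims[1] = {100};                 /* illustrative extent */
    hid_t   vspace  = H5Screate_simple(1, dims, NULL);
    hid_t   sspace  = H5Screate_simple(1, dims, NULL);
    hid_t   dcpl    = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_virtual(dcpl, vspace, "raw_0.h5", "/detectorA/time", sspace);

    /* Step 2: create the view dataset with that property list. */
    hid_t f = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t d = H5Dcreate2(f, "/detectorA/time", H5T_NATIVE_LLONG, vspace,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(d); H5Pclose(dcpl);
    H5Sclose(sspace); H5Sclose(vspace); H5Fclose(f);
    return 0;
}
```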

We would like to use VDS and SWMR for DAQ data. With SWMR, you have to create all objects and finish all metadata operations before switching the file into SWMR write mode; after that, a writer can only append raw data.

For our LCLS II DAQ operation, I'm imagining these steps: the DAQ writer processes create the RAW files, switch them into SWMR mode, and append detector data as it arrives.

Likewise, the master process is going to get a chance to take a look at the datasets in the RAW files, and then define the virtual mapping from the master view into them.

(Note: I don't have a prototype of this working yet; still chasing down bugs.)

The issue I'm seeing is that, during live data taking, we don't know ahead of time what kind of view we want to set up, that is, the mapping between the source datasets and the master dataset.

This is the kind of data that the current VDS/SWMR features look like they will support:

RAW File 0:
  /detectorA/time = [1,3,5,7,...]
  /detectorB/time = [1,3,5,7,...]

RAW File 1:
  /detectorA/time = [0,2,4,6,...]
  /detectorB/time = [0,2,4,6,...]

That is, we have two different detectors that are distributed across two files in a predictable, round-robin fashion. Since we know how all future data will look, we can create a nice, time-aligned master view that will look like

MASTER:
  /detectorA/time = [0,1,2,3,4,5,6,7,...]
  /detectorB/time = [0,1,2,3,4,5,6,7,...]

This will involve a relatively small number of HDF5 function calls to map the entire future datasets of the RAW files to strided (stride=2) selections of the master virtual view.
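For example, here is a sketch of that stride-2, unlimited mapping with the C API, following the unlimited-selection pattern HDF5 1.10 added for VDS; the file names, dataset path, and int64 time type are assumptions:

```c
#include "hdf5.h"

int main(void)
{
    /* Virtual and source dataspaces both start empty and can grow
     * without limit, so the mapping covers all future data. */
    hsize_t zero[1]  = {0};
    hsize_t unlim[1] = {H5S_UNLIMITED};
    hid_t vspace = H5Screate_simple(1, zero, unlim);
    hid_t sspace = H5Screate_simple(1, zero, unlim); /* default selection: all */

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t stride[1] = {2}, count[1] = {H5S_UNLIMITED}, block[1] = {1};
    hsize_t start[1];

    /* RAW file 1 holds times 0,2,4,... -> even master rows. */
    start[0] = 0;
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, stride, count, block);
    H5Pset_virtual(dcpl, vspace, "raw_1.h5", "/detectorA/time", sspace);

    /* RAW file 0 holds times 1,3,5,... -> odd master rows. */
    start[0] = 1;
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, stride, count, block);
    H5Pset_virtual(dcpl, vspace, "raw_0.h5", "/detectorA/time", sspace);

    hid_t f = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t d = H5Dcreate2(f, "/detectorA/time", H5T_NATIVE_LLONG, vspace,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(d); H5Pclose(dcpl);
    H5Sclose(sspace); H5Sclose(vspace); H5Fclose(f);
    return 0;
}
```

Two H5Pset_virtual calls cover both files forever, which is what makes the round-robin case cheap.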

Aligning Varying Detector Rates

We'd like to align the datasets; that is, we'd like

MASTER/detectorA/time[i] == MASTER/detectorB/time[i]

So far, this looks like it will work fine, but one challenge to aligning is that detectors can fire at different rates. That is, the RAW files may look like

RAW File 0:
  /detectorA/time = [1,3,5,7,...]
  /detectorB/time = [2,6,10,14,...]

RAW File 1:
  /detectorA/time = [0,2,4,6,...]
  /detectorB/time = [0,4,8,12,...]

meaning detectorA is recorded on every shot, but detectorB on every other shot.

We'd like to create an efficient master view by mapping all of [1,3,5,...] of detectorB in the master to a 'null' value somewhere. I think we would do this by either mapping those rows to a NULL source dataset, or leaving them unmapped and relying on a fill value.

Messy Data

The problem is messy live data: the DAQ will have to drop one or both detectors from a given shot. If we have data like:

RAW File 0:
  /detectorA/time = [1,3,7,...]     # note: time 5 missing
  /detectorB/time = [2,10,14,...]   # note: time 6 missing

RAW File 1:
  /detectorA/time = [2,4,6,...]     # note: time 0 missing
  /detectorB/time = [0,4,8,12,...]  # nothing missing

the master process will want to update the VDS mapping while it is looking at the RAW files; that is, ideally we can create a nice view of the data while we record it into the RAW files.

That is, we will want, for detectorA,

MASTER:
  /detectorA/time = [0->NULL,
                     1->file:0::row0,
                     2->file:1::row0,
                     3->file:0::row1,
                     4->file:1::row1,
                     5->NULL,
                     6->file:1::row2,
                     7->file:0::row2,
                     ...]

that is, we won't be following a stride pattern, and we won't know which rows of the master virtual view map to which rows of the file:0 and file:1 datasets until we read them.
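To make the cost concrete, here is a hedged sketch of what that would take with today's API: one single-element H5Pset_virtual call per event, driven by a per-event lookup table (src_file, src_row) that is my assumption and is only known after reading the RAW timestamps:

```c
#include "hdf5.h"

#define NO_SOURCE (-1)  /* hypothetical marker: shot dropped from both files */

/* One mapping call per master row; nothing here follows a stride. */
static void map_messy_rows(hid_t dcpl, hsize_t n_events, hsize_t n_src_rows,
                           const int *src_file, const hsize_t *src_row)
{
    hsize_t vdims[1] = {n_events};
    hsize_t sdims[1] = {n_src_rows};
    hid_t vspace = H5Screate_simple(1, vdims, NULL);
    hid_t sspace = H5Screate_simple(1, sdims, NULL);
    hsize_t one[1] = {1};

    for (hsize_t row = 0; row < n_events; row++) {
        if (src_file[row] == NO_SOURCE)
            continue;  /* left unmapped: reads back as the fill value */
        hsize_t vstart[1] = {row};
        hsize_t sstart[1] = {src_row[row]};
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, vstart, NULL, one, NULL);
        H5Sselect_hyperslab(sspace, H5S_SELECT_SET, sstart, NULL, one, NULL);
        H5Pset_virtual(dcpl, vspace,
                       src_file[row] == 0 ? "raw_0.h5" : "raw_1.h5",
                       "/detectorA/time", sspace);
    }
    H5Sclose(vspace);
    H5Sclose(sspace);
}
```

Every call adds another entry to the dataset's mapping metadata, which is why this only helps if the mapping can still be updated while SWMR writing is in progress.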

Dataset vs. Metadata VDS Implementation

Ultimately, I think we are looking for a 'dataset' implementation of the VDS; what exists now is a 'metadata' implementation, meaning you map everything out ahead of time, flush the metadata cache, and then it should work through SWMR.

You can imagine implementing the messy VDS described above by writing a dataset in the master file with information like: for detectorA/time=3, use file 0, row 1. Then an application that knows this schema can chase through to the correct location in file 0, file 1, etc. Ideally, I think this would be a feature of HDF5.
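As a sketch of that 'dataset' implementation, the lookup information could live in a compound-typed index dataset in the master file; the record layout and names below are my assumptions, not an existing HDF5 feature:

```c
#include "hdf5.h"

/* Hypothetical per-event index record: a reader that knows this schema
 * chases through it to the right RAW file and row. */
typedef struct {
    long long time;  /* shot timestamp */
    int       file;  /* which RAW file, or -1 for a dropped shot */
    long long row;   /* row within that file's dataset */
} index_rec_t;

static hid_t make_index_type(void)
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(index_rec_t));
    H5Tinsert(t, "time", HOFFSET(index_rec_t, time), H5T_NATIVE_LLONG);
    H5Tinsert(t, "file", HOFFSET(index_rec_t, file), H5T_NATIVE_INT);
    H5Tinsert(t, "row",  HOFFSET(index_rec_t, row),  H5T_NATIVE_LLONG);
    return t;
}
```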

davidslac commented 7 years ago

From correspondence with Barbara through help@hdfgroup.org:

I did hear back from the developer regarding SWMR and VDS...

He read the linked article, and says that for the "variable detector rates" scenario, VDS supports fill values for unmapped parts of the dataset, so you do not need to map unused spots to a NULL dataset.
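That would reduce the 'varying rates' case to setting a fill value on the virtual dataset's creation property list; a sketch, where the -1 sentinel is my assumption:

```c
#include "hdf5.h"

/* Unmapped regions of a VDS read back as the fill value, so missing
 * shots need no NULL source dataset. */
static hid_t make_vds_dcpl_with_fill(void)
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    long long fill = -1;  /* illustrative sentinel for "no data" */
    H5Pset_fill_value(dcpl, H5T_NATIVE_LLONG, &fill);
    /* ... H5Pset_virtual mappings would be added here as above ... */
    return dcpl;
}
```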

The "messy" data implementation is similar to using a region reference dataset to point to elements in the source datasets, which will not perform well. There is more programming work that needs to be done to read through the reference dataset. However, while we could implement something to make this scheme more transparent, it will never perform well.

Unlimited VDS mappings can either be regularly sized blocks or a single extensible block.