VDS SWMR - Ok for Round Robin

It looks like the new hdf5 1.10 virtual dataset features works like this

Define the mapping source datasets and the view datasets in the dataset creation properties
create the view dataset

We would like to use VDS and SWMR for DAQ data. With SWMR, you have to

finish all dataset creation property operations before anyone else can read it.

For our LCLS II DAQ operation, I'm imagining these steps

set up the datasets in the "RAW" files
use the standard Hdf5 call to let the master process, or analysis programs know they can start reading RAW files using SWMR
start appending to the RAW datasets

Likewise, the master process is going to get a chance to take a look at the datasets in the RAW files, and then

setup a mapping to a virtual dataset
use the same standard Hdf5 call to let the analysis programs know they can start reading from the master views using SWMR

(Note, I don't have a prototype of getting this to work yet, still chasing down bugs)

The issue I'm seeing is that, during live data taking, we don't know ahead of time what kind of view we want to setup, that is the mapping between the source datasets and the master dataset.

This is the kind of data that the current VDS/SWMR features look like they will support:

RAW File 0:
  /detectorA/time = [1,3,5,7,...]
  /detectorB/time = [1,3,5,7,...]

RAW File 1
  /detectorA/time = [0,2,4,6,...]
  /detectorB/time = [0,2,4,6,...]

That is, we have two different detectors that are distributed across two files in a predictable, round robin fashion. Since we know how all future data will look, we can create a nice, time aligned master view that will look like

MASTER:
  /detectorA/time = [0,1,2,3,4,5,6,7,...]
  /detectorB/time = [0,1,2,3,4,5,6,7,...]

This will involve a relatively small number of Hdf5 function calls to map the entire, future, datasets of the RAW files to strided (stride=2) selections of the master virtual view.

Aligning Varying Detector Rates

We'd like to align datasets, that is we'd like

MASTER/detectorA/time[i] == MASTER/detectorB/time[i]

So far, this looks like it will work fine, but one challenge to aligning, is that detectors can fire at different rates. That is the RAW files may look like

RAW File 0:
  /detectorA/time = [1,3,5,7,...]
  /detectorB/time = [2,6,10,14,...]

RAW File 1
  /detectorA/time = [0,2,4,6,...]
  /detectorB/time = [0,4,8,12,...]

meaning detectorA is recorded on every shot, but detectorB on every other shot.

We'd like to create an efficient master view by mapping all of [1,3,5,...] of detectorB in the master to a 'null' value somewhere, I think we would do this by either

using a null fill value in the RAW files, or the master view
using VDS to map all the missing entries, i.e, [1,3,5,...] to one entry in a small, "null dataset"
maintaining some extra dataset somewhere that identifies where different detectors are missing

Messy Data

The problem is messy live data, the DAQ will have to drop one or both detectors from a given shot, if we have data like:

RAW File 0:
  /detectorA/time = [1,3,7,...] # note - time 5 missing
  /detectorB/time = [2,10,14,...]  # note, shot 6 missing

RAW File 1
  /detectorA/time = [2,4,6,...]    # note - time 0 missing
  /detectorB/time = [0,4,8,12,...]  # nothing missing

the master process will want to update the VDS mapping while it is looking at the RAW files, that is ideally, we can create a nice view of data while we record it into the RAW files.

That is we have will want, for detectorA,

MASTER:
  /detectorA/time = [0->NULL,
                               1->file:0::row0,
                               2->file:1::row0,
                               3->file:0::row1,
                               4->file:1::row1,
                               5->NULL,
                               6->file:1::row2,
                               7->file:0::row2
                               ...

that is, we won't be following a stride pattern, and won't know what rows of the master virtual view are mapped to what rows of the file:0 file:1 datasets until we read them.

Dataset vs. Metadata VDS Implementation

Ultimately, I think we are looking for a 'dataset' implementation of the VDS - know it is a 'metadata' implementation - meaning you map everything out ahead of time, flush the 'meta-data' cache, and then it should work through SWMR.

You can imagine implementing a the messy VDS described above by writing a datasets in the master file with information like detectorA/time=3, use file 0, and row 1. Then an application the knows this schema can chase through to the correct location in file 0, file 1 etc. I think ideally, this would be a feature of Hdf5.

slaclab / lc2-hdf5-110

VDS and "messy" live data - do we need to create virtual mapping during SWMR writing? #1

VDS SWMR - Ok for Round Robin

Aligning Varying Detector Rates

Messy Data

Dataset vs. Metadata VDS Implementation