lspestrip / striptease

Strip TEst Analysis for System Evaluation
MIT License

Implementation of a HDF5 database #60

Closed ziotom78 closed 2 years ago

ziotom78 commented 2 years ago

This PR implements an HDF5 database in the DataStorage class, which keeps track of all the HDF5 files saved in a directory. It recursively walks sub-directories to find all the HDF5 files and creates an in-memory index with all the timing information.
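
To illustrate the indexing step described above, here is a minimal sketch of a recursive walk that builds an in-memory index sorted by start time. It is not the code in this PR: `FileEntry`, `build_index`, and the `read_range` callback are hypothetical names, and the `*.h5` glob pattern is an assumption about how the files are named. In real use, `read_range` would open each file and return the MJD of its first and last sample.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Tuple


@dataclass
class FileEntry:
    """Hypothetical index record (not the class used in this PR)."""

    path: Path
    first_sample: float  # MJD of the first sample in the file
    last_sample: float  # MJD of the last sample in the file


def build_index(
    root: str, read_range: Callable[[Path], Tuple[float, float]]
) -> List[FileEntry]:
    """Recursively collect every *.h5 file under `root` with its time span.

    `read_range` is a caller-supplied function returning (first_mjd, last_mjd)
    for one file; the "*.h5" pattern is an assumption.
    """
    entries = []
    for path in Path(root).rglob("*.h5"):
        first, last = read_range(path)
        entries.append(FileEntry(path, first, last))
    # Keep the index in chronological order so range queries stay simple
    entries.sort(key=lambda entry: entry.first_sample)
    return entries
```

Sorting by `first_sample` rather than by file name means the index stays usable even if the directory layout does not follow the acquisition order.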

The DataStorage class breaks the barrier between consecutive files, allowing the caller to load scientific data, housekeeping data, and tags regardless of whether they are in one or multiple files. Here is an example:

from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")
# Caution! One whole day of housekeeping data!
times, data = ds.load_hk(
    mjd_range=(59530.0, 59531.0),
    group="BIAS",
    subgroup="POL_R0",
    par="VG1_HK",
)
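
How a single query can span several files can be sketched as follows. This is an illustration only, assuming the index stores the first and last MJD of each file and that each file yields a pair of sequences (times, values); `select_files`, `load_range`, and the `read_file` callback are hypothetical helpers, not methods of DataStorage.

```python
from typing import Callable, List, Sequence, Tuple

# (file name, MJD of first sample, MJD of last sample); hypothetical layout
IndexEntry = Tuple[str, float, float]


def select_files(
    index: Sequence[IndexEntry], mjd_range: Tuple[float, float]
) -> List[str]:
    """Return the files whose time span overlaps `mjd_range`, in time order."""
    start, end = mjd_range
    hits = [entry for entry in index if entry[1] <= end and entry[2] >= start]
    return [name for name, _, _ in sorted(hits, key=lambda entry: entry[1])]


def load_range(
    index: Sequence[IndexEntry],
    mjd_range: Tuple[float, float],
    read_file: Callable[[str], Tuple[Sequence[float], Sequence[float]]],
) -> Tuple[List[float], List[float]]:
    """Concatenate samples from every overlapping file, clipped to the range.

    `read_file(name)` is a caller-supplied function returning (times, values).
    """
    start, end = mjd_range
    times, values = [], []
    for name in select_files(index, mjd_range):
        file_times, file_values = read_file(name)
        for t, v in zip(file_times, file_values):
            if start <= t <= end:
                times.append(t)
                values.append(v)
    return times, values
```

The caller never sees file boundaries: it passes a time range and gets back one continuous pair of sequences.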

The tool relies on valid values of FIRST_SAMPLE and LAST_SAMPLE, which hold the MJD of the first and last sample in an HDF5 file. Unfortunately, it seems that at the moment the data server does not compute these fields, which are always set to zero. If the values are invalid, DataStorage uses some heuristics to determine them, but this is a time-consuming process that takes several seconds per file.

For this reason, the PR includes a new file, fix_hdf5.py, which performs this calculation and writes the correct values back to FIRST_SAMPLE and LAST_SAMPLE. The script detects whether a file has already been fixed and, if so, skips it immediately. It should therefore be safe to run the script on all the files in a directory over and over again (possibly as a cron job) to make sure that every file has the correct values in FIRST_SAMPLE and LAST_SAMPLE.
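
The skip-if-already-fixed behaviour is what makes repeated runs safe. Below is a minimal sketch of that idea, not the contents of fix_hdf5.py: a plain dictionary stands in for the attributes of an HDF5 file, `needs_fix`, `fix_attributes`, and `compute_range` are hypothetical names, and treating a missing or zero value as "not yet fixed" is an assumption based on the description above.

```python
from typing import Callable, MutableMapping, Tuple


def needs_fix(attrs: MutableMapping[str, float]) -> bool:
    """Assume a missing or zero MJD means the field was never computed."""
    first = attrs.get("FIRST_SAMPLE", 0.0)
    last = attrs.get("LAST_SAMPLE", 0.0)
    return first <= 0.0 or last <= 0.0


def fix_attributes(
    attrs: MutableMapping[str, float],
    compute_range: Callable[[], Tuple[float, float]],
) -> bool:
    """Write FIRST_SAMPLE/LAST_SAMPLE if needed; return True when modified."""
    if not needs_fix(attrs):
        return False  # Already fixed: skip without running the slow computation
    first, last = compute_range()
    attrs["FIRST_SAMPLE"] = first
    attrs["LAST_SAMPLE"] = last
    return True
```

Because the expensive `compute_range` call happens only when the check fails, a second pass over an already-fixed directory costs little more than reading two attributes per file.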