ARTIQ Feature Request

Problem this request addresses

The discussion on #1544 shows that an API for accessing the full functionality of the underlying HDF5 storage should be designed. This would not only enable dataset compression but also other optimized/customizable storage (#1345).
Things to keep in mind:
#505 (on-demand checkpointing)
#1464 (save early)
#1272 (merging the "archive" and "datasets" groups should be done together with these changes)
we'd also like to have the possibility to not save any results (not even an empty HDF5 file)
Possible solution
Extend the OO approach in HasEnvironment and the state machine in worker_impl. Code illustrating the principle:
import h5py
import numpy as np

from artiq.experiment import EnvExperiment


class HasEnvironment:
    # sketch of additions to HasEnvironment
    def get_hdf5_handle(self, mode="a", **kwds):
        # hdf5_fname() stands for however the worker determines the output filename
        return h5py.File(hdf5_fname(self), mode, **kwds)

    # probably rarely overridden
    def write_metadata(self, f):
        f["rid"] = rid  # rid provided by the worker (pseudocode)
        # etc.

    # probably rarely overridden
    def write_datasets(self, f):
        # ensure the group exists; store_dataset then looks it up by name
        if "datasets" in f:
            group = f["datasets"]
        else:
            group = f.create_group("datasets")
        for name, value in self.__dataset_mgr.local.items():
            self.store_dataset(f, name, value)

    # probably rarely overridden
    # can be called for checkpointing
    # is called by the worker if run() fails and after analyze()
    def write_results(self):
        with self.get_hdf5_handle() as f:
            self.write_metadata(f)
            self.write_datasets(f)

    def store_dataset(self, f, name, value):
        if name in f["datasets"]:
            f["datasets"][name][()] = value
        else:
            f["datasets"][name] = value
        return f["datasets"][name]


class NoSaveExperiment(EnvExperiment):
    # in-memory HDF5 file that is never written to disk
    def get_hdf5_handle(self, mode="a", **kwds):
        return h5py.File("dummy.h5", mode, driver="core", backing_store=False)


class CompressArrays(EnvExperiment):
    def store_dataset(self, f, name, value):
        if name in f["datasets"]:
            f["datasets"][name][()] = value
            return f["datasets"][name]
        if isinstance(value, np.ndarray):
            return f["datasets"].create_dataset(name, data=value,
                                                compression="gzip", compression_opts=9)
        else:
            return super().store_dataset(f, name, value)


class ImageAttributes(EnvExperiment):
    def store_dataset(self, f, name, value):
        if "image" not in f["datasets"]:
            f["datasets"].create_dataset("image",
                                         data=self.get_dataset("image", archive=False),
                                         compression="gzip")
        if name.startswith("image."):
            # store "image.foo" datasets as HDF5 attributes of the "image" dataset
            attrname = name.split(".", maxsplit=1)[1]
            f["datasets"]["image"].attrs[attrname] = value
        elif name != "image":
            super().store_dataset(f, name, value)

# etc.
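As a usage sketch of the checkpointing hook (the LongScan experiment and its dataset name are hypothetical; it relies on the write_results() method proposed above and on the standard set_dataset() call):

class LongScan(EnvExperiment):
    # Hypothetical experiment: checkpoints a fixed-size result array every 100 points,
    # using the write_results() hook sketched above (on-demand checkpointing, #505).
    def run(self):
        n = 1000
        squares = np.zeros(n)
        for i in range(n):
            squares[i] = i ** 2                   # stand-in for a real measurement
            self.set_dataset("squares", squares)  # fixed shape, so later writes update in place
            if (i + 1) % 100 == 0:
                self.write_results()              # flush current datasets to the HDF5 file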
Pros:
no change to the dataset API (in particular not to dataset_db, which has surprisingly many hard-coded occurrences in the codebase)
can easily define global behavior (e.g. CompressArrays, compression by name as sketched after this list, changing the data layout, storing some datasets as attributes as in ImageAttributes, etc.)
full HDF5 API available
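As a sketch of the "compress by name" case (the class name CompressByName and the COMPRESSED_PREFIXES tuple are hypothetical; imports as in the code above):

class CompressByName(EnvExperiment):
    # Hypothetical override: gzip-compress only datasets whose name starts with a given prefix.
    COMPRESSED_PREFIXES = ("raw.", "camera.")

    def store_dataset(self, f, name, value):
        if name not in f["datasets"] and name.startswith(self.COMPRESSED_PREFIXES):
            return f["datasets"].create_dataset(name, data=value, compression="gzip")
        return super().store_dataset(f, name, value)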
Cons:
no persistence of the dataset options
no encapsulation of the dataset concept (not sure this is a disadvantage)
need to expose some internals; in particular, listing the datasets at the experiment level is required to help write store_dataset overrides (a possible accessor is sketched after this list)
the proper OO approach would have the datasets know how to save themselves, with attributes declared together with the dataset they characterize. This, however, would be a major rework of the dataset API and would likely break existing code.
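One way to address the "expose some internals" point could be a small read-only accessor on HasEnvironment (name and placement hypothetical):

class HasEnvironment:
    # Hypothetical accessor: lets store_dataset overrides inspect which local
    # (archived) datasets exist without touching the dataset manager directly.
    def local_dataset_names(self):
        return list(self.__dataset_mgr.local.keys())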
The above concept is still very rough. In particular, store_dataset overrides can quickly become very verbose, so helpers matching common cases should be provided; one possible helper is sketched below.
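As one example of such a helper (the mixin name and its _create_kwargs() hook are hypothetical; imports as in the code above):

class StoreOptionsMixin:
    # Hypothetical mixin factoring out the "create new vs. update existing" boilerplate
    # so that subclasses only choose per-dataset HDF5 creation options.
    def _create_kwargs(self, name, value):
        return {}  # default: no special options

    def store_dataset(self, f, name, value):
        group = f["datasets"]
        if name in group:
            group[name][()] = value  # update in place; shape must match
            return group[name]
        return group.create_dataset(name, data=value,
                                    **self._create_kwargs(name, value))


class CompressLargeArrays(StoreOptionsMixin, EnvExperiment):
    def _create_kwargs(self, name, value):
        # e.g. compress only large arrays
        if isinstance(value, np.ndarray) and value.size > 1000:
            return {"compression": "gzip", "compression_opts": 9}
        return {}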
Comments (fully orthogonal ideas) most welcome!