m-labs / artiq


Datasets: customize HDF5 storage #1545

Open airwoodix opened 4 years ago

airwoodix commented 4 years ago

ARTIQ Feature Request

Problem this request addresses

The discussion on #1544 shows that an API exposing the full functionality of the underlying HDF5 storage should be designed. This would not only enable dataset compression but also other optimized/customizable storage (#1345).

Things to keep in mind:

Possible solution

Extend the OO approach in HasEnvironment and the state machine in worker_impl. Code illustrating the principle:

import h5py
import numpy as np

class HasEnvironment:
    def get_hdf5_handle(self, mode="a", **kwds):
        # hdf5_fname: helper resolving the results file path
        return h5py.File(hdf5_fname(self), mode, **kwds)

    # probably rarely overridden
    def write_metadata(self, f):
        f["rid"] = rid  # provided by the worker
        # etc.

    # probably rarely overridden
    def write_datasets(self, f):
        f.require_group("datasets")  # create the group if it does not exist yet
        for name, value in self.__dataset_mgr.local.items():
            self.store_dataset(f, name, value)

    # probably rarely overridden
    # can be called for checkpointing
    # is called by the worker if run() fails and after analyze()
    def write_results(self):
        with self.get_hdf5_handle() as f:
            self.write_metadata(f)
            self.write_datasets(f)

    def store_dataset(self, f, name, value):
        if name in f["datasets"]:
            f["datasets"][name][()] = value
        else:
            f["datasets"][name] = value
        return f["datasets"][name]

class NoSaveExperiment(EnvExperiment):
    def get_hdf5_handle(self, mode="a", **kwds):
        # in-memory HDF5 file that is never written to disk
        return h5py.File("dummy.h5", mode, driver="core", backing_store=False, **kwds)

class CompressArrays(EnvExperiment):
    def store_dataset(self, f, name, value):
        if name in f["datasets"]:
            f["datasets"][name][()] = value
            return f["datasets"][name]

        if isinstance(value, np.ndarray):
            # archive arrays with gzip compression at the maximum level
            return f["datasets"].create_dataset(
                name, data=value, compression="gzip", compression_opts=9)
        else:
            return super().store_dataset(f, name, value)

class ImageAttributes(EnvExperiment):
    def store_dataset(self, f, name, value):
        if "image" not in f["datasets"]:
            f["datasets"].create_dataset(
                "image", data=self.get_dataset("image", archive=False),
                compression="gzip")

        if name.startswith("image."):
            # map "image.*" datasets onto HDF5 attributes of the image dataset
            attrname = name.split(".", maxsplit=1)[1]
            f["datasets"]["image"].attrs[attrname] = value
        elif name != "image":
            super().store_dataset(f, name, value)

# etc.
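
For context, a minimal sketch of how the worker side could drive these hooks; the control flow below is an assumption about where write_results() would be wired into worker_impl, not actual ARTIQ code:

# hypothetical worker-side control flow (not actual worker_impl code)
def run_experiment_stages(exp):
    try:
        exp.run()
    except:
        exp.write_results()  # archive partial results if run() fails
        raise
    exp.analyze()
    exp.write_results()  # archive final results after analyze()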

Pros:

Cons:

The above concept is still very rough. In particular, store_dataset overrides can quickly become very verbose; helpers matching common cases should be provided (one possible shape is sketched below).
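
As one possible shape for such helpers (all names hypothetical), a small mixin could dispatch on dataset name patterns, so experiments declare a storage policy instead of re-implementing store_dataset from scratch:

import fnmatch

class PatternStorageMixin:
    # hypothetical helper: map glob patterns to h5py creation options
    storage_options = {}  # e.g. {"*_trace": {"compression": "gzip"}}

    def store_dataset(self, f, name, value):
        if name not in f["datasets"]:
            for pattern, opts in self.storage_options.items():
                if fnmatch.fnmatch(name, pattern):
                    return f["datasets"].create_dataset(name, data=value, **opts)
        return super().store_dataset(f, name, value)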

Comments (even fully orthogonal ideas) are most welcome!

xiaosahnzhu commented 1 year ago

How do I read an HDF5 file's global attributes? Thank you.
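
For reference, a minimal sketch, assuming the results file layout discussed above where metadata is written as top-level entries next to the datasets group (the file name is a placeholder):

import h5py

with h5py.File("path/to/results.h5", "r") as f:
    print(f["rid"][()])         # top-level metadata entries
    print(dict(f.attrs))        # file-level HDF5 attributes, if any
    print(list(f["datasets"]))  # archived datasets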