ARTIQ Feature Request

Problem this request addresses

The discussion on #1544 shows that an API for accessing the full functionality of the underlying HDF5 storage should be designed. This would not only enable dataset compression but also other optimized/customizable storage (#1345).
Things to keep in mind:
#505 (on-demand checkpointing)
#1464 (save early)
#1272 (merging the "archive" and "datasets" groups should be done together with these changes)
we'd also like to have the possibility to not save any results (not even an empty HDF5 file)
Possible solution
Extend the OO approach in HasEnvironment and the state machine in worker_impl. Code illustrating the principle:
import h5py
import numpy as np

from artiq.experiment import EnvExperiment


class HasEnvironment:
    # sketch of additions to HasEnvironment
    def get_hdf5_handle(self, mode="a", **kwds):
        # hdf5_fname() stands for however the worker determines the output filename
        return h5py.File(hdf5_fname(self), mode, **kwds)

    # probably rarely overridden
    def write_metadata(self, f):
        f["rid"] = rid  # rid provided by the worker (pseudocode)
        # etc.

    # probably rarely overridden
    def write_datasets(self, f):
        # ensure the group exists; store_dataset then looks it up by name
        if "datasets" in f:
            group = f["datasets"]
        else:
            group = f.create_group("datasets")
        for name, value in self.__dataset_mgr.local.items():
            self.store_dataset(f, name, value)

    # probably rarely overridden
    # can be called for checkpointing
    # is called by the worker if run() fails and after analyze()
    def write_results(self):
        with self.get_hdf5_handle() as f:
            self.write_metadata(f)
            self.write_datasets(f)

    def store_dataset(self, f, name, value):
        if name in f["datasets"]:
            f["datasets"][name][()] = value
        else:
            f["datasets"][name] = value
        return f["datasets"][name]


class NoSaveExperiment(EnvExperiment):
    # in-memory HDF5 file that is never written to disk
    def get_hdf5_handle(self, mode="a", **kwds):
        return h5py.File("dummy.h5", mode, driver="core", backing_store=False)


class CompressArrays(EnvExperiment):
    def store_dataset(self, f, name, value):
        if name in f["datasets"]:
            f["datasets"][name][()] = value
            return f["datasets"][name]
        if isinstance(value, np.ndarray):
            return f["datasets"].create_dataset(name, data=value,
                                                compression="gzip", compression_opts=9)
        else:
            return super().store_dataset(f, name, value)


class ImageAttributes(EnvExperiment):
    def store_dataset(self, f, name, value):
        if "image" not in f["datasets"]:
            f["datasets"].create_dataset("image",
                                         data=self.get_dataset("image", archive=False),
                                         compression="gzip")
        if name.startswith("image."):
            # store "image.foo" datasets as HDF5 attributes of the "image" dataset
            attrname = name.split(".", maxsplit=1)[1]
            f["datasets"]["image"].attrs[attrname] = value
        elif name != "image":
            super().store_dataset(f, name, value)

# etc.
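As a usage sketch of the checkpointing hook (the LongScan experiment and its dataset name are hypothetical; it relies on the write_results() method proposed above and on the standard set_dataset() call):

class LongScan(EnvExperiment):
    # Hypothetical experiment: checkpoints a fixed-size result array every 100 points,
    # using the write_results() hook sketched above (on-demand checkpointing, #505).
    def run(self):
        n = 1000
        squares = np.zeros(n)
        for i in range(n):
            squares[i] = i ** 2                   # stand-in for a real measurement
            self.set_dataset("squares", squares)  # fixed shape, so later writes update in place
            if (i + 1) % 100 == 0:
                self.write_results()              # flush current datasets to the HDF5 file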
Pros:
no change to the dataset API (in particular not to dataset_db, which has surprisingly many hard-coded occurrences in the codebase)
can easily define global behavior (e.g. CompressArrays, compression by name as sketched after this list, changing the data layout, storing some datasets as attributes as in ImageAttributes, etc.)
full HDF5 API available
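As a sketch of the "compress by name" case (the class name CompressByName and the COMPRESSED_PREFIXES tuple are hypothetical; imports as in the code above):

class CompressByName(EnvExperiment):
    # Hypothetical override: gzip-compress only datasets whose name starts with a given prefix.
    COMPRESSED_PREFIXES = ("raw.", "camera.")

    def store_dataset(self, f, name, value):
        if name not in f["datasets"] and name.startswith(self.COMPRESSED_PREFIXES):
            return f["datasets"].create_dataset(name, data=value, compression="gzip")
        return super().store_dataset(f, name, value)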
Cons:
no persistence of the dataset options
no encapsulation of the dataset concept (not sure this is a disadvantage)
need to expose some internals; in particular, listing the datasets at the experiment level is required to help write store_dataset overrides (a possible accessor is sketched after this list)
the proper OO approach would have the datasets know how to save themselves, with attributes declared together with the dataset they characterize. This, however, would be a major rework of the dataset API and would likely break existing code.
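One way to address the "expose some internals" point could be a small read-only accessor on HasEnvironment (name and placement hypothetical):

class HasEnvironment:
    # Hypothetical accessor: lets store_dataset overrides inspect which local
    # (archived) datasets exist without touching the dataset manager directly.
    def local_dataset_names(self):
        return list(self.__dataset_mgr.local.keys())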
The above concept is still very rough. In particular, store_dataset overrides can quickly become very verbose, so helpers matching common cases should be provided; one possible helper is sketched below.
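As one example of such a helper (the mixin name and its _create_kwargs() hook are hypothetical; imports as in the code above):

class StoreOptionsMixin:
    # Hypothetical mixin factoring out the "create new vs. update existing" boilerplate
    # so that subclasses only choose per-dataset HDF5 creation options.
    def _create_kwargs(self, name, value):
        return {}  # default: no special options

    def store_dataset(self, f, name, value):
        group = f["datasets"]
        if name in group:
            group[name][()] = value  # update in place; shape must match
            return group[name]
        return group.create_dataset(name, data=value,
                                    **self._create_kwargs(name, value))


class CompressLargeArrays(StoreOptionsMixin, EnvExperiment):
    def _create_kwargs(self, name, value):
        # e.g. compress only large arrays
        if isinstance(value, np.ndarray) and value.size > 1000:
            return {"compression": "gzip", "compression_opts": 9}
        return {}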
Comments (fully orthogonal ideas) most welcome!