pyiron / pyiron_workflow

Graph-and-node based workflows
BSD 3-Clause "New" or "Revised" License

Serialization for pyiron_node function fails #402

Closed. JNmpi closed this issue 1 month ago.

JNmpi commented 2 months ago

I have updated pyiron_workflow to the latest version in main. When running

engine = Workflow.create.atomistic.engine.ase.M3GNet()

I get the lengthy error message attached at the end. With my older pyiron_workflow version this worked, but I got the recursion error when calling:

engine.save()

I suspect that getting the error already when merely creating the node is related to the new cache functionality, which internally does something similar to the save statement.

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
Cell In[34], line 1
----> 1 engine = Workflow.create.atomistic.engine.ase.M3GNet()

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/nodes/static_io.py:39, in StaticNode.__init__(self, label, parent, overwrite_save, run_after_init, storage_backend, save_after_run, *args, **kwargs)
     28 def __init__(
     29     self,
     30     *args,
   (...)
     37     **kwargs,
     38 ):
---> 39     super().__init__(
     40         *args,
     41         label=label,
     42         parent=parent,
     43         overwrite_save=overwrite_save,
     44         run_after_init=run_after_init,
     45         storage_backend=storage_backend,
     46         save_after_run=save_after_run,
     47         **kwargs,
     48     )

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/node.py:353, in Node.__init__(self, label, parent, overwrite_save, run_after_init, storage_backend, save_after_run, *args, **kwargs)
    350 self._user_data = {}  # A place for power-users to bypass node-injection
    352 self._setup_node()
--> 353 self._after_node_setup(
    354     *args,
    355     overwrite_save=overwrite_save,
    356     run_after_init=run_after_init,
    357     **kwargs,
    358 )

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/node.py:395, in Node._after_node_setup(self, overwrite_save, run_after_init, *args, **kwargs)
    389 elif do_load:
    390     logger.info(
    391         f"A saved file was found for the node {self.full_label} -- "
    392         f"attempting to load it...(To delete the saved file instead, use "
    393         f"`overwrite_save=True`)"
    394     )
--> 395     self.load()
    396     self.set_input_values(*args, **kwargs)
    397 elif run_after_init:

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/mixin/storage.py:281, in HasStorage.load(self)
    273 def load(self):
    274     """
    275     Loads the node file (from HDF5) such that this node restores its state at time
    276     of loading.
   (...)
    279         TypeError: when the saved node has a different class name.
    280     """
--> 281     self.storage.load()

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/mixin/storage.py:60, in StorageInterface.load(self)
     57 def load(self):
     58     # Misdirection is strictly for symmetry with _save, so child classes define the
     59     # private method in both cases
---> 60     return self._load()

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/mixin/storage.py:177, in TinybaseStorage._load(self)
    171 if tinybase_storage["class_name"] != self.owner.__class__.__name__:
    172     raise TypeError(
    173         f"{self.owner.label} cannot load, as it has type "
    174         f"{self.owner.__class__.__name__},  but the saved node has type "
    175         f"{tinybase_storage['class_name']}"
    176     )
--> 177 self.owner.from_storage(tinybase_storage)

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/node.py:870, in Node.from_storage(self, storage)
    868 data_outputs = storage["outputs"]
    869 for label in data_outputs.list_groups():
--> 870     self.outputs[label].from_storage(data_outputs[label])

File ~/python_projects/git_libs/pyiron_workflow/pyiron_workflow/channels.py:499, in DataChannel.from_storage(self, storage)
    494 self.default = storage["default"]
    495 from pyiron_contrib.tinybase.storage import GenericStorage
    497 self.value = (
    498     storage["value"].to_object()
--> 499     if isinstance(storage["value"], GenericStorage)
    500     else storage["value"]
    501 )

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/pyiron_contrib/tinybase/storage.py:365, in H5ioStorage.__getitem__(self, item)
    364 def __getitem__(self, item):
--> 365     value = self._pointer[item]
    366     if isinstance(value, Hdf5Pointer):
    367         return type(self)(value, project=self._project)

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io_browser/pointer.py:396, in Pointer.__getitem__(self, item)
    394     h5_path_new = self._h5_path + item
    395 try:
--> 396     data_dict = read_dict_from_hdf(
    397         file_name=self._file_name,
    398         h5_path=h5_path_new,
    399         recursive=False,
    400     )
    401     if len(data_dict) > 1:
    402         return get_hierarchical_dict(
    403             path_dict={
    404                 k.replace(self._h5_path + "/", ""): v
    405                 for k, v in data_dict.items()
    406             }
    407         )

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io_browser/base.py:92, in read_dict_from_hdf(file_name, h5_path, recursive, slash)
     89     nodes_lst = [h5_path]
     90 if len(nodes_lst) > 0 and nodes_lst[0] != "/":
     91     return {
---> 92         n: _read_hdf(hdf_filehandle=hdf, h5_path=n, slash=slash)
     93         for n in nodes_lst
     94     }
     95 else:
     96     return {}

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io_browser/base.py:283, in _read_hdf(hdf_filehandle, h5_path, slash)
    270 """
    271 Read data from HDF5 file
    272 
   (...)
    280     object:     The loaded data. Can be of any type supported by ``write_hdf5``.
    281 """
    282 file_name = _get_filename_from_filehandle(hdf_filehandle=hdf_filehandle)
--> 283 return _retry(
    284     lambda: h5io.read_hdf5(
    285         fname=hdf_filehandle,
    286         title=h5_path,
    287         slash=slash,
    288     ),
    289     error=BlockingIOError,
    290     msg=f"Two or more processes tried to access the file {file_name}.",
    291     at_most=10,
    292     delay=1,
    293 )

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io_browser/base.py:553, in _retry(func, error, msg, at_most, delay, delay_factor)
    551 for i in tries:
    552     try:
--> 553         return func()
    554     except error as e:
    555         warnings.warn(
    556             f"{msg} Trying again in {delay}s. Tried {i + 1} times so far..."
    557         )

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io_browser/base.py:284, in _read_hdf.<locals>.<lambda>()
    270 """
    271 Read data from HDF5 file
    272 
   (...)
    280     object:     The loaded data. Can be of any type supported by ``write_hdf5``.
    281 """
    282 file_name = _get_filename_from_filehandle(hdf_filehandle=hdf_filehandle)
    283 return _retry(
--> 284     lambda: h5io.read_hdf5(
    285         fname=hdf_filehandle,
    286         title=h5_path,
    287         slash=slash,
    288     ),
    289     error=BlockingIOError,
    290     msg=f"Two or more processes tried to access the file {file_name}.",
    291     at_most=10,
    292     delay=1,
    293 )

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:503, in read_hdf5(fname, title, slash)
    500     return _triage_read(fid[title], slash=slash)
    502 if isinstance(fname, h5py.File):
--> 503     return _read(fname)
    504 else:
    505     with h5py.File(fname, mode="r") as fid:

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:500, in read_hdf5.<locals>._read(fid)
    498     if "TITLE" not in fid[title].attrs:
    499         raise ValueError('no "%s" data found' % title)
--> 500 return _triage_read(fid[title], slash=slash)

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:570, in _triage_read(node, slash)
    562     data = multiarray_load(ma_index, ma_data)
    563 elif sys.version_info >= (3, 11):
    564     # Requires python >= 3.11 as python 3.11 added the default implementation
    565     # of the __getstate__() method in the object class.
    566     # Based on https://docs.python.org/3/library/pickle.html#object.__getstate__
    567     return _setstate(
    568         obj_class=_import_class(class_type=type_str),
    569         state_dict={
--> 570             n: _triage_read(node[n], slash="ignore") for n in list(node.keys())
    571         },
    572     )
    573 else:
    574     raise NotImplementedError("Unknown group type: {0}" "".format(type_str))

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:570, in _triage_read(node, slash)
    562     data = multiarray_load(ma_index, ma_data)
    563 elif sys.version_info >= (3, 11):
    564     # Requires python >= 3.11 as python 3.11 added the default implementation
    565     # of the __getstate__() method in the object class.
    566     # Based on https://docs.python.org/3/library/pickle.html#object.__getstate__
    567     return _setstate(
    568         obj_class=_import_class(class_type=type_str),
    569         state_dict={
--> 570             n: _triage_read(node[n], slash="ignore") for n in list(node.keys())
    571         },
    572     )
    573 else:
    574     raise NotImplementedError("Unknown group type: {0}" "".format(type_str))

    [... skipping similar frames: _triage_read at line 570 (2948 times)]

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:570, in _triage_read(node, slash)
    562     data = multiarray_load(ma_index, ma_data)
    563 elif sys.version_info >= (3, 11):
    564     # Requires python >= 3.11 as python 3.11 added the default implementation
    565     # of the __getstate__() method in the object class.
    566     # Based on https://docs.python.org/3/library/pickle.html#object.__getstate__
    567     return _setstate(
    568         obj_class=_import_class(class_type=type_str),
    569         state_dict={
--> 570             n: _triage_read(node[n], slash="ignore") for n in list(node.keys())
    571         },
    572     )
    573 else:
    574     raise NotImplementedError("Unknown group type: {0}" "".format(type_str))

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:532, in _triage_read(node, slash)
    530     if subnode is None:
    531         break
--> 532     data.append(_triage_read(subnode, slash=slash))
    533     ii += 1
    534 assert len(data) == ii

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:570, in _triage_read(node, slash)
    562     data = multiarray_load(ma_index, ma_data)
    563 elif sys.version_info >= (3, 11):
    564     # Requires python >= 3.11 as python 3.11 added the default implementation
    565     # of the __getstate__() method in the object class.
    566     # Based on https://docs.python.org/3/library/pickle.html#object.__getstate__
    567     return _setstate(
    568         obj_class=_import_class(class_type=type_str),
    569         state_dict={
--> 570             n: _triage_read(node[n], slash="ignore") for n in list(node.keys())
    571         },
    572     )
    573 else:
    574     raise NotImplementedError("Unknown group type: {0}" "".format(type_str))

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:524, in _triage_read(node, slash)
    522             for key_spec, val_spec in special_chars.items():
    523                 key = key.replace(key_spec, val_spec)
--> 524         data[key[4:]] = _triage_read(subnode, slash=slash)
    525 elif type_str in ["list", "tuple"]:
    526     data = list()

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5io/_h5io.py:599, in _triage_read(node, slash)
    597 elif type_str in ("unicode", "ascii", "str"):  # 'str' for backward compat
    598     decoder = "utf-8" if type_str == "unicode" else "ASCII"
--> 599     data = str(np.array(node).tobytes().decode(decoder))
    600 elif type_str == "json":
    601     node_unicode = str(np.array(node).tobytes().decode("utf-8"))

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5py/_hl/dataset.py:1060, in Dataset.__array__(self, dtype)
   1057 arr = numpy.zeros(self.shape, dtype=self.dtype if dtype is None else dtype)
   1059 # Special case for (0,)*-shape datasets
-> 1060 if self.size == 0:
   1061     return arr
   1063 self.read_direct(arr)

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5py/_hl/dataset.py:489, in Dataset.size(self)
    486 if 'size' in self._cache_props:
    487     return self._cache_props['size']
--> 489 if self._is_empty:
    490     size = None
    491 else:

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5py/_hl/base.py:536, in cached_property.__get__(self, obj, cls)
    533 if obj is None:
    534     return self
--> 536 value = obj.__dict__[self.func.__name__] = self.func(obj)
    537 return value

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5py/_hl/dataset.py:634, in Dataset._is_empty(self)
    631 @cached_property
    632 def _is_empty(self):
    633     """Check if extent type is empty"""
--> 634     return self._extent_type == h5s.NULL

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5py/_hl/base.py:536, in cached_property.__get__(self, obj, cls)
    533 if obj is None:
    534     return self
--> 536 value = obj.__dict__[self.func.__name__] = self.func(obj)
    537 return value

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/miniforge3/envs/py12/lib/python3.12/site-packages/h5py/_hl/dataset.py:629, in Dataset._extent_type(self)
    625 @cached_property
    626 @with_phil
    627 def _extent_type(self):
    628     """Get extent type for this dataset - SIMPLE, SCALAR or NULL"""
--> 629     return self.id.get_space().get_simple_extent_type()

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

RecursionError: maximum recursion depth exceeded

liamhuber commented 2 months ago

Hi Jörg,

I'll have to dig in deep next week. The initialization error looks like it's accessing an already-saved copy of the node -- is there a save file there with the same name? If you previously tried and failed to save the node, maybe it wrote a partial file?

I'm also not surprised that .save() might raise a cyclicity error -- it's using the H5 backend, which is still pretty janky. You could try switching the backend (between "h5io" and "tinybase"). What I was talking about today was pickle, which has no special interface, but you need to call it explicitly: pickle.dump(...
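
For concreteness, a rough sketch of both suggestions; the storage_backend keyword is inferred from the Node.__init__ signature visible in the traceback above, and the file name is arbitrary:

import pickle

from pyiron_workflow import Workflow

# Try the other storage backend ("h5io" or "tinybase"):
engine = Workflow.create.atomistic.engine.ase.M3GNet(storage_backend="tinybase")

# Pickle has no special node interface; call it explicitly:
with open("engine.pkl", "wb") as f:
    pickle.dump(engine, f)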

JNmpi commented 2 months ago

Thanks @liamhuber. Deleting the stored file helped. Attached is a case where dump fails:

  import pickle

  from pyiron_workflow import Workflow
  # Import locations below are inferred from the node usage; exact paths may differ:
  from pyiron_nodes import atomistic
  from pyiron_nodes.atomistic.property.elastic import InputElasticTensor

  supercell = Workflow.create.atomistic.structure.build.cubic_bulk_cell(element='Ni', cell_size=3, vacancy_index=0)
  m3gnet = Workflow.create.atomistic.engine.ase.M3GNet()
  elastic_constants = atomistic.property.elastic.elastic_constants(
      structure=supercell,
      engine=m3gnet,
      parameters=InputElasticTensor(),
  )

  out = elastic_constants.pull()

  pickle.dumps(m3gnet)   # fails with: PicklingError: Can't pickle <function _lambdifygenerated at 0x30257b880>: attribute lookup _lambdifygenerated on torch failed

liamhuber commented 2 months ago

@JNmpi, this is a much easier case to solve.

Executive summary

This is not an error with our infrastructure -- that's completely pickleable -- but rather with the output data of the M3GNet node. Under the hood, some part of the engine object looks like it's using a lambda function, and these are fundamentally un-pickleable.

However, it is still a situation we want to be able to handle. There are a few possible solutions, mostly outside our code base; the most straightforward and robust is to simply use cloudpickle instead.

The big take-home for me is that https://github.com/pyiron/pyiron_workflow/pull/408, which introduces a pickle backend to Node.save(), should promptly be extended to fall back on cloudpickle.
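
A sketch of what such a fallback could look like (the helper name and the exact set of caught exceptions are illustrative assumptions, not the contents of the PR):

import pickle

import cloudpickle

def robust_dumps(obj) -> bytes:
    # Prefer the faster stdlib pickle; fall back to cloudpickle for objects
    # (e.g. lambdas) that plain pickle rejects
    try:
        return pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return cloudpickle.dumps(obj)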

Deep dive

By (de)serializing before and after running, and by trying to work with the actual output value itself, we can quickly see that the issue lies with the un-pickleable output value:

import pickle

import cloudpickle

from pyiron_nodes.atomistic.engine.ase import M3GNet

n = M3GNet()

# Pickle works fine before we execute the node
print(pickle.loads(pickle.dumps(n)).label)
>>> M3GNet

# Running the node produces the problematic engine output
n()

try:
    pickle.dumps(n.outputs.engine.value)
except pickle.PicklingError:
    print("The output data itself causes trouble")
>>> The output data itself causes trouble

# Cloudpickle works fine even with the data
print(cloudpickle.loads(cloudpickle.dumps(n)).label)  
>>> M3GNet

There are a few options:

  1. Just use cloudpickle. This is robust and straightforward, but since pickle is not a hierarchical data format, it means we need to cloudpickle the entire workflow as soon as one IO element is not pickleable. Since cloudpickle is slower than pickle, this is an annoying downside.
  2. Go upstream and make the data type pickleable. This is high-benefit, but it's also extremely high-cost, so at the end of the day I don't see it being a profitable attack. If the problem lay in matgl there might be a chance, but it looks like maybe it's in pytorch. Not only would we have to understand that code well enough to "fix" it, but we'd also need to convince the owners to merge our fix. Unlikely.
  3. Fix, then use tinybase storage. @pmrv's tinybase storage is a lot like h5io, but when it fails it should fall back on cloudpickle. In principle, this resolves the downside to (1), where we only cloudpickle the offensive piece and everything else gets a prettier serialization. In practice, not only does this raise the same pickling error (a surprise to me), but it also leaves a corrupted save file around that caused your initial recursion problem (this is for sure my fault for being sloppy in a failure case somewhere). The "fix" in the opening sentence does a lot of work here, and this is also an expensive attack.

An aside on h5io

Even if we did fix the current issue with tinybase so that it correctly cloudpickles the engine object here, it is still built on top of h5io and thus inherits some of its weaknesses. One of the big ones is that tinybase only "fails", and thus falls back to cloudpickle, when h5io fails explicitly, but there are some cases where h5io fails in a silent and pernicious way. E.g.:

from pyiron_snippets.dotdict import DotDict
from pyiron_workflow import Workflow

n = Workflow.create.standard.UserInput(DotDict({"a": 42}), label="h5io_fail")
n.save()
reloaded = Workflow.create.standard.UserInput(label="h5io_fail")

print(type(n.inputs.user_input.value))
>>> <class 'pyiron_snippets.dotdict.DotDict'>
print(type(reloaded.inputs.user_input.value))
>>> <class 'dict'>

That's not the class I saved! So the "fix" in (3) largely amounts to our (failed) SDG proposal. It is definitely possible, but lots of work.

pmrv commented 2 months ago

3. Fix, then use tinybase storage. @pmrv's tinybase storage is a lot like h5io, but when it fails it should fall back on cloudpickle. In principle, this resolves the downside to (1), where we only cloudpickle the offensive piece and everything else gets a prettier serialization. In practice, not only does this raise the same pickling error (a surprise to me), but it also leaves a corrupted save file around that caused your initial recursion problem (this is for sure my fault for being sloppy in a failure case somewhere). The "fix" in the opening sentence does a lot of work here, and this is also an expensive attack.

This is likely because tinybase storage also uses normal pickle as the fallback; moving it to cloudpickle would be straightforward, though. If there's continued interest in the storage from the workflow side, we can think about moving it from contrib to its own place (it is somewhat uncoupled from the rest of tinybase).

pmrv commented 2 months ago

Even if we did fix the current issue with tinybase so that it correctly cloudpickles the engine object here, it is still built on top of h5io and thus inherits some of its weaknesses. One of the big ones is that tinybase only "fails", and thus falls back to cloudpickle, when h5io fails explicitly, but there are some cases where h5io fails in a silent and pernicious way.

This could be fixed in h5io{,_browser} somewhat easily I think (by tightening the typechecks), if we deem it a priority, but also just implementing Storable on DotDict should sidestep the issue. It depends a bit on what else is a problem, but if it is only about DotDict, we could even make some very generic wrappers in tinybase for any Mapping, so that your workflow code doesn't have to do it.
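
A rough sketch of that generic-Mapping-wrapper idea; the class and its restore method are hypothetical stand-ins, not tinybase's actual Storable API:

from collections.abc import Mapping

class MappingWrapper:
    # Record the concrete Mapping class alongside plain-dict data, so the
    # original type (e.g. DotDict) can be reconstructed on load
    def __init__(self, mapping: Mapping):
        self.cls = type(mapping)
        self.data = dict(mapping)

    def restore(self) -> Mapping:
        return self.cls(self.data)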

liamhuber commented 2 months ago

If there's continued interest in the storage from the workflow side, we can think about moving it from contrib to its own place (it is somewhat uncoupled from the rest of tinybase).

👍 👍

This could be fixed in h5io{,_browser} somewhat easily I think (by tightening the typechecks)

Yes, this is possible. h5io actually early-exits on isinstance checks, so it is not a simple extension of the final if clause in its logical flow (like our additions for custom classes were), but involves going back and modifying the flow to account for subclassing consistently through the entire routine.

if we deem it a priority, but also just implementing Storable on DotDict should sidestep the issue

Absolutely, and this extensibility is something I really like in the tinybase design. In this case, I wouldn't want to do it on DotDict itself, as pyiron_snippets should have no awareness of tinybase; we would need to sub-class DotDict here. More generally, needing to implement Storable is a show-stopper for me -- the crux of the issue here is that we want a tool that will play nicely with (more or less^1) arbitrary user data running through workflows as IO, so requiring them to be aware of and implement some extra storage method is no good.

It depends a bit on what else is a problem, but if it is only about DotDict...

Like I alluded to earlier, the fundamental problem is that h5io is doing early-stopping when what's passed to it passes an isinstance check on any of its white-listed types. I'm sure we could work around this, but any tool building on h5io needs to make sure it takes care of this on its own because otherwise h5io will silently do the wrong thing (sidestepping tinybase's plan of falling back on another tool when h5io fails). Then there are lesser issues like h5io not handling recursion^2.

This is not to say h5io is bad or "wrong" -- if you're passing it any of its whitelisted datatypes, it behaves brilliantly. It fills its design purpose well. The problem for us is just that it was never designed to be a generic storage routine the same way pickle is, and we want something users can pass all sorts of data to. Even the extensions for (limited) custom class storage were something our group tacked onto it.

I agree it should be possible to build a (nearly) universal tool on top of h5io with some combination of modifications there and extra careful pre- and post-processing for what gets passed to it, but I think it would be less work and more robust to just design an interface to h5py that intends to be a universal^1 interface to h5 from the beginning.
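
As a toy illustration of what the baseline of such a tool could look like (anything that isn't a native h5 type gets stored as an opaque (cloud)pickled blob; the function names and layout are hypothetical):

import cloudpickle
import h5py
import numpy as np

def store_fallback(group: h5py.Group, key: str, obj) -> None:
    # Opaque bytes are the h5py-sanctioned way to store arbitrary blobs
    group[key] = np.void(cloudpickle.dumps(obj))

def load_fallback(group: h5py.Group, key: str):
    return cloudpickle.loads(group[key][()].tobytes())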

1) Truly "arbitrary" data is hard, but I think it's fair and best to use the existing pickle conditions for baseline behaviour, and ideally to fall back on cloudpickle if the user passes something that doesn't comply with that.

2) Where "recursion" means

class Child:
    def __init__(self, parent):
        self.parent = parent

class Parent:
    def __init__(self):
        self.child = Child(self)

Parent()

pmrv commented 2 months ago

This could be fixed in h5io{,_browser} somewhat easily I think (by tightening the typechecks)

Yes, this is possible. h5io actually early-exits on isinstance checks, so it is not a simple extension of the final if clause in its logical flow (like our additions for custom classes were), but involves going back and modifying the flow to account for subclassing consistently through the entire routine.

It depends a bit on what else is a problem, but if it is only about DotDict...

Like I alluded to earlier, the fundamental problem is that h5io is doing early-stopping when what's passed to it passes an isinstance check on any of its white-listed types. I'm sure we could work around this, but any tool building on h5io needs to make sure it takes care of this on its own because otherwise h5io will silently do the wrong thing (sidestepping tinybase's plan of falling back on another tool when h5io fails). Then there are lesser issues like h5io not handling recursion^2.

The type instability is a problem. I was considering suggesting upstream that they simply switch the isinstance checks to type(...) == ... checks, potentially behind a switch, but obviously this needs discussion with the maintainer. Though I do think that, to make a clean abstraction between the physical format and our interface layer, we'll likely need something of an explicit type whitelist anyway. So in that case we could sidestep this particular issue as well, by only passing known-good types directly to h5io and refusing/Storable'ing everything else.
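
To make the distinction concrete (a minimal illustration of the check itself, not h5io's actual code):

from pyiron_snippets.dotdict import DotDict

d = DotDict({"a": 42})

print(isinstance(d, dict))  # True -- h5io's dict branch claims it, and the subclass is lost
>>> True
print(type(d) is dict)      # False -- a strict check would route it to a fallback instead
>>> False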

if we deem it a priority, but also just implementing Storable on DotDict should sidestep the issue

Absolutely, and this extensibility is something I really like in the tinybase design. In this case, I wouldn't want to do it on DotDict itself, as pyiron_snippets should have no awareness of tinybase; we would need to sub-class DotDict here. More generally, needing to implement Storable is a show-stopper for me -- the crux of the issue here is that we want a tool that will play nicely with (more or less^1) arbitrary user data running through workflows as IO, so requiring them to be aware of and implement some extra storage method is no good.

Oh, I agree that generally we don't want this; I just thought it could solve this local problem. I made a small prototype today that uses __reduce__ to implement Storable, which could be the ultimate fallback.
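
In the spirit of that prototype, a hedged sketch of a __reduce__-driven fallback (the class and method names are illustrative, not tinybase's actual Storable interface):

class ReduceStorable:
    def __init__(self, obj):
        # Capture the same (callable, args, state, ...) recipe that pickle
        # itself records via the pickle protocol
        self.recipe = obj.__reduce_ex__(2)

    def restore(self):
        func, args, state, *_ = (*self.recipe, None, None)  # pad optional fields
        obj = func(*args)
        if state is not None:
            if hasattr(obj, "__setstate__"):
                obj.__setstate__(state)
            else:
                obj.__dict__.update(state)
        return obj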

This is not to say h5io is bad or "wrong" -- if you're passing it any of its whitelisted datatypes it behaves brilliantly. It fills its design purpose well. The problem for us is just that it was never designed to be a generic storage routine the same way pickle is, and we want something users can pass all sorts of data to. Even the extensions for (limited) custom class storage was something our group tacked onto it.

I agree it should be possible to build a (nearly) universal tool on top of h5io with some combination of modifications there and extra careful pre- and post-processing for what gets passed to it, but I think it would be less work and more robust to just design an interface to h5py that intends to be a universal^1 interface to h5 from the beginning.

I'm not married to h5io; likely an h5py-based implementation of GenericStorage wouldn't be dramatically more complicated.

pmrv commented 2 months ago

If there's continued interest in the storage from the workflow side, we can think about moving it from contrib to its own place (it is somewhat uncoupled from the rest of tinybase).

  1. Where "recursion" means
class Child:
    def __init__(self, parent):
        self.parent = parent

class Parent:
    def __init__(self):
        self.child = Child(self)

Parent()

Cyclic dependencies are tricky, and I don't think we'll get them in a hierarchical format easily. (It should be possible, but annoying to implement; I think pickle does it by memoizing already-written objects and adding back-references to them.) Are those cases we expect for user data? Because if we think this will mostly come from our infrastructure, it'll be easier to overload __getstate__ or implement Storable for these cases.
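
For reference, pickle's memoization does round-trip exactly the cycle from the footnote quoted above:

import pickle

class Child:
    def __init__(self, parent):
        self.parent = parent

class Parent:
    def __init__(self):
        self.child = Child(self)

p = pickle.loads(pickle.dumps(Parent()))
print(p.child.parent is p)  # the back-reference is restored to the same object
>>> True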

liamhuber commented 2 months ago

I was considering suggesting upstream that they simply switch the isinstance checks to type(...) == ... checks, potentially behind a switch, but obviously this needs discussion with the maintainer.

💯 I don't think he'd be closed to it, either. I don't have the time to go do it, especially with even a small risk it might get rejected regardless. But that's the direction to go, IMO.

Though I do think that, to make a clean abstraction between the physical format and our interface layer, we'll likely need something of an explicit type whitelist anyway. ... Are those cases we expect for user data? Because if we think this will mostly come from our infrastructure...

This is the heart of the problem: the ideal solution lets users leverage the workflows with whatever data they want, and serialization should just deal with it. I agree we need some sort of interface requirement, but I think demanding anything beyond "it's pickle-able" is asking too much.

I'm not married to h5io, likely a h5py based implementation of GenericStorage wouldn't be dramatically more complicated.

I'm also not dead-set against upgrading h5io to meet our requirements; I just think that the technological and organizational overhead is high enough that greenfielding something that's designed to meet our needs from the start is a net win.

Cyclic dependencies are tricky, and I don't think we'll get them in a hierarchical format easily. ...it'll be easier to overload __getstate__ or implement Storable for these cases.

💯 When I was trying to think about how to manage it for the hypothetical SDG, I imagined a memoization scheme where we store an intermediate access-map (formatted like the object structure) and put multiply-accessed data in a single location that gets resolved whenever the user goes looking at a map location. For hierarchical storage this -- at minimum -- destroys the storage lining up 1:1 with the storage access.

Overloading __getstate__ is exactly how I've been doing it here, so that the h5io and tinybase backends work at all. In some situations we actually really want this sort of de-parenting: e.g., if you want to serialize a node to ship it to a python process, you don't want __getstate__ to find the parent, and then the parents recursively upwards forever -- you'd wind up shipping a gratuitous amount of data to the new process! On the other hand, for something like the channels, which also have a recursive parent-child relationship with their owning node, the channel is never shipped off sans parent anyhow, so purging the owner in __getstate__ was just an awkward accommodation for the hierarchical storage back-ends.
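
A minimal sketch of that de-parenting, assuming a node-like class with a parent attribute (illustrative only, not pyiron_workflow's actual implementation):

class Node:
    def __init__(self, parent=None):
        self.parent = parent

    def __getstate__(self):
        # Copy the instance state and sever the upward link, so serializing
        # one node doesn't drag the whole parent graph along with it
        state = self.__dict__.copy()
        state["parent"] = None
        return state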

Oh, I agree that generally we don't want this; I just thought it could solve this local problem. I made a small prototype today that uses __reduce__ to implement Storable, which could be the ultimate fallback.

🚀 Nice! If we can replace Storable with __reduce__ one way or another, then I think we're in business. I'm in the process of purging the h5io and tinybase back-ends right now (this will allow me to get rid of the contrib and base dependence), but I'll leave the tests set up so that we can plug in a different storage interface and put it through its paces easily.

liamhuber commented 1 month ago

Serialization is now officially supported only via (cloud)pickle and is working stably. The ideas here for h5-based storage are worth reading, but the issue itself is now gone.