Hi Jörg,
I'll have to dig in deep next week. The initialization error looks like it's accessing an already saved copy of the node -- is there a save file there with the same name? If you previously tried and failed to save the node maybe it wrote a partial file?
I'm also not surprised the .save() might return a cyclicity error -- it's using the H5 backend, which is still pretty janky. You could try switching the backend ("h5io" and "tinybase"). What I was talking about today was pickle, which has no special interface, but you need to call it explicitly: pickle.dump(...
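For concreteness, the explicit route might look something like the following sketch (the node and file name here are just placeholders, not anything from your workflow):
import pickle
from pyiron_workflow import Workflow

n = Workflow.create.standard.UserInput(42, label="my_node")

# Write the node to a file of our choosing...
with open("my_node.pickle", "wb") as f:
    pickle.dump(n, f)

# ...and load it back later, in this or another python session
with open("my_node.pickle", "rb") as f:
    reloaded = pickle.load(f)

print(reloaded.label)
>>> my_node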
Thanks @liamhuber. Deleting the stored file helped. Attached a case where dump fails:
from pyiron_workflow import Workflow
# InputElasticTensor and the atomistic node package used below come from pyiron_nodes

supercell = Workflow.create.atomistic.structure.build.cubic_bulk_cell(
    element='Ni', cell_size=3, vacancy_index=0
)
m3gnet = Workflow.create.atomistic.engine.ase.M3GNet()
elastic_constants = atomistic.property.elastic.elastic_constants(
    structure=supercell,
    engine=m3gnet,
    parameters=InputElasticTensor(),
)
out = elastic_constants.pull()

import pickle
pickle.dumps(m3gnet)
# fails with: PicklingError: Can't pickle <function _lambdifygenerated at 0x30257b880>:
# attribute lookup _lambdifygenerated on torch failed
@JNmpi, this is a much easier case to solve. This is not an error with our infrastructure -- that's completely pickleable -- but rather with the output data of the M3GNet node. Under the hood, some part of the engine object looks like it's using a lambda function, and these are fundamentally un-pickleable. However, it is still a situation we want to be able to tackle. There are a few possible solutions, mostly outside our code-base, where the most straightforward and robust is to simply use cloudpickle instead.
The big take-home for me is that https://github.com/pyiron/pyiron_workflow/pull/408, which introduces a pickle backend to Node.save(), should promptly be extended to fall back on cloudpickle.
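Such a fallback could be as simple as the following rough sketch (a hypothetical helper for illustration, not the actual implementation in that PR):
import pickle
import cloudpickle

def dump_with_fallback(obj, path):
    # Prefer the faster standard pickle, and only fall back to cloudpickle
    # when standard pickling fails (e.g. on lambdas like the one above)
    try:
        data = pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        data = cloudpickle.dumps(obj)
    with open(path, "wb") as f:
        f.write(data)
Since cloudpickle only customizes the dumping side, the result can be read back with a plain pickle.load either way.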
By (de)serializing before and after running, and trying to work with the actual output value itself, we can quickly see the issue is with the un-pickleable output value:
import pickle
import cloudpickle
from pyiron_nodes.atomistic.engine.ase import M3GNet
n = M3GNet()
# Pickle works fine before we execute the node
print(pickle.loads(pickle.dumps(n)).label)
>>> M3GNet
n()
try:
    pickle.dumps(n.outputs.engine.value)
except pickle.PicklingError:
    print("The output data itself causes trouble")
>>> The output data itself causes trouble
# Cloudpickle works fine even with the data
print(cloudpickle.loads(cloudpickle.dumps(n)).label)
>>> M3GNet
There are a few options:
1. Use cloudpickle. This is robust and straightforward, but since pickling is not a hierarchical data format, it means we need to cloudpickle the entire workflow as soon as one IO element is not pickleable. Since cloudpickle is slower than pickle, this is an annoying downside.
2. Fix matgl. There might be a chance, but it looks like maybe the offending code is in pytorch. Not only would we have to understand that code well enough to "fix" it, but we'd also need to convince the owners to merge our fix. Unlikely.
3. Fix, then use tinybase storage. @pmrv's tinybase storage is a lot like h5io, but when it fails it should fall back on cloudpickle. In principle, this resolves the downside to (1), where we only cloudpickle the offensive piece and everything else gets a prettier serialization. In practice, not only does this raise the same pickling error (a surprise to me), but it also leaves a corrupted save file around that caused your initial recursion problem (this is for sure my fault for being sloppy in a failure case somewhere). The "fix" in the opening sentence does a lot of work here, and this is also an expensive attack.
Even if we did fix the current issue with tinybase so it correctly cloudpickles the engine object here, it is still built on top of h5io and thus inherits some weaknesses. One of the big ones is that tinybase only "fails", and thus falls back on cloudpickle, when h5io fails explicitly, but there are cases where h5io fails in a silent and pernicious way. E.g.:
from pyiron_snippets.dotdict import DotDict
from pyiron_workflow import Workflow

n = Workflow.create.standard.UserInput(DotDict({"a": 42}), label="h5io_fail")
n.save()
# Instantiating a node with the same label picks the saved file back up
reloaded = Workflow.create.standard.UserInput(label="h5io_fail")

print(type(n.inputs.user_input.value))
>>> <class 'pyiron_snippets.dotdict.DotDict'>
print(type(reloaded.inputs.user_input.value))
>>> <class 'dict'>
That's not the class I saved! So "fix" in (3) largely amounts to our (failed) SDG proposal. It is definitely possible, but lots of work.
> 3. Fix, then use tinybase storage. @pmrv's tinybase storage is a lot like h5io, but when it fails it should fall back on cloudpickle. In principle, this resolves the downside to (1), where we only cloudpickle the offensive piece and everything else gets a prettier serialization. In practice, not only does this raise the same pickling error (a surprise to me), but it also leaves a corrupted save file around that caused your initial recursion problem (this is for sure my fault for being sloppy in a failure case somewhere). The "fix" in the opening sentence does a lot of work here, and this is also an expensive attack.
This is likely because tinybase storage also uses normal pickle as the fallback; moving it to cloudpickle would be straightforward, though. If there's continued interest in the storage from the workflow side, we can think about moving it from contrib to its own place (it is somewhat uncoupled from the rest of tinybase).
> Even if we did fix the current issue with tinybase so it correctly cloudpickles the engine object here, it is still built on top of h5io and thus inherits some weaknesses. One of the big ones is that tinybase only "fails", and thus falls back on cloudpickle, when h5io fails explicitly, but there are cases where h5io fails in a silent and pernicious way. E.g.:
This could be fixed in h5io{,_browser} somewhat easily I think (by tightening the typechecks), if we deem it a priority, but also just implementing Storable on DotDict should sidestep the issue. It depends a bit on what else is a problem, but if it is only about DotDict, we could even make some very generic wrappers in tinybase for any Mapping, so that your workflow code doesn't have to do it.
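As a rough illustration of what such a wrapper could do (hypothetical names only -- this is not the actual tinybase Storable interface), the idea is just to flatten any Mapping to a plain dict for storage and rebuild the original type on load:
from collections.abc import Mapping

class MappingWrapper:
    # Hypothetical helper: remember the concrete mapping class and a plain-dict copy of its data
    def __init__(self, obj: Mapping):
        self.cls = type(obj)
        self.data = dict(obj)

    def restore(self) -> Mapping:
        # Rebuild the original mapping type from the plain dict
        return self.cls(self.data)
E.g. MappingWrapper(DotDict({"a": 42})).restore() gives back a DotDict, assuming the mapping type can be constructed from a plain dict.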
> This could be fixed in h5io{,_browser} somewhat easily I think (by tightening the typechecks)

Yes, this is possible. h5io actually early-exits on isinstance checks, so it is not a simple extension of the final if clause in its logical flow (like our additions for custom classes were), but involves going back and modifying the flow to account for subclassing consistently through the entire routine.
> if we deem it a priority, but also just implementing Storable on DotDict should sidestep the issue

Absolutely, and this extensibility is something I really like in the tinybase design. In this case, I wouldn't want to do it on DotDict itself, as pyiron_snippets should have no awareness of tinybase; we would need to sub-class DotDict here. More generally, needing to implement Storable is a show-stopper for me -- the crux of the issue here is that we want a tool that will play nicely with (more or less^1) arbitrary user data running through workflows as IO, so requiring them to be aware of and implement some extra storage method is no good.
> It depends a bit on what else is a problem, but if it is only about DotDict ...

Like I alluded to earlier, the fundamental problem is that h5io is doing early-stopping when what's passed to it passes an isinstance check on any of its white-listed types. I'm sure we could work around this, but any tool building on h5io needs to make sure it takes care of this on its own, because otherwise h5io will silently do the wrong thing (sidestepping tinybase's plan of falling back on another tool when h5io fails). Then there are lesser issues like h5io not handling recursion^2.
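To make the early-stopping concrete, here is a small illustration (it assumes, as the reload-as-dict behaviour above suggests, that DotDict subclasses dict):
from pyiron_snippets.dotdict import DotDict

d = DotDict({"a": 42})
# An isinstance check against the whitelist matches, so the subclass gets stored as a plain dict
print(isinstance(d, dict))
>>> True
# An exact-type check would reject the subclass and leave room for a fallback
print(type(d) is dict)
>>> False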
This is not to say h5io is bad or "wrong" -- if you're passing it any of its whitelisted datatypes it behaves brilliantly. It fills its design purpose well. The problem for us is just that it was never designed to be a generic storage routine the same way pickle is, and we want something users can pass all sorts of data to. Even the extensions for (limited) custom class storage were something our group tacked onto it.
I agree it should be possible to build a (nearly) universal tool on top of h5io with some combination of modifications there and extra careful pre- and post-processing for what gets passed to it, but I think it would be less work and more robust to just design an interface to h5py that intends to be a universal^1 interface to h5 from the beginning.
1) Truly "arbitrary" data is hard, but I think it's fair and best to use the existing pickle conditions for baseline behaviour, and ideally to fall back on cloudpickle if the user passes something that doesn't comply with that.
2) Where "recursion" means
class Child:
    def __init__(self, parent):
        self.parent = parent

class Parent:
    def __init__(self):
        self.child = Child(self)

Parent()
> This could be fixed in h5io{,_browser} somewhat easily I think (by tightening the typechecks)
> Yes, this is possible. h5io actually early-exits on isinstance checks, so it is not a simple extension of the final if clause in its logical flow (like our additions for custom classes were), but involves going back and modifying the flow to account for subclassing consistently through the entire routine.
> It depends a bit on what else is a problem, but if it is only about DotDict ...
> Like I alluded to earlier, the fundamental problem is that h5io is doing early-stopping when what's passed to it passes an isinstance check on any of its white-listed types. I'm sure we could work around this, but any tool building on h5io needs to make sure it takes care of this on its own, because otherwise h5io will silently do the wrong thing (sidestepping tinybase's plan of falling back on another tool when h5io fails). Then there are lesser issues like h5io not handling recursion^2.
The type instability is a problem. I was considering suggesting upstream that we simply switch the isinstance checks to type(...) == ... checks, potentially behind a switch, but obviously this needs discussion with the maintainer. Though, I do think that likely, to make a clean abstraction between the physical format and our interface layer, we'll need something of an explicit type whitelist anyway. So in that case we could sidestep this particular issue as well, by only passing known good types directly to h5io and refusing/Storable'ing everything else.
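A whitelist-based dispatch along those lines might look something like this rough sketch (the names are hypothetical, not the real tinybase or h5io interfaces):
# Hypothetical whitelist of types we trust h5io to round-trip exactly
H5IO_SAFE_TYPES = (int, float, str, bool, dict, list, tuple)

def choose_backend(value):
    # Deliberately an exact type check, not isinstance, so that subclasses
    # like DotDict are not silently stored as their whitelisted parent type
    if type(value) in H5IO_SAFE_TYPES:
        return "h5io"
    return "fallback"  # e.g. Storable / cloudpickle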
> if we deem it a priority, but also just implementing Storable on DotDict should sidestep the issue
> Absolutely, and this extensibility is something I really like in the tinybase design. In this case, I wouldn't want to do it on DotDict itself, as pyiron_snippets should have no awareness of tinybase; we would need to sub-class DotDict here. More generally, needing to implement Storable is a show-stopper for me -- the crux of the issue here is that we want a tool that will play nicely with (more or less^1) arbitrary user data running through workflows as IO, so requiring them to be aware of and implement some extra storage method is no good.
Oh I agree that generally we don't want this, just thought it could solve this local problem. I made a small prototype today that uses __reduce__ to implement Storable, which could be the ultimate fallback.
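As a rough illustration of the idea (not the actual prototype), __reduce__ already tells a backend how to rebuild an object as a callable plus its arguments, which can then be recorded in whatever format the backend likes:
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __reduce__(self):
        # A callable and the arguments needed to rebuild this object
        return (Point, (self.x, self.y))

# A storage backend could persist just the pieces returned by __reduce__ ...
constructor, args = Point(1, 2).__reduce__()
# ... and later rebuild the object from them
rebuilt = constructor(*args)
print(rebuilt.x, rebuilt.y)
>>> 1 2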
> This is not to say h5io is bad or "wrong" -- if you're passing it any of its whitelisted datatypes it behaves brilliantly. It fills its design purpose well. The problem for us is just that it was never designed to be a generic storage routine the same way pickle is, and we want something users can pass all sorts of data to. Even the extensions for (limited) custom class storage were something our group tacked onto it.
> I agree it should be possible to build a (nearly) universal tool on top of h5io with some combination of modifications there and extra careful pre- and post-processing for what gets passed to it, but I think it would be less work and more robust to just design an interface to h5py that intends to be a universal^1 interface to h5 from the beginning.
I'm not married to h5io; likely an h5py-based implementation of GenericStorage wouldn't be dramatically more complicated.
> 2) Where "recursion" means
>     class Child:
>         def __init__(self, parent):
>             self.parent = parent
>     class Parent:
>         def __init__(self):
>             self.child = Child(self)
>     Parent()
Cyclic dependencies are tricky and I don't think we'll get that in a hierarchical format easily. (It should be possible, but annoying to implement. I think pickle does it by memo-izing already written objects and adding back-references to them.) Are those cases we assume for user data? Because if we think this will mostly come from our infrastructure, it'll be easier to overload __getstate__ or implement Storable for these cases.
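For reference, pickle's memoization does handle the Parent/Child cycle from the footnote above -- a quick check with the same toy classes:
import pickle

class Child:
    def __init__(self, parent):
        self.parent = parent

class Parent:
    def __init__(self):
        self.child = Child(self)

p = pickle.loads(pickle.dumps(Parent()))
# The back-reference is restored to the very same reloaded object
print(p.child.parent is p)
>>> True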
> I was considering suggesting upstream that we simply switch the isinstance checks to type(...) == ... checks, potentially behind a switch, but obviously this needs discussion with the maintainer.

💯 I don't think he'd be closed to it either. I don't have the time to go doing it, especially with even a small risk it might get rejected regardless. But that's the direction to go IMO.
> Though, I do think that likely, to make a clean abstraction between the physical format and our interface layer, we'll need something of an explicit type whitelist anyway. ... Are those cases we assume for user data? Because if we think this will mostly come from our infrastructure...

This is the heart of the problem: the ideal solution lets users leverage the workflows with whatever data they want, and serialization should just deal with it. I agree we need some sort of interface requirement, but I think demanding anything beyond "it's pickle-able" is asking too much.
> I'm not married to h5io; likely an h5py-based implementation of GenericStorage wouldn't be dramatically more complicated.

I'm also not dead-set against upgrading h5io to meet our requirements, I just think that the technological and organizational overhead is high enough that greenfielding something that's designed to meet our needs from the start is a net win.
> Cyclic dependencies are tricky and I don't think we'll get that in a hierarchical format easily. ...it'll be easier to overload __getstate__ or implement Storable for these cases.

💯 When I was trying to think about how to manage it for the hypothetical SDG, the best I came up with was a memoization scheme where we store an intermediate access-map (formatted like the object structure) and put multiply-accessed data in a single location whenever the user goes looking at a map location. For hierarchical storage it -- at minimum -- destroys the storage lining up 1:1 with the storage access.
Overloading __getstate__ is exactly how I've been doing it here, so that the h5io and tinybase backends will work at all. In some situations we actually really want this sort of de-parenting: e.g. if you want to serialize a node to ship it to a python process, you don't want __getstate__ to find the parent, and then the parents recursively upwards forever -- you'd wind up shipping a gratuitous amount of data to the new process! On the other hand, for something like the channels, which also have a recursive parent-child relationship with their owning node, the channel is never being shipped off sans parent anyhow, so purging the owner in __getstate__ was just an awkward accommodation for the hierarchical storage back-ends.
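For illustration, the de-parenting trick amounts to something like this generic sketch (toy classes, not the actual pyiron_workflow implementation):
import pickle

class ToyNode:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent  # back-reference that creates the cycle

    def __getstate__(self):
        # Drop the back-reference so serialization doesn't drag the whole
        # parent hierarchy along with the object being shipped
        state = self.__dict__.copy()
        state["parent"] = None
        return state

wf = ToyNode("workflow")
child = ToyNode("child", parent=wf)
restored = pickle.loads(pickle.dumps(child))
print(restored.parent)
>>> None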
> Oh I agree that generally we don't want this, just thought it could solve this local problem. I made a small prototype today that uses __reduce__ to implement Storable, which could be the ultimate fallback.

Nice! If we can replace Storable with __reduce__ one way or another, then I think we're in business. I'm in the process of purging the h5io and tinybase back-ends right now (this will allow me to get rid of contrib and base dependence), but I'll leave the tests set up so that we can plug in a different storage interface and put it through its paces easily.
Serialization is now officially only via (cloud)pickle and is working stably. The ideas here for h5-based storage are worth reading, but the issue itself is now gone.
I have updated pyiron_workflow to the latest version in main. When running
engine = Workflow.create.atomistic.engine.ase.M3GNet()
I get the lengthy error message attached at the end. With my older pyiron_workflow version this worked, but I got the recursion error when calling:
engine.save()
I guess that the fact that I get the error already when just creating the node is related to the new cache functionality, which internally does something similar to the save statement.