Closed albertz closed 1 month ago
I'm running apptek_asr CI w/ the latest commit here to see if AppTek was affected by the bug.
Btw, I just noticed, our extract_paths
has the same problem. I actually wonder, why does it not use get_object_state
, but instead its own logic? Should I also fix this in this PR here, or make a separate PR for this (I think I would just use get_object_state
there)?
-> @curufinwe said separate PR.
So, it seems like w/ the new implementation pathlib.Path
hashes differently. Old sis gave:
>>> p = pathlib.Path("/home/mgunz/src/apptek_asr/[...]/test-hash.py")
>>> from sisyphus.hash import get_object_state
>>> get_object_state(p)
{}
New sis gives:
>>> p = pathlib.Path("/home/mgunz/src/apptek_asr/[...]/test-hash.py")
>>> from sisyphus.hash import get_object_state
>>> get_object_state(p)
{'_drv': '', '_parts': ['/', 'home', 'mgunz', 'src', 'apptek_asr', ..., 'test-hash.py'], '_root': '/'}
The reason for this is that if you instantiate a pathlib.Path
, on unix systems you get
class PosixPath(Path, PurePosixPath):
"""Path subclass for non-Windows systems.
On a POSIX system, instantiating a Path should return this object.
"""
__slots__ = ()
Notice the empty slots. Before this PR, sisyphus would therefore not take into account any properties for the hash. get_object_state(pathlib.Path)
would give you (PosixPath, (dict))
. Now, w/ the changes you instead get PosixPath, (dict, (tuple, (str, '_cached_cparts'), (list, (str, '/'), (str, 'nas'), (str, 'data'), (str, 'speech'), ...)), (tuple, (str, '_drv'), (str, '')), (tuple, (str, '_parts'), (list, (str, '/'), (str, 'nas'), (str, 'data'), (str, 'speech'), ...)), (tuple, (str, '_root'), (str, '/'))))
.
All in all the previous behavior was extremely broken, so I think we should merge this as-is and accept the breakage.
Ah, PosixPath
seems to behave somewhat bad in both cases though (although some state is obviously better than no state at all). The state you get there with the new code doesn't really seem to be so stable across Python versions.
~But I also wonder, the __slots__
is always empty? That means, all the state should be in __dict__
? So then the old code should have worked already, or not?~
__slots__
is not empty. It derives from PurePath
, and that defines:
__slots__ = (
'_drv', '_root', '_parts',
'_str', '_hash', '_pparts', '_cached_cparts',
)
Btw, also please say what Python version you used. For the new code, the Python version should not make a difference (well, despite if the object changed its internal representation - that might also be the case for PosixPath
, I don't know). For the old code, it will also be different depending on Python version. With Python >=3.11, I think the state was not empty for PosixPath
. Maybe check that as well.
In any case, maybe it makes sense to treat PosixPath
as a special case anyway, and just store its path as state? (But that would be a separate PR then.)
Also, I wonder, is that correct that you have PosixPath
as argument here? Maybe you actually want to have a Sisyphus Path
instead?
In general, yes. A sisyphus Path would be preferable, but we should make sure that sisyphus also works correctly with PosixPath. I would not mind handling it separately, but then we should do it in this PR to avoid having another commit than changes hashes.
The python version was 3.10 in the case in question.
Ok, I added special case for pathlib.PurePath
.
Fix https://github.com/rwth-i6/sisyphus/issues/207
See #207 for some discussion.
Note that this slightly changes the behavior, but only for cases which were basically broken before.
E.g., for
functools.partial
, before this change, none of the relevant args (func
,args
,keywords
) were ever part of the hash. So this change should not really break anything.The same is also true for cases where there was both a
__dict__
and__slots__
. With Python <=3.10, it would have used only the__dict__
, and very likely, that was just empty, and does not contain any of the relevant things (which are very likely all in the__slots__
), so it was totally broken and basically unusable. With Python >=3.11, there is a default__getstate__
now, which would already do the correct thing, at least for "normal" (non-native) types, i.e. merge__dict__
and__slots__
, i.e. our existingget_object_state
was correct for those cases. With this change, Python <=3.10 is also correct now.In any case, we check through
dir
now, and we check for member descriptors, which should also cover such native types (likefunctools.partial
).