rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0
45 stars 25 forks source link

hash get_object_state, use dir for native types, fix __slots__ #208

Closed albertz closed 1 month ago

albertz commented 1 month ago

Fix https://github.com/rwth-i6/sisyphus/issues/207

See #207 for some discussion.

Note that this slightly changes the behavior, but only for cases which were basically broken before.

E.g., for functools.partial, before this change, none of the relevant args (func, args, keywords) were ever part of the hash. So this change should not really break anything.

The same is also true for cases where there was both a __dict__ and __slots__. With Python <=3.10, it would have used only the __dict__, and very likely, that was just empty, and does not contain any of the relevant things (which are very likely all in the __slots__), so it was totally broken and basically unusable. With Python >=3.11, there is a default __getstate__ now, which would already do the correct thing, at least for "normal" (non-native) types, i.e. merge __dict__ and __slots__, i.e. our existing get_object_state was correct for those cases. With this change, Python <=3.10 is also correct now.

In any case, we check through dir now, and we check for member descriptors, which should also cover such native types (like functools.partial).

NeoLegends commented 1 month ago

I'm running apptek_asr CI w/ the latest commit here to see if AppTek was affected by the bug.

albertz commented 1 month ago

Btw, I just noticed, our extract_paths has the same problem. I actually wonder, why does it not use get_object_state, but instead its own logic? Should I also fix this in this PR here, or make a separate PR for this (I think I would just use get_object_state there)?

-> @curufinwe said separate PR.

NeoLegends commented 1 month ago

So, it seems like w/ the new implementation pathlib.Path hashes differently. Old sis gave:

>>> p = pathlib.Path("/home/mgunz/src/apptek_asr/[...]/test-hash.py")
>>> from sisyphus.hash import get_object_state
>>> get_object_state(p)
{}

New sis gives:

>>> p = pathlib.Path("/home/mgunz/src/apptek_asr/[...]/test-hash.py")
>>> from sisyphus.hash import get_object_state
>>> get_object_state(p)
{'_drv': '', '_parts': ['/', 'home', 'mgunz', 'src', 'apptek_asr', ..., 'test-hash.py'], '_root': '/'}

The reason for this is that if you instantiate a pathlib.Path, on unix systems you get

class PosixPath(Path, PurePosixPath):
    """Path subclass for non-Windows systems.

    On a POSIX system, instantiating a Path should return this object.
    """
    __slots__ = ()

Notice the empty slots. Before this PR, sisyphus would therefore not take into account any properties for the hash. get_object_state(pathlib.Path) would give you (PosixPath, (dict)). Now, w/ the changes you instead get PosixPath, (dict, (tuple, (str, '_cached_cparts'), (list, (str, '/'), (str, 'nas'), (str, 'data'), (str, 'speech'), ...)), (tuple, (str, '_drv'), (str, '')), (tuple, (str, '_parts'), (list, (str, '/'), (str, 'nas'), (str, 'data'), (str, 'speech'), ...)), (tuple, (str, '_root'), (str, '/')))).

All in all the previous behavior was extremely broken, so I think we should merge this as-is and accept the breakage.

albertz commented 1 month ago

Ah, PosixPath seems to behave somewhat bad in both cases though (although some state is obviously better than no state at all). The state you get there with the new code doesn't really seem to be so stable across Python versions.

~But I also wonder, the __slots__ is always empty? That means, all the state should be in __dict__? So then the old code should have worked already, or not?~ __slots__ is not empty. It derives from PurePath, and that defines:

    __slots__ = (
        '_drv', '_root', '_parts',
        '_str', '_hash', '_pparts', '_cached_cparts',
    )

Btw, also please say what Python version you used. For the new code, the Python version should not make a difference (well, despite if the object changed its internal representation - that might also be the case for PosixPath, I don't know). For the old code, it will also be different depending on Python version. With Python >=3.11, I think the state was not empty for PosixPath. Maybe check that as well.

In any case, maybe it makes sense to treat PosixPath as a special case anyway, and just store its path as state? (But that would be a separate PR then.)

Also, I wonder, is that correct that you have PosixPath as argument here? Maybe you actually want to have a Sisyphus Path instead?

curufinwe commented 1 month ago

In general, yes. A sisyphus Path would be preferable, but we should make sure that sisyphus also works correctly with PosixPath. I would not mind handling it separately, but then we should do it in this PR to avoid having another commit than changes hashes.

michelwi commented 1 month ago

The python version was 3.10 in the case in question.

albertz commented 1 month ago

Ok, I added special case for pathlib.PurePath.