kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.89k stars 897 forks source link

ShareMemoryDataset does not have `exists()` method #3703

Closed noklam closed 1 week ago

noklam commented 6 months ago

Description

image

Context

Unclear, the SharedMemoryDataset is implemented by the core team and generating warning when run with ParallelRunner, is this false alarm or something we need to address? Cc @merelcht

Steps to Reproduce

I was testing something else and see the warning with this command: pytest tests/framework/session/test_session_extension_hooks.py::TestNodeHooks::test_on_node_error_hook_parallel_runner

Possible Solution:

https://github.com/kedro-org/kedro/blob/0fc8089b637a0679f71e2bddc91f0676fc2914a2/kedro/io/memory_dataset.py#L74C1-L75C40 We have a simple check in MemoryDataset, we could implement simliar logic for ShareMemoryDataset.

Check whether it makes sense to have an _exists() method for SharedMemoryDataset. If not: update the logging level to be of DEBUG.

merelcht commented 6 months ago

The exists() method is not an abstract method and so doesn't have to be implemented. The original SharedMemoryDataset was wrapped around MemoryDataset directly so I guess that's why the warning wasn't triggered before https://github.com/kedro-org/kedro/pull/3332/files#diff-cc4e916905ba553d257700ff851973772c0d77018d7673a2c4b22f309d8c3e46. IMO we can just ignore this.

noklam commented 6 months ago

@merelcht Haven't look into this in details, would the alternative be changing that WARNING message to INFO instead?

merelcht commented 1 month ago

Reported again recently: https://github.com/kedro-org/kedro/issues/4060

Updated the scope to double check where exists is called and if it makes sense to implement this method on SharedMemoryDataset.

Also related: https://github.com/kedro-org/kedro/issues/4078

felixscherz commented 1 month ago

Hi, I investigated this for a bit and I believe the issue is that SharedMemoryDataset delegates most methods to the underlying MemoryDataset via __getattr__. https://github.com/kedro-org/kedro/blob/56e56a783e4b79e96d09c20918a23bb3e11976d1/kedro/io/shared_memory_dataset.py#L33-L37

However, __getattr__ is only called for attributes that do not exist on the object and since SharedMemoryDataset inherits AbstractDataset._exists __getattr__ is never called.

We could implement _exists on SharedMemoryDataset, we would have to handle the case where self.shared_memory_dataset is None (return False in this case?):

class SharedMemoryDataset(AbstractDataset):
    ...
    def _exists(self) -> bool:
        if not self.shared_memory_dataset:
            return False
        return self.shared_memory_dataset.exists()

I'm happy to submit a PR implementing this:)