kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 877 forks source link

Clean up `catalog.datasets` and `catalog._data_sets` #2999

Open noklam opened 11 months ago

noklam commented 11 months ago

Description

In #2998, an example is added to access metadata in data catalog. This is useful when user need to extend Kedro, i.e. kedro-viz or plugin developers. We provide an example as follow

class MetadataHook:
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
    ):
        for dataset_name, dataset in catalog.datasets.__dict__.items():
            print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

It used an internal __dict__ which is not ideal. On the other hand, we also want to improve the usability of the class in standalone or interactive mode

They are used inconsistently in the codebase and all behavior slightly differently. catalog.datasets is the only "public" one which allow user to interact with the "read-only" FrozenDataset.

However, there is caveat to use catalog.datasets. For example, if you have transcoding dataset, it get converted to some internal non-readable string.

for dataset_name, dataset in catalog.datasets.__dict__.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

With this hook, if you have a dataset name X_train@spark, the dataset_name will be X_train__spark which is unexpected when you try to access the dictionary. On the other hand, catalog.datasets is not subscriptable, so you cannot access it by index or name.

This is due to the addition of the catalog.dataset_name syntax, you cannot have a Python property with @ or # in it, and it should be a valid Python name (mostly designed for interactive mode). The conversion happens here: https://github.com/kedro-org/kedro/blob/0293dc15812b27330bba31a01c7b332b3165af2a/kedro/io/data_catalog.py#L89-L99

All of these lead to a conclusion that we should try to simply the interface (both public and internal). DataCatalog and it attributes are non-readable, it make it really hard to work in interactive mode because you cannot see what's inside the attribute without looking at the source code.

The main challenge here is to make things consistent without breaking change (or if we need breaking change we should do it now before 0.19). We should try to keep catalog.datasets since this it is the public interface people relies on. Things that I am not sure about is that also tightly couple to kedro-viz? We should check when this is implemented.

Possible Implementation

We can improve the usablility with auto-completion and using dataclasses or make these object inherit from things like UserDict (We did similar thing for pipelines).

Possible Alternatives

noklam commented 10 months ago

Initial attempt to make _FrozenDataset a "UserDict" or dictionary equivalent interface work fines.

Instead of

for dataset_name, dataset in catalog.datasets.__dict__.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

we can do

for dataset_name, dataset in catalog.datasets.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

We can do more design for this, but I think it's relatively flexible to change these because they are all internal class and won't cause any breaking issue.

This won't clean up datasets and _data_sets immediately, but would provide a cleaner user interface (interactive mode). User will still need _get_dataset for some other thing.