kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks source link

Expose `load` and `save` publicly for each dataset #3920

Open deepyaman opened 3 weeks ago

deepyaman commented 3 weeks ago

Description

Resolves #2199

In discussing #2199, the question arose, why _load and _save need to be private methods. This PR exposes them publicly, choosing to "decorate" or wrap the load and save functionality for each dataset, rather than calling them as private methods from the public base implementation.

The changes are made so as to require minimal changes to existing datasets and be backwards-compatible with custom datasets users may have written.

Development notes

Update(6/27): It's no longer necessary to write load = _load, save = _save; this is handled in __init_subclass. That said, we can still roll out enforcement of users providing load and save as described below, as desired. It's actually much less pressing, but may help simplify the code a bit. :)

Previously on this PR...

I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:

class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save

If we want to take things slow, in the next minor release of Kedro (0.20.0), we can start raising DeprecationWarnings for datasets that don't implement save. In 0.21.0, we can drop support for datasets that don't define save (and just have the older _save).

The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

datajoely commented 3 weeks ago

Love this

Galileo-Galilei commented 3 weeks ago

This is a brilliant idea 😍 , I am jealous of not having it myself ;) I always have to explain to users who create a custom dataset that they should implement _load and _save instead of load and save, and some less advanced users struggle to get the point and this would likely help.

One small caveat is that I don't know if this would have unintended consequences on MlflowArtifactDataset which also wraps the methods (I think it shouldn't, but I'd rather check)

deepyaman commented 3 weeks ago

One small caveat is that I don't know if this would have unintended consequences on MlflowArtifactDataset which also wraps the methods (I think it shouldn't, but I'd rather check)

Will be great if you can check that! I did guard against wrapping datasets that inherit load or save multiple times, but MlflowArtifactDataset implementation looks quite different, so I'll defer to you. :)

astrojuanlu commented 1 week ago

I can't comment much on the implementation for now but I've been a long advocate of making load and save public, so big +1 from me!

We should be mindful how this affects the way people define custom datasets. Will they still need to implement _load and _save? If so, do we need to adjust our docs? And finally, is it too early to think of a world in which _load and _save are actually no longer necessary?

deepyaman commented 1 week ago

We should be mindful how this affects the way people define custom datasets. Will they still need to implement _load and _save? If so, do we need to adjust our docs? And finally, is it too early to think of a world in which _load and _save are actually no longer necessary?

I'd included some information in the development notes above, but let me copy them below:

I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:

class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save

If we want to take things slow, in the next minor release of Kedro (0.20.0), we can start raising DeprecationWarnings for datasets that don't implement load or save. In 0.21.0, we can drop support for datasets that don't define load or save (and just have the older _load or _save).

The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.

rashidakanchwala commented 6 days ago

this change should not affect Kedro-Viz, as we don't use any private _load and _save methods anymore. For tests, we use the public methods, and for previews - we now use the public preview method.

Galileo-Galilei commented 5 days ago

Hi, sorry there is something I don't get. I am trying to do this (inside spaceflights-pandas starter, this code is copy pasted from tour example in the docstring):

from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataset

class MyOwnDataset(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)

ds = MyOwnDataset(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)

ds.load()

And I am getting the error :

TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save

I've double checked and I have the correct developement version installed. while debugging I am entering the __init_subclass method though :confused: so I don't really get if it's how it is intended to be used. I don't see anywhere in the code where you monkeypatch _load and ````_save in case they are not defined in the custom dataset, but maybe I am missing something.

deepyaman commented 5 days ago
TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save

I've double checked and I have the correct developement version installed. while debugging I am entering the __init_subclass method though πŸ˜• so I don't really get if it's how it is intended to be used. I don't see anywhere in the code where you monkeypatch _load and ````_save in case they are not defined in the custom dataset, but maybe I am missing something.

~Ah, this is probably right... let me try to take a look and patch that. Can probably just throw in some dummy method, since it shouldn't actually be hit.~

Edit: Actually, thinking about it more, should just make load and save abstract instead of _load and _save, to make the new interface clear. Even if just _load and _save is being implemented, you mentioned it's going into __init_subclass__, and load and save will get created there.

deepyaman commented 3 days ago

@Galileo-Galilei The specific (massive) bug you pointed out should be resolved, and, in trying to figure out how to best resolve it, I think I've also made the code MUCH cleaner. :)

There's no duplicate load/save handling logic, and the __init_subclass__:

  1. Aliases _load to load, _save to save, if it is defined for the class.
  2. Does as before (apply the load wrapper to the load, save wrapper to the save).

There's no more, what do we do if _load is defined, and how that is different from what we do when save is defined. Also, load and save are marked abstract, which makes sense.

merelcht commented 3 days ago

I don't really like how this now adds inconsistency between the abstract methods that need to be implemented. save and load vs. _describe.

deepyaman commented 3 days ago

I don't really like how this now adds inconsistency between the abstract methods that need to be implemented. save and load vs. _describe.

I'd be willing to make _describe public in a follow-up PR? No objection from my side, if that's desirable; I see no real reason _describe should be a private API.

However, I think that load and save are publicly exposed already, whereas _describe is not, which is what drives this.