Open deepyaman opened 3 weeks ago
Love this
This is a brilliant idea π , I am jealous of not having it myself ;) I always have to explain to users who create a custom dataset that they should implement _load
and _save
instead of load
and save
, and some less advanced users struggle to get the point and this would likely help.
One small caveat is that I don't know if this would have unintended consequences on MlflowArtifactDataset
which also wraps the methods (I think it shouldn't, but I'd rather check)
One small caveat is that I don't know if this would have unintended consequences on
MlflowArtifactDataset
which also wraps the methods (I think it shouldn't, but I'd rather check)
Will be great if you can check that! I did guard against wrapping datasets that inherit load
or save
multiple times, but MlflowArtifactDataset
implementation looks quite different, so I'll defer to you. :)
I can't comment much on the implementation for now but I've been a long advocate of making load
and save
public, so big +1 from me!
We should be mindful how this affects the way people define custom datasets. Will they still need to implement _load
and _save
? If so, do we need to adjust our docs? And finally, is it too early to think of a world in which _load
and _save
are actually no longer necessary?
We should be mindful how this affects the way people define custom datasets. Will they still need to implement
_load
and_save
? If so, do we need to adjust our docs? And finally, is it too early to think of a world in which_load
and_save
are actually no longer necessary?
I'd included some information in the development notes above, but let me copy them below:
I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:
class MyDataset(...):
def _load(...) -> ...:
...
load = _load
def _save(...) -> ...:
...
save = _save
If we want to take things slow, in the next minor release of Kedro (0.20.0
), we can start raising DeprecationWarning
s for datasets that don't implement load
or save
. In 0.21.0
, we can drop support for datasets that don't define load
or save
(and just have the older _load
or _save
).
The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.
this change should not affect Kedro-Viz, as we don't use any private _load
and _save
methods anymore. For tests, we use the public methods, and for previews - we now use the public preview
method.
Hi, sorry there is something I don't get. I am trying to do this (inside spaceflights-pandas
starter, this code is copy pasted from tour example in the docstring):
from pathlib import Path
import pandas as pd
from kedro.io import AbstractDataset
class MyOwnDataset(AbstractDataset):
def __init__(self, filepath):
self._filepath = Path(filepath)
def load(self) -> pd.DataFrame:
return pd.read_csv(self._filepath)
def save(self, df: pd.DataFrame) -> None:
df.to_csv(str(self._filepath))
def _exists(self) -> bool:
return Path(self._filepath.as_posix()).exists()
def _describe(self):
return dict(param1=self._filepath)
ds = MyOwnDataset(
filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)
ds.load()
And I am getting the error :
TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save
I've double checked and I have the correct developement version installed. while debugging I am entering the __init_subclass
method though :confused: so I don't really get if it's how it is intended to be used. I don't see anywhere in the code where you monkeypatch _load
and ````_save in case they are not defined in the custom dataset, but maybe I am missing something.
TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save
I've double checked and I have the correct developement version installed. while debugging I am entering the
__init_subclass
method though π so I don't really get if it's how it is intended to be used. I don't see anywhere in the code where you monkeypatch_load
and ````_save in case they are not defined in the custom dataset, but maybe I am missing something.
~Ah, this is probably right... let me try to take a look and patch that. Can probably just throw in some dummy method, since it shouldn't actually be hit.~
Edit: Actually, thinking about it more, should just make load
and save
abstract instead of _load
and _save
, to make the new interface clear. Even if just _load
and _save
is being implemented, you mentioned it's going into __init_subclass__
, and load
and save
will get created there.
@Galileo-Galilei The specific (massive) bug you pointed out should be resolved, and, in trying to figure out how to best resolve it, I think I've also made the code MUCH cleaner. :)
There's no duplicate load/save handling logic, and the __init_subclass__
:
_load
to load
, _save
to save
, if it is defined for the class.load
, save wrapper to the save
).There's no more, what do we do if _load
is defined, and how that is different from what we do when save
is defined. Also, load
and save
are marked abstract, which makes sense.
I don't really like how this now adds inconsistency between the abstract methods that need to be implemented. save
and load
vs. _describe
.
I don't really like how this now adds inconsistency between the abstract methods that need to be implemented.
save
andload
vs._describe
.
I'd be willing to make _describe
public in a follow-up PR? No objection from my side, if that's desirable; I see no real reason _describe
should be a private API.
However, I think that load
and save
are publicly exposed already, whereas _describe
is not, which is what drives this.
Description
Resolves #2199
In discussing #2199, the question arose, why
_load
and_save
need to be private methods. This PR exposes them publicly, choosing to "decorate" or wrap the load and save functionality for each dataset, rather than calling them as private methods from the public base implementation.The changes are made so as to require minimal changes to existing datasets and be backwards-compatible with custom datasets users may have written.
Development notes
Update(6/27): It's no longer necessary to write
load = _load
,save = _save
; this is handled in__init_subclass
. That said, we can still roll out enforcement of users providingload
andsave
as described below, as desired. It's actually much less pressing, but may help simplify the code a bit. :)Previously on this PR...
I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:
If we want to take things slow, in the next minor release of Kedro (
0.20.0
), we can start raisingDeprecationWarning
s for datasets that don't implementsave
. In0.21.0
, we can drop support for datasets that don't definesave
(and just have the older_save
).The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file