Open antonymilne opened 2 years ago
@noklam also commented that we should consider what actually belongs to `AbstractDataSet` and what belongs to the implementations. Just to bring @noklam's comment to life a bit more, since it's something I've often thought about in the past too: we have the following bit of code repeated 20 times throughout our datasets:
```python
def _release(self) -> None:
    super()._release()
    self._invalidate_cache()

def _invalidate_cache(self) -> None:
    """Invalidate underlying filesystem caches."""
    filepath = get_filepath_str(self._filepath, self._protocol)
    self._fs.invalidate_cache(filepath)
```
and the following is repeated 37 times:

```python
load_path = get_filepath_str(self._get_load_path(), self._protocol)
```
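For context, that line typically sits inside a concrete dataset's `_load` (with a mirror image in `_save`). A rough sketch, loosely based on the pandas CSV dataset and trimmed for brevity:

```python
import pandas as pd

from kedro.io.core import get_filepath_str


def _load(self) -> pd.DataFrame:
    # Every dataset re-derives the real load path from protected members
    # of the base class before it can open the file.
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    with self._fs.open(load_path, mode="rb") as fs_file:
        return pd.read_csv(fs_file, **self._load_args)
```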
`_release` is not an `abc.abstractmethod`, so it doesn't have to be supplied. Why does it exist separately in so many datasets? Why do we need to access so many protected members (e.g. `self._fs`, `self._get_load_path()`, etc.)?

Is there anything we can do to make it easier to define a custom dataset? e.g. why is `_describe` a required method? Overall it feels to me like we have more boilerplate in dataset implementations than we really need.
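One possible direction (a sketch only, not an agreed design) is to hoist the fsspec-specific cache invalidation into a shared base class so concrete datasets no longer repeat it. The class name `_FSSpecVersionedDataSet` is made up for illustration:

```python
from kedro.io.core import AbstractVersionedDataSet, get_filepath_str


class _FSSpecVersionedDataSet(AbstractVersionedDataSet):
    """Hypothetical intermediate base class that owns the fsspec plumbing."""

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)
```

Concrete datasets would then only need to implement `_load`, `_save` and `_describe`, which is the surface we already tell people to override.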
Adding this while I am looking at #1768 and some object-store-related issues. Currently, we actually put this into `CustomDataSet` where it needs to access `self._fs`; at least 99% of the cases are copied & pasted and identical to this:

```python
exists_function=self._fs.exists,
glob_function=self._fs.glob,
```

A potential solution for #1768 may be to pass some arguments into this line, but there is no easy way to pass in any arguments. I am also not sure how `glob` works across all the different filesystems; it is actually surprising that it works currently. I guess `glob` is an implicit API across different filesystems?
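For reference, those two lines typically appear in a concrete dataset's `__init__` when it calls up to the versioned base class, roughly like this (a trimmed sketch; credentials, storage options and other arguments omitted):

```python
from pathlib import PurePosixPath

import fsspec

from kedro.io.core import Version, get_protocol_and_path


def __init__(self, filepath: str, version: Version = None) -> None:
    protocol, path = get_protocol_and_path(filepath, version)
    self._protocol = protocol
    self._fs = fsspec.filesystem(self._protocol)

    super().__init__(
        filepath=PurePosixPath(path),
        version=version,
        exists_function=self._fs.exists,  # copied verbatim in ~every dataset
        glob_function=self._fs.glob,      # ditto
    )
```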
A really important issue IMO and a great write-up.
`catalog.datasets` is presumably the "official" route to get a dataset rather than `_get_dataset`, but `catalog.datasets` wouldn't allow namespaced datasets to be accessed without doing `getattr`. There are some very subtle and non-obvious differences between `datasets` and `_get_dataset`, and then there's also `catalog._data_sets` (which I think might just be a historical leftover... but not sure).
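To make the comparison concrete, these are the routes being contrasted (dataset names invented for illustration):

```python
# Attribute access works for simple names but not for namespaced ones.
reviews = catalog.datasets.reviews
companies = getattr(catalog.datasets, "ingestion.companies")  # namespaced name needs getattr

# The protected method handles every case, which is why people keep reaching for it.
companies = catalog._get_dataset("ingestion.companies")
```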
I think the class (`_FrozenDatasets`) that `catalog.datasets` returns an object of is a good candidate for refactoring: looking a dataset up by name means resorting to `vars(catalog.datasets)[dataset_name]` or `catalog.datasets.__dict__[dataset_name]`, and I don't feel this level of (read) encapsulation is merited for an object assigned to a public attribute. There is also a lot going on in `__init__`; delegating most of this to a few new, well-documented methods would make this class much easier to understand.

Added: Expose load version information when `load_version=None`.
Another symptom here is: `self.resolve_load_version` isn't doing anything; the real function that we want people to use is `self._get_load_path`, but it's not clearly defined in the API. We only tell people they need to override `_load` and `_save`. As a result, the redundant `resolve_load_version` call is there to signal that this needs to be used, but I am not convinced this is an ideal way to do so.

Notes from Technical Design:
Catalog API
The reason why it's not straightforward to fetch datasets from the catalog directly is that the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, not for modifying datasets in any way. https://github.com/kedro-org/kedro/blob/main/kedro/io/data_catalog.py#L169 and the `_FrozenDatasets` class were added to make it possible to have tab completion for catalog datasets in `ipython` or `jupyter` sessions. The PR that added this functionality is on private-kedro: https://github.com/quantumblacklabs/private-kedro/pull/84/files
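For readers without access to that private PR, the idea behind `_FrozenDatasets` can be sketched roughly like this (a simplified illustration, not the actual implementation): datasets become instance attributes, which is what makes tab completion work, and mutation is blocked after construction.

```python
class FrozenDatasetsSketch:
    """Simplified stand-in for _FrozenDatasets, for illustration only."""

    def __init__(self, **datasets):
        # Write straight into __dict__ so the frozen __setattr__ below
        # doesn't reject the initial assignments.
        self.__dict__.update(datasets)

    def __setattr__(self, name, value):
        raise AttributeError("Datasets are read-only; use DataCatalog load/save instead.")
```

With datasets stored as plain attributes, `catalog.datasets.<TAB>` completes in an IPython/Jupyter session, which is the behaviour the original PR was after.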
In order to improve the catalog API we need to gain a better understanding of users' needs.
Datasets API
The `invalidate_cache` method relies on fsspec, and that's why it's not part of `AbstractDataSet`. It might be possible to add it to `AbstractVersionedDataSet`.

As a team we need to get a better understanding of how versioning works. It's one of the most complex aspects of datasets and the catalog, and also where we expect to be able to improve a lot.
A related exercise is to completely re-design how the catalog and datasets work: https://github.com/kedro-org/kedro/issues/1981
One more thing (yes, I think about this issue and https://github.com/kedro-org/kedro/issues/1936 several times a day every single day):
> The reason why it's not straightforward to fetch datasets from the catalog directly is that the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, not for modifying datasets in any way.
I really love this design, and I clearly see how `DataCatalog.load` and `.save` could be the only interfaces you need, hence hiding the datasets from users.

But it turns out users want the underlying datasets for all sorts of things. Just today I had two users ask me how they could access the underlying dataset object for various "wicked" uses of Kedro (one was related to dynamic pipelines and the other to kedro-mlflow). I seem to always forget about the `.datasets` property, so I always recommend the protected `catalog._get_dataset`.
But if you think only users who should know better are using this protected method, hold your horses, because our own kedro-viz does too!

So I think we should definitely explore the possibility of "opening" this abstraction.
I am glad this is still on the radar because it also still pops into my head on a daily basis 😀
Just to add some more context and another couple of data points... Hiding datasets does indeed make sense when using kedro as a framework, but it makes life extremely difficult when it comes to writing plugins/extensions/integrations with kedro. There are IMO many non-nefarious reasons to want to pull dataset information from the catalog (especially now that there's the `metadata` attribute, I would have thought), and there should really be a public method for this.
I always recommend `catalog._get_dataset` too because I believe it is the best method currently available and, as per @astrojuanlu, it is the one used in kedro-viz, which gives me some confidence that it's not going to disappear any time soon. I would even go so far as to say that I consider this method "almost public" already; when writing the kedro integration in vizro I opted to use it, just as I recommended its use within kedro core itself.
Another flaw of the current inheritance model is that if users want to configure a different versioning strategy for their datasets, they have to create a custom `CustomVersionedDataset` and hack or break the inheritance chain. xref #1979
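As an illustration of the kind of workaround this forces (the class name and the rule are invented for this sketch; `resolve_load_version` is the real method discussed earlier in the thread, and the `kedro_datasets` import is an assumption — any concrete dataset would do):

```python
import os

from kedro_datasets.pandas import CSVDataSet


class EnvPinnedCSVDataSet(CSVDataSet):
    """Hypothetical dataset that swaps in a different versioning rule by
    overriding a method that was never intended as an extension point."""

    def resolve_load_version(self):
        # Pin the load version from an environment variable instead of the
        # timestamp-based default baked into the versioned base class.
        pinned = os.environ.get("MY_LOAD_VERSION")
        return pinned or super().resolve_load_version()
```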
Spun out of https://github.com/kedro-org/kedro/issues/1691#issuecomment-1176679924... Let's collect ideas here on what the current problems are with `io`. To me it feels like we've neglected it and it's ripe for a re-design.

I don't think the existing structures are massively wrong as they stand, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:
- `catalog.datasets` is presumably the "official" route to get a dataset rather than `_get_dataset`, but `catalog.datasets` wouldn't allow namespaced datasets to be accessed without doing `getattr`. There are some very subtle and non-obvious differences between `datasets` and `_get_dataset`, and then there's also `catalog._data_sets` (which I think might just be a historical leftover... but not sure). In https://github.com/kedro-org/kedro/pull/1795 @jmholzer used `vars(catalog.datasets)[dataset_name]`.
- `_filepath` is only defined for versioned datasets (? seems weird)
- `get_filepath_str(self._get_load_path(), self._protocol)`, which is pretty obscure. Similar to #1654

So I think we should look holistically at the structures involved here and work out what the API should look like so there's one, clear way to access the things that people need to access. I actually don't think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a `catalog.dumps`) or whether it's just a case of making what we already have better organised, documented and clearer.
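Purely to make "one, clear way" concrete, a hypothetical shape for such an accessor could look like the sketch below (the function and its placement are invented here, not an agreed design):

```python
from kedro.io import AbstractDataSet, DataCatalog


def get_dataset(catalog: DataCatalog, name: str) -> AbstractDataSet:
    """Single public entry point for fetching a dataset by name, namespaced or not."""
    # Wraps the behaviour the thread already relies on behind one documented door;
    # in a real design this would presumably be a method on DataCatalog itself.
    return catalog._get_dataset(name)
```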