Re-design io.core and io.data_catalog

antonymilne commented 2 years ago

Spun out of https://github.com/kedro-org/kedro/issues/1691#issuecomment-1176679924... Let's collect ideas here on what current problems are with io. To me it feels like we've neglected it and it's ripe for a re-design.

Note. Like the configuration overhaul, some of this would be non-breaking (behind the scenes implementation) and some would be breaking (changes to public API). In theory we're free to do as we please with any function starting with _, i.e. we can remove, rename, changes arguments at will. In practice, however, some of these might be very commonly used (e.g. self._get_load_path() used in lots of datasets) so should not be regarded as non-breaking.

1691 and #1580 are actually just symptoms of a more fundamental underlying issue: the API and underlying workings of io.core and io.data_catalog are very confusing and should be rethought in general. These are very old components in kedro and maybe some of the decisions that were originally made about their design should be revised. I think there's also very likely to be old bits of code there that could now be removed or renamed (e.g. who would guess that something named add_feed_dict is used to add parameters to the catalog?). It feels like tech debt rather than intentional design currently.

I don't think they're massively wrong as it stands, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:

catalog.datasets is presumably the "official" route to get a dataset rather than _get_dataset, but catalog.datasets wouldn't allow a namespaced datasets to be accessed without doing getattr. There's some very subtle and non-obvious differences between datasets and _get_dataset, and then there's also catalog._data_sets (which I think might just be a historical leftover... but not sure). In https://github.com/kedro-org/kedro/pull/1795 @jmholzer used vars(catalog.datasets)[dataset_name].
it also seems at a glance that _filepath is only defined for versioned datasets (? seems weird)
to actually get the correct versioned filepath it's even harder - in our datasets we do get_filepath_str(self._get_load_path(), self._protocol) which is pretty obscure. Similar to #1654

So I think we should look holistically at the structures involved here and work out what the API should look like so there's one, clear way to access the things that people need to access. I actually don't think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a catalog.dumps) or whether it's just a case of making what we already have better organised, documented and clearer.

antonymilne commented 2 years ago

@noklam also commented that we should consider what actually belongs to AbstractDataSet and what belongs to the implementations. Just to bring @noklam's comment to life a bit more, since it's something I've often thought about in the past too. We have the following bit of code repeated 20 times throughout our datasets:

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)

and the following is repeated 37 times:

load_path = get_filepath_str(self._get_load_path(), self._protocol)

_release is not an abc.abstractmethod so doesn't have to be supplied. Why does it exist separately in so many datasets? Why do we need to access so many protected members (e.g. self._fs, self._get_load_path(), etc.)?

Is there anything we can do to make it easier to define a custom dataset? e.g. why is _describe a required method. Overall it feels to me like we have more boilerplate in dataset implementations than we really need.

noklam commented 2 years ago

Adding this while I am looking at #1768 some object store related issues. Currently, we actually put this into CustomDataSet where it needs to access the self._fs, at least 99% of the case are copied & paste and identical to this.

exists_function=self._fs.exists,
glob_function=self._fs.glob,

A potential solution for #1768 may be passing some arguments into this line, but there is no easy way to pass in any arguments. I am also not sure how glob works across all different filesystems, it is actually surprising that it works currently, I guess glob is an implicit API across different filesystems?

https://github.com/kedro-org/kedro/blob/f4914201d7f6f38318c2c1f074fcdf802b3e1e0d/kedro/io/core.py#L537

jmholzer commented 2 years ago

A really important issue IMO and a great write-up.

catalog.datasets is presumably the "official" route to get a dataset rather than _get_dataset, but catalog.datasets wouldn't allow a namespaced datasets to be accessed without doing getattr. There's some very subtle and non-obvious differences between datasets and _get_dataset, and then there's also catalog._data_sets (which I think might just be a historical leftover... but not sure).

I think the class (_FrozenDatasets) that catalog.datasets returns an object of is a good candidate for refactoring:

It has no simple interface for the datasets it contains. Currently, the only linter-friendly ways are to use vars(catalog.datasets)[dataset_name] or catalog.datasets.__dict__[dataset_name]. I don't feel this level of (read) encapsulation is merited for an object assigned to a public attribute.
The class is poorly documented; it would be good to have docstrings as the purpose of this class is not easy to grok.
There is too much is going on inside __init__, delegating most of this to a few new, well-documented methods would also make this class much easier to understand.

merelcht commented 2 years ago

noklam commented 2 years ago

Added Expose version load version information when load_version=None

Another symptom here is:

self.resolve_load_version isn't doing anything, the real function that we want people to use is self._get_load_path, but it's not clearly defined in the API. We only tell people they need to override _load and _save. As a result, the redundant resolve_load_version call is there to signal that this needs to be used, but I am not convinced this is an ideal way to do so.

merelcht commented 2 years ago

Notes from Technical Design:

Catalog API

The reason why it's not straightforward to fetch datasets from the catalog directly, is because the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, but not modify in any way. https://github.com/kedro-org/kedro/blob/main/kedro/io/data_catalog.py#L169 and the _FrozenDatasets class were added to make it possible to have tab completion for catalog datasets in ipython or jupyter sessions. The PR that added this functionality is on private-kedro: https://github.com/quantumblacklabs/private-kedro/pull/84/files
In order to improve the catalog API we need to gain understanding of users needs
[x] https://github.com/kedro-org/kedro/issues/1978

Datasets API

The invalidate_cache method relies on fsspec and that's why it's not part of the AbstractDataSet. It might be possible to add it to the AbstractVersionedDataSet
As a team we need to get a better understanding of how versioning works. It's one of the most complex aspects of datasets and the catalog and also where we expect to be able to improve a lot.
[ ] https://github.com/kedro-org/kedro/issues/1979
[x] https://github.com/kedro-org/kedro/issues/1980

A related exercise is to completely re-design how the catalog and datasets work: https://github.com/kedro-org/kedro/issues/1981

astrojuanlu commented 1 year ago

One more thing (yes, I think about this issue and https://github.com/kedro-org/kedro/issues/1936 several times a day every single day):

The reason why it's not straightforward to fetch datasets from the catalog directly, is because the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, but not modify in any way.

I really love this design, and I clearly see how DataCatalog.load and .save could be the only interfaces you need, hence hiding the datasets from users.

But it turns out users want the underlying datasets for all sorts of things. Just today I had two users ask me how they could access the underlying dataset object for various "wicked" uses of Kedro (one was related with dynamic pipelines and the other with kedro-mlflow). I seem to always forget about the .datasets property, so I always recommend the protected catalog._get_dataset.

But if you think only users that should know better are using this protected method, hold your horses, because our own kedro-viz does, too!

https://github.com/kedro-org/kedro-viz/blob/a258e1fe525cd0fed7cfbfb69131877d49b6bdcf/package/kedro_viz/data_access/repositories/catalog.py#L116-L126

So I think we should definitely explore the possibility of "opening" this abstraction.

antonymilne commented 1 year ago

I am glad this is still on the radar because it also still pops into my head on a daily basis 😀

Just to add some more context and another couple of data points... Hiding datasets does indeed make sense when using kedro as a framework, but it makes life extremely difficult when it comes to writing plugins/extensions/integrations with kedro. There are IMO many non-nefarious reasons to want to pull dataset information from the catalog (especially now there's the metadata attribute I would have thought) and there should really be a public method for this.

I always recommend catalog._get_dataset too because I believe it is the best method currently available, and as per @astrojuanlu, it is the one used in kedro-viz, which gives me some confidence that it's not going to disappear any time soon. I would even go so far as to say that I consider this method "almost public" already, and when writing kedro integration in vizro I opted to use it, just as I recommended its use within kedro core itself.

astrojuanlu commented 1 year ago

Another flaw of the current inheritance model is that if users want to configure a different versioning strategy for their datasets, they have to create a custom CustomVersionedDataset and hack or break the inheritance chain. xref #1979

kedro-org / kedro

Re-design io.core and io.data_catalog #1778