kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.82k stars 895 forks source link

Render concrete parameter/return types for dataset `save`/`load` #2199

Closed deepyaman closed 1 month ago

deepyaman commented 1 year ago

Description

The parameter type for a dataset's save is _DI, and the return type for load is _DO. This doesn't make any sense to a user (and barely makes sense to me, as a Kedro developer, because I saw the typing PR go in and have that context). I'd rather it not display any types, if getting the correct concrete types is too difficult.

image
antonymilne commented 1 year ago

This would indeed be great to fix but I think only someone very brave should attempt it as I can see it being very frustrating and difficult to achieve... Do you have any idea how you could actually get Sphinx to do that without us having to manually override every single page of automatically generated docs for datasets? Possibly we can edit a higher-level rst template to do it?

merelcht commented 1 year ago

To do: Try updating to the newest version of sphinx and see if it has improved.

stichbury commented 1 year ago

We are using a later version of Sphinx and this is still not resolved.

It may be something that @astrojuanlu has familiarity with but if it's a non-trivial fix I suggest we put it as a very low priority. Could we potentially have an explanation of what is going on somewhere so if people want to know what it means, they can search and find the reasoning?

astrojuanlu commented 1 year ago

Upgrading sphinx or sphinx-autodoc-typehints proved to create lots of headaches in the past. I agree the current result is less than ideal but I don't think this is trivial to fix at all. It might be easier to actually redesign the datasets...

astrojuanlu commented 1 year ago

It doesn't look much better on latest with the spun-off kedro_datasets https://docs.kedro.org/en/stable/kedro_datasets.pandas.CSVDataSet.html

image

The root cause is that the heavy lifting is done by _load, which is a private method. Conversation about this is scattered around different issues (#1778, #1936, #2409 to name a few) but essentially I think the only sane way to fix this is to come up with a design that favours composition over inheritance, something that looks more similar to the proposal in https://github.com/kedro-org/kedro/issues/1936#issue-1410411315.

astrojuanlu commented 8 months ago

I think it's easier to just wait on a redesign of our catalog and datasets #1778.

deepyaman commented 4 months ago

Desired output

We can redefine load and save in each class to produce our desired result:

from pandas.util._decorators import doc  # We could roll our own, simpler version

class LambdaDataset(AbstractDataset):
    ...

    @doc(AbstractDataset.save.__doc__)
    def save(self, data: int) -> None:
        super().save(data)

    @doc(AbstractDataset.load.__doc__)
    def load(self) -> pd.DataFrame:
        return super().load()
image

Note that clicking through to the code on the docs will show that generated code, so maybe we should make it explicit that this is a generated function in the comments:

image

Possible implementations

Now, the question is—how do we accomplish this for all datasets? It is possible to either:

  1. Actually add the load and save methods to each dataset implementation. There is probably a negligible performance hit from the added function call (since I/O is almost certainly the dominant part of any reasonable dataset). It's not the prettiest from a code perspective, though, where you'd rather rely on inheritance.
  2. You can generate the additional methods just for docs generation.
  3. Is there another way to modify classes parsed by Sphinx before docs generation?

Notes

astrojuanlu commented 4 months ago

We discussed this on tech design today (2024-05-08). While there weren't clear arguments against this solution, we discussed alternative approaches:

https://github.com/kedro-org/kedro/blob/c8a0e22464f2d229e26ec7fc138436c8863e3dcc/kedro/io/core.py#L190-L202

deepyaman commented 3 months ago

We discussed this on tech design today (2024-05-08). While there weren't clear arguments against this solution, we discussed alternative approaches:

  • Given that this would mainly be the presentation layer, maybe it would be worth exploring a Sphinx extension instead.

    • However, creating Sphinx extensions can be an interesting exercise.
    • @deepyaman wasn't particularly motivated to pursue this approach but didn't close the door on somebody else doing it.
    • This is probably something that wouldn't be useful to the broader community, given that this is a very specific situation caused by the design.
  • We discussed about the datasets design (xref Re-design io.core and io.data_catalog #1778)

    • Of course we acknowledged that it will take us ~months to get there, so it would be nice to have an interim solution
    • We started discussing whether we could get rid of the custom error handling in AbstractDataSet (see below) given that it's directly affecting this issue and it's making it difficult for users to debug their dataset errors Easier CustomDataset Creation #1936 (comment) but we didn't have enough time to discuss

https://github.com/kedro-org/kedro/blob/c8a0e22464f2d229e26ec7fc138436c8863e3dcc/kedro/io/core.py#L190-L202

Thanks @astrojuanlu! Just copying in my summary, but I think you covered it:

It sounds like overriding load/save may be workable, there’s a potential weekend project if somebody wants to write their own Sphinx plugin, but maybe the first followup is to see whether we can get rid of _load and _save altogether?

I have some ideas on the latter, and I will investigate this.