kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.82k stars 895 forks source link

Clarify documentation for matplotlib datasets #2536

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

Description

The documentation to visualize plots with matplotlib contains this snippet:

import matplotlib.pyplot as plt
import seaborn as sn

...

def create_confusion_matrix(companies: pd.DataFrame):
    actuals = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
    predicted = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1]
    data = {"y_Actual": actuals, "y_Predicted": predicted}
    df = pd.DataFrame(data, columns=["y_Actual", "y_Predicted"])
    confusion_matrix = pd.crosstab(
        df["y_Actual"], df["y_Predicted"], rownames=["Actual"], colnames=["Predicted"]
    )
    sn.heatmap(confusion_matrix, annot=True)
    return plt

It's very puzzling that the node returns the matplotlib.pyplot state machine through the plt variable. However, it works!

The MatplotlibWriter documentation, on the other hand, hints that the node should return the matplotlib.figure.Figure, which also worked an feels way more natural.

If returning pyplot is expected, I think we should make it more clear in both places of the docs. If we want to discourage it (yes please), then we should tweak the example in the narrative docs so it returns a Figure instead.

Related: kedro-org/kedro-plugins#529.

Context

The problem with the documentation of datasets is that the actual logic lives in _load and _save methods that are private and therefore not documented. I recall there are generic issues to "redesign the catalog" kedro-org/kedro#1778 kedro-org/kedro#1981 but I don't know if they are about this.

antonymilne commented 1 year ago

I'm surprised that this works, although looking at the matplotlib.pyplot source makes it clear why it does:

def savefig(*args, **kwargs):
    fig = gcf()
    res = fig.savefig(*args, **kwargs)
    ...

Either way, this is definitely not the intended use of the dataset and I agree we should not encourage it. Let's change the docs to return the Figure object instead.

As for documentation of datasets, you're right that this is one disadvantage of the current approach. To get around the problem at the moment we just put all the docs in the class docstring, which is not great but it's ok I think. As in the github issues you mention, IMO there's more fundamental problems with AbstractDataSet and the like, but changing from the template method pattern would be a big change (see also https://github.com/kedro-org/kedro/issues/2409#issuecomment-1490989279). So if we eventually do that, getting the better documentation would just be a nice side-benefit.

astrojuanlu commented 1 year ago

To get around the problem at the moment we just put all the docs in the class docstring, which is not great but it's ok I think.

Yep... The story for multi-repository docs is not impossible, but requires some focus: https://github.com/kedro-org/kedro/issues/2072#issuecomment-1528834875

astrojuanlu commented 10 months ago

xref https://github.com/kedro-org/kedro-plugins/issues/353 and a conversation we had in Slack: https://linen-slack.kedro.org/t/15875966/i-have-a-plot-saved-as-an-image-in-my-catalog-yml-and-the-fi#6d98204f-5d9f-497b-88df-80a23980dcfa

yeah this one is tricky, because I would expect a MatplotlibDataset to save/load figures, rather than the raw PNG files but proper serialization of figures has been an ongoing theme in matplotlib for 14+ years these days you can pickle the figure, at least

The fact that some datasets are meant to be write-only (hence "artifacts") is in line with the discussions we're having in https://github.com/kedro-org/kedro/issues/1936#issuecomment-1717214883

astrojuanlu commented 7 months ago

This should not be tackled before https://github.com/kedro-org/kedro-plugins/issues/353