kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks

Document the `kedro run` lifecycle and hooks execution order #1718

Closed noklam closed 9 months ago

noklam commented 2 years ago

Description

Document the kedro run lifecycle with hooks execution order

This is the latest chart we have (from about 2 months ago). It omits the `after_context_created` and `after_command_run` hooks, but we should have those in place too. [image]

I think this should be included in the hooks documentation page
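Until the chart is in the docs, the lifecycle can be roughed out as a plain list of hook calls. This is only a sketch: the hook names are Kedro's, but the ordering is an assumption for a sequential run on the happy path (one input and one output per node, no error hooks), and `hook_order` is a hypothetical helper, not part of Kedro. It should be checked against the runner and session source before being documented.

```python
from typing import List


def hook_order(node_names: List[str]) -> List[str]:
    """List hook calls for a sequential run, assuming one input and one output per node.

    The ordering is an assumption for illustration and should be verified
    against the Kedro source before it is relied on.
    """
    calls = [
        "before_command_run",     # CLI hook, fired before the command body
        "after_context_created",  # project context is available from here on
        "after_catalog_created",
        "before_pipeline_run",
    ]
    for name in node_names:
        calls += [
            f"before_dataset_loaded[{name}]",  # inputs are loaded first
            f"after_dataset_loaded[{name}]",
            f"before_node_run[{name}]",
            f"after_node_run[{name}]",
            f"before_dataset_saved[{name}]",   # outputs are saved last
            f"after_dataset_saved[{name}]",
        ]
    calls += ["after_pipeline_run", "after_command_run"]
    return calls


if __name__ == "__main__":
    for call in hook_order(["preprocess", "train"]):
        print(call)
```

The error hooks (`on_node_error`, `on_pipeline_error`) are left out on purpose, since where they sit relative to `after_command_run` is exactly the kind of detail the diagram should settle.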

Bonus: Add more examples in the Common Use Cases section, e.g. if you need to use the context, `after_context_created` is probably the way to go.
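A use-case example along those lines might look like the sketch below. It assumes the `after_context_created(context)` and `before_pipeline_run(run_params)` hook specs and the `hook_impl` marker from `kedro.framework.hooks`; the class name `ProjectContextHooks` is made up for illustration, and the `try`/`except` fallback exists only so the snippet runs without Kedro installed.

```python
import logging

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so this sketch runs without Kedro installed (illustration only).
    def hook_impl(func):
        return func

logger = logging.getLogger(__name__)


class ProjectContextHooks:
    """Hypothetical hooks class that keeps the context for later hooks."""

    def __init__(self):
        self.context = None

    @hook_impl
    def after_context_created(self, context):
        # Store the context so hooks that fire later can use it.
        self.context = context

    @hook_impl
    def before_pipeline_run(self, run_params):
        # Assumes after_context_created has already fired at this point.
        logger.info("Running project at %s", self.context.project_path)
```

In a real project the instance would be registered in `settings.py`, for example `HOOKS = (ProjectContextHooks(),)`.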

Double bonus: embed links in the diagram to link to the relevant API docs for the hooks. Maybe like this

Context

This PR enables using Mermaid to draw flow charts in our doc, so we can finally have something that is easier to version control and update.

This is a Mermaid example for our deployment charts


flowchart TD
    A{Can your Kedro pipeline run on a single machine?} -- YES --> B[Consult the single-machine deployment guide];
    B --> C{Do you have Docker on your machine?};
    C -- YES --> D[Use a container-based approach];
    C -- NO --> E[Use the CLI or package mode];
    A -- NO --> F[Consult the distributed deployment guide];
    F --> G[What distributed platform are you using?\n\nCheck out the guides for:\n<ul><li>Argo</li><li>Prefect</li><li>Kubeflow Pipelines</li><li>AWS Batch</li><li>Databricks</li><li>Dask</li></ul>];
    style G text-align:left
    H["Does (part of) your pipeline integrate with Amazon SageMaker?<br/><br/>Read the SageMaker integration guide"];

antonymilne commented 2 years ago

This would be great to have. Just to say that it might be better to do this as two diagrams:

Or maybe it's better to colour code (if possible in mermaid) or label somehow each hook with its plugin entry point.

Basically don't feel like you just need to reproduce the diagram exactly as it is in the Miro board. Change things to whatever is clearest and most useful in the docs.

astrojuanlu commented 1 year ago

Potentially useful: how Hooks themselves are run.

Gave it a quick go, just to see if it has potential:

flowchart LR
  PluginManager -.- pm.hook --> HookRelay
  HookRelay -- foo_hook --> HookCaller --> HookSpec
  HookRelay -- foo_hook2 --> HookCaller -- plugin1 --> HookImpl1
  HookCaller -- plugin2 --> HookImpl2

Originally posted in https://github.com/pytest-dev/pluggy/issues/341#issuecomment-1063980216
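To make the diagram concrete, here is a toy model of that dispatch path in plain Python. It is not pluggy's implementation: the `Toy*` classes are hypothetical stand-ins, hook specs are left out entirely, and the only behaviour borrowed from pluggy is that implementations are called last-registered-first and their non-`None` results are collected into a list.

```python
from typing import Any, Callable, List


class ToyHookCaller:
    """Holds every implementation registered for one hook name."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.impls: List[Callable[..., Any]] = []

    def __call__(self, **kwargs: Any) -> List[Any]:
        # Call the most recently registered implementation first and
        # collect all non-None results.
        results = []
        for impl in reversed(self.impls):
            result = impl(**kwargs)
            if result is not None:
                results.append(result)
        return results


class ToyHookRelay:
    """Namespace exposing hook callers as attributes, like `pm.hook.foo_hook`."""


class ToyPluginManager:
    def __init__(self) -> None:
        self.hook = ToyHookRelay()

    def register(self, plugin: object, hook_names: List[str]) -> None:
        # Attach each matching method on the plugin to the caller for that hook.
        for name in hook_names:
            impl = getattr(plugin, name, None)
            if impl is None:
                continue
            caller = getattr(self.hook, name, None)
            if caller is None:
                caller = ToyHookCaller(name)
                setattr(self.hook, name, caller)
            caller.impls.append(impl)


class Plugin1:
    def foo_hook(self, arg):
        return f"plugin1 got {arg}"


class Plugin2:
    def foo_hook(self, arg):
        return f"plugin2 got {arg}"


pm = ToyPluginManager()
pm.register(Plugin1(), ["foo_hook"])
pm.register(Plugin2(), ["foo_hook"])
print(pm.hook.foo_hook(arg=1))  # ['plugin2 got 1', 'plugin1 got 1']
```

One call on the relay fans out to every registered implementation, which is the part of the picture that matters for understanding how several Kedro plugins can implement the same hook.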

stichbury commented 1 year ago

It's almost a happy 1 year birthday for this ticket 🍰

Here is a ticket that I think forms the parent of this one: https://github.com/kedro-org/kedro/issues/1940

noklam commented 1 year ago

Let's ship #1940 to get this going, since this one has a "low" priority and that smaller ticket has a "high" one. This would be a useful document even if it's just for ourselves internally.