kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Deployment support/Kedro deployer #2058

Open merelcht opened 2 years ago

merelcht commented 2 years ago

Description

We want to offer users better support for deploying Kedro.

Implementation ideas

Questions

datajoely commented 2 years ago

So I love this and think we should really emphasise the power of modular pipelines here as a standout feature of Kedro. Choosing the granularity of what to translate is critical here.

Couler looks v cool btw.

noklam commented 1 year ago

https://www.linen.dev/s/kedro/t/12647848/hi-kedro-community-slightly-smiling-face-i-have-a-question-r#45fa4774-5ebd-4826-9c10-47fd95d1bebd

Something similar has been done in some plugins already; we should come up with a flexible way for Kedro core to create the mapping for different deployment targets.

astrojuanlu commented 1 year ago

I guess https://github.com/kedro-org/kedro/issues/143 is tangentially related?

noklam commented 1 year ago

It is! Although I am thinking more about deployment to a platform/orchestrator here, there are definitely cases of users deploying a Kedro pipeline to an endpoint.

noklam commented 1 year ago

Not sure where this should go, so I am putting it here as a rough thought. Inspired by deepyaman, it seems fairly easy to convert a Kedro pipeline to Metaflow: it's just variable assignment between steps (nodes in Kedro's terms).

The benefit of using Metaflow is that it lets you abstract infrastructure with a decorator like @kubernetes(memory=64000) (obviously the tricky part is getting that cluster set up properly, but from a DS perspective it doesn't matter, as they simply want more compute for a specific task). This could potentially be integrated more smoothly with Kedro's tags to denote the required infrastructure.

This is not to say that Kedro is going to integrate with Metaflow, but to show that doing so is possible. For example:

from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
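
To make the tags idea concrete, here is a rough sketch (not an existing Kedro or Metaflow integration) of how node tags could drive which infrastructure decorator a generated step would receive. The tag names and the TAG_TO_DECORATOR mapping are made up for illustration; only the kedro.pipeline calls are real API.

from kedro.pipeline import node, pipeline

def preprocess(raw):
    return raw

def train(features):
    return "model"

demo = pipeline(
    [
        node(preprocess, inputs="raw_data", outputs="features", name="preprocess"),
        node(train, inputs="features", outputs="model", name="train", tags=["gpu"]),
    ]
)

# Hypothetical mapping from a Kedro tag to the decorator a generated
# Metaflow-style step would receive.
TAG_TO_DECORATOR = {
    "gpu": "@resources(gpu=1)",
    "highmem": "@kubernetes(memory=64000)",
}

for n in demo.nodes:  # Pipeline.nodes is topologically sorted
    decorators = [TAG_TO_DECORATOR[t] for t in n.tags if t in TAG_TO_DECORATOR]
    print(n.name, "->", decorators or "default step, no extra infrastructure")
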
astrojuanlu commented 1 year ago

Related: https://github.com/kedro-org/kedro/issues/3094

That issue contains a research synthesis; we can use this issue to collect the plan.

astrojuanlu commented 6 months ago

Found in https://github.com/kedro-org/kedro/issues/770 from @idanov:

kedro deploy airflow

(found after listening to @ankatiyar's Tech Design on kedro-airflow and reading https://github.com/kedro-org/kedro-plugins/issues/25#issuecomment-2107299300)
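
Purely as an illustration of what that could look like, here is a hedged sketch of a click-based `kedro deploy` command group. The command surface, the options, and the render_airflow_dag helper are all hypothetical, not an existing Kedro or kedro-airflow API.

import click

def render_airflow_dag(pipeline_name: str, output_dir: str) -> None:
    # Placeholder: a real implementation would load the project's pipeline
    # and write out an Airflow DAG definition, as kedro-airflow does today.
    print(f"Would render pipeline '{pipeline_name}' into {output_dir}")

@click.group(name="deploy")
def deploy() -> None:
    """Deploy a Kedro project to a target platform."""

@deploy.command(name="airflow")
@click.option("--pipeline", "pipeline_name", default="__default__")
@click.option("--output-dir", default="dags/")
def deploy_airflow(pipeline_name: str, output_dir: str) -> None:
    """Translate the chosen pipeline into an Airflow DAG file."""
    render_airflow_dag(pipeline_name, output_dir)

if __name__ == "__main__":
    deploy()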

astrojuanlu commented 6 months ago

As we learn more about how to deploy to Airflow (see @DimedS's https://github.com/kedro-org/kedro/pull/3860), it becomes more apparent that the list of steps can get quite involved. This is not new, as the Databricks workflows also require some care.

We have plenty of evidence that this is an important pain point that affects lots of users.

The main questions are:

  1. Is there anything that can be done in general to make these processes easier, simpler, leaner? (So that we don't rush to automate something that should have been easier in the first place)
  2. Is there a way these processes can be partly automated for specific platforms of choice? (Assuming that there's no general solution for all platforms, given the low traction of couler)
    • And which platforms should we choose, and with which criteria?
noklam commented 6 months ago

We should really treat Kedro as a pipeline DSL. Most orchestrators work with a DAG, so these are all generic features. For example:

So there is definitely a theme around DAG manipulation. Is the current Pipeline API flexible enough? We have to implement a separate DFS to support the grouped memory nodes feature for Airflow. We could also extend the grouped memory nodes feature to, e.g., running specific nodes on a GPU machine, a Spark cluster, a different Python environment, or a particular machine type.

Where would be the best place to hold this metadata? Maybe tags?

There is some early vision in #770 and https://github.com/kedro-org/kedro/issues/3094 done by @datajoely last year. It would also be really cool to have some kind of DryRunner where one can orchestrate multiple Kedro sessions in one go; this would allow catching errors like a MemoryDataset not existing in an "every Kedro node as an Airflow node" situation.
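
To make the grouping idea concrete, here is a rough sketch. The "group:" tag-naming convention is made up for illustration, while the Pipeline API calls (pipeline, node, only_nodes_with_tags) are real.

from kedro.pipeline import node, pipeline

def split(data):
    return data, data

def fit(train):
    return "model"

def score(model, test):
    return 0.9

pipe = pipeline(
    [
        node(split, "data", ["train", "test"], name="split", tags=["group:prep"]),
        node(fit, "train", "model", name="fit", tags=["group:train"]),
        node(score, ["model", "test"], "metric", name="score", tags=["group:train"]),
    ]
)

group_tags = sorted({t for n in pipe.nodes for t in n.tags if t.startswith("group:")})
for tag in group_tags:
    sub = pipe.only_nodes_with_tags(tag)  # sub-pipeline for one orchestrator task
    # Each group could become e.g. one Airflow task that runs `kedro run --tags=<tag>`.
    print(tag, "->", [n.name for n in sub.nodes])

A DryRunner could then iterate over these sub-pipelines and check, before anything is submitted to the orchestrator, that every dataset crossing a group boundary is declared in the catalog.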

datajoely commented 6 months ago

Yeah - my push here is not to focus too much on Airflow but to really address the fundamental problems which make Kedro slightly awkward in these situations.

astrojuanlu commented 5 months ago

Potentially interesting https://github.com/run-house/runhouse (via https://www.run.house/blog/lean-data-automation-a-principal-components-approach)

astrojuanlu commented 5 months ago

Possible inspiration: Apache Beam https://beam.apache.org/get-started/beam-overview/

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK"
])
with beam.Pipeline(options=options) as p:
    ...

There's a difference, though, between deploying and running; this is more similar to our Dask runner https://docs.kedro.org/en/stable/deployment/dask.html

But maybe it means that we need a clearer mental model too. In my head, when we deploy a Kedro project, Kedro is no longer responsible for the execution, whereas writing a runner implies that Kedro is in fact driving the execution and acting as a poor man's orchestrator.

datajoely commented 5 months ago

I guess there is a philosophical question (very much related to granularity) on how you express which chunks of pipeline(s) get executed in different distributed systems.

The execution plan of a Kedro DAG is resolved at runtime; we do not have a standardised way of:

1) Declaring a group of nodes which must be run together
2) Saying what infrastructure they should run on
3) Validating if / ensuring data is passed between groups through persistent storage.

I'm not against the context manager approach for doing this in principle - but I think it speaks to the more fundamental problem of how some implicit elements of Kedro (which increase development velocity) leave a fair amount of unmanaged complexity when it gets to this point in the process.
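
On point 3 specifically, a minimal sketch of that validation (not an existing Kedro feature; the groups dict and persisted_datasets set are assumed inputs from whatever grouping mechanism we end up with) could look like:

from kedro.pipeline import Pipeline

def unpersisted_boundary_datasets(
    pipe: Pipeline, groups: dict[str, set[str]], persisted_datasets: set[str]
) -> set[str]:
    """Return datasets that cross a group boundary without being persisted."""
    node_to_group = {name: group for group, names in groups.items() for name in names}

    # Which group produces each intermediate dataset?
    producers = {out: node_to_group[n.name] for n in pipe.nodes for out in n.outputs}

    crossing = {
        inp
        for n in pipe.nodes
        for inp in n.inputs
        if inp in producers and producers[inp] != node_to_group[n.name]
    }

    # Anything crossing a boundary but not declared in the catalog would default
    # to a MemoryDataset and be lost between separately scheduled tasks.
    return crossing - persisted_datasets

Something along these lines could also back the DryRunner idea mentioned earlier in the thread.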