merelcht opened this issue 2 years ago
So I love this and think we should really emphasise the power of modular pipelines here as a standout feature of Kedro. Choosing the granularity of what to translate is critical here.
Couler looks v cool btw.
This has already been done in some plugins; we should come up with a flexible way for Kedro core to create the mapping for different targets.
I guess https://github.com/kedro-org/kedro/issues/143 is tangentially related?
It is! Although I am thinking more about deployment to a platform/orchestrator here, there are definitely cases of users deploying a Kedro pipeline to an endpoint.
Not sure where this should go, so I'm just putting it here as an early thought. Inspired by deepyaman, it seems fairly easy to convert a Kedro pipeline to Metaflow: it's just variable assignment between steps (nodes, in Kedro's terms).
The benefit of using Metaflow is that it allows you to abstract infrastructure with a decorator like @kubernetes(memory=64000)
(obviously the tricky part is getting the cluster set up properly, but from a DS perspective that doesn't matter, as they simply want more compute for a specific task). This could potentially integrate more smoothly with Kedro's tags to denote the required infra; there is a small sketch of the decorator after the example below.
This is not to say that Kedro is going to integrate with Metaflow, but to show that doing it is possible.
```python
from metaflow import FlowSpec, step


class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)


if __name__ == '__main__':
    LinearFlow()
```
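To illustrate the infrastructure abstraction mentioned above, here is a minimal sketch using Metaflow's @kubernetes decorator; the flow and step names and the memory value are illustrative, and a working Kubernetes setup is assumed:

```python
# Sketch only: a resource decorator attached to a single step.
from metaflow import FlowSpec, kubernetes, step


class TrainFlow(FlowSpec):

    @kubernetes(memory=64000)  # request ~64 GB of memory for this step only
    @step
    def start(self):
        self.model = 'trained'
        self.next(self.end)

    @step
    def end(self):
        print('model artifact: %s' % self.model)


if __name__ == '__main__':
    TrainFlow()
```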
Related: https://github.com/kedro-org/kedro/issues/3094
That issue contains a research synthesis; we can use this issue to collect the plan.
Found in https://github.com/kedro-org/kedro/issues/770 from @idanov:
kedro deploy airflow
(found after listening to @ankatiyar's Tech Design on kedro-airflow
and reading https://github.com/kedro-org/kedro-plugins/issues/25#issuecomment-2107299300)
As we learn more about how to deploy to Airflow (see @DimedS's https://github.com/kedro-org/kedro/pull/3860) it becomes more apparent that the list of steps can become quite involved. This is not new, as the Databricks workflows also require some care.
We have plenty of evidence that this is an important pain point that affects lots of users.
The main questions are:
We should really treat Kedro as a Pipeline DSL. Most orchestrators work with a DAG, so these are all generic features. For example:
So there is definitely a theme around DAG manipulation: is the current Pipeline API flexible enough? We had to implement a separate DFS
to support the grouped memory nodes feature for Airflow. We could also extend the grouped memory nodes feature to, e.g., running specific nodes on a GPU machine, a Spark cluster, a different Python environment, or a particular machine type.
Where would be the best place to hold this metadata? Maybe tags?
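As a rough illustration of the tag idea, here is a sketch that uses only the existing Pipeline API; the node functions are placeholders, and the set-difference grouping is just one assumption of how a deployment plugin might split the DAG:

```python
# Sketch: mark infra requirements with node tags and split the pipeline into
# per-target groups that a deployment plugin could translate into tasks.
from kedro.pipeline import node, pipeline


def preprocess(raw):
    return raw


def train(features):
    return 'model'


pipe = pipeline([
    node(preprocess, inputs="raw_data", outputs="features", name="preprocess"),
    node(train, inputs="features", outputs="model", name="train", tags=["gpu"]),
])

# The existing API already lets us filter by tag...
gpu_group = pipe.only_nodes_with_tags("gpu")
# ...and take the remainder via pipeline subtraction.
default_group = pipe - gpu_group

print(sorted(n.name for n in gpu_group.nodes))      # ['train']
print(sorted(n.name for n in default_group.nodes))  # ['preprocess']
```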
There is some early vision in #770 and https://github.com/kedro-org/kedro/issues/3094, done by @datajoely last year.
It would also be really cool to have some kind of DryRunner
where one can orchestrate multiple Kedro sessions in one go; this would allow one to catch errors like a memory dataset not existing in an "every Kedro node as an Airflow node" situation.
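A minimal sketch of the kind of check such a DryRunner could perform, assuming a one-node-per-Airflow-task mapping; the function name is hypothetical, but it only uses the existing Pipeline and DataCatalog APIs:

```python
# Sketch: flag datasets exchanged between nodes that are not registered in the
# catalog, i.e. would default to MemoryDataset and be lost between Airflow tasks.
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline


def find_unpersisted_datasets(pipeline: Pipeline, catalog: DataCatalog) -> list[str]:
    """Return names of intermediate datasets that are not persisted in the catalog."""
    intermediate = pipeline.all_outputs() & pipeline.all_inputs()
    return sorted(name for name in intermediate if name not in catalog.list())
```

A DryRunner could run this before execution and fail fast with the offending dataset names, instead of the problem surfacing mid-run on the orchestrator.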
Yeah - my push here is not to focus too much on Airflow but really to address the fundamental problems which make Kedro slightly awkward in these situations.
Potentially interesting https://github.com/run-house/runhouse (via https://www.run.house/blog/lean-data-automation-a-principal-components-approach)
Possible inspiration: Apache Beam https://beam.apache.org/get-started/beam-overview/
Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",
])

with beam.Pipeline(options=options) as p:
    ...
```
There's a difference, though, between deploying and running; this is more similar to our Dask runner https://docs.kedro.org/en/stable/deployment/dask.html
But maybe it means that we need a more clear mental model too. In my head, when we deploy a Kedro project, Kedro is no longer responsible for the execution, whereas writing a runner implies that Kedro is in fact driving the execution and acting as a poor-man's orchestrator.
I guess there is a philosophical question (very much related to granularity) on how you express which chunks of pipeline(s) get executed in different distributed systems.
The execution plan of a Kedro DAG is resolved at runtime; we do not have a standardised way of:
1) Declaring a group of nodes which must be run together
2) Saying what infrastructure they should run on
3) Validating / ensuring that data is passed between groups through persistent storage
I'm not against the context manager approach for doing this in principle - but I think it speaks to the more fundamental problem of how some implicit elements of Kedro (which increase development velocity) leave a fair amount of unmanaged complexity when it gets to this point in the process.
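For concreteness, here is a purely hypothetical sketch of what such a context manager might look like; run_on() is not an existing Kedro API, it simply records an infra tag on the nodes created inside the block using Node.tag():

```python
# Hypothetical sketch only: a context manager that tags nodes with the
# infrastructure they should run on. run_on() is not an existing Kedro API.
from contextlib import contextmanager

from kedro.pipeline import node, pipeline


@contextmanager
def run_on(target: str):
    """Collect nodes created inside the block and tag them with the target infra."""
    collected = []
    yield collected
    # Node.tag() returns a copy of the node with the extra tags attached.
    collected[:] = [n.tag([target]) for n in collected]


def train(features):
    return 'model'


with run_on("gpu") as gpu_nodes:
    gpu_nodes.append(node(train, inputs="features", outputs="model", name="train"))

pipe = pipeline(gpu_nodes)
print(pipe.only_nodes_with_tags("gpu").nodes)  # the tagged 'train' node
```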
Description
We want to offer users better support with deploying Kedro.
Implementation ideas
Questions