mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License
2.07k stars 100 forks source link

Dynamic Task/ParallelTask/Pipeline #101

Open leo-schick opened 1 year ago

leo-schick commented 1 year ago

Currently the data pipeline DAG is defined fixed on compilation and supports only a small option of dynamics e.g. the task ParallelReadFile supports to read files (the number of files are unknown on compilation time).

I would like to have similar dynamics in other areas as well:

Dynamic nodes

The following dynamic nodes could be implemented:

Dynamic tasks

A option to give the Task a python function which is executed on pipeline runtime and returns a list of commands to execute in order.

Dynamic parallel tasks

A option to give the ParallelTask a python function which is executed on pipeline runtime and returns a list of commands / command chains to be executed in parallel.

Dynamic pipeline

A option to define a DynamicPipeline where the nodes are defined within a python function which is executed on pipeline runtime.

Implement UI awareness

The dynamic node objects (Task/ParallelTask/Pipeline) must be defined so that the python function which defines the actual commands/tasks/nodes is not run when interacting with the UI.

Implement node cost handling

These dynamic nodes should be defined so that they define sub-nodes for the dynamic node object. The pipeline execution should then intelligently retract the node cost from the database when the node had been executed in the past. E.g. a dynamic node could represent a export of a database table. By defining the sub-nodes, the pipeline execution can intelligently run the nodes with the highest node cost first to save up execution time.

Example use cases