kestra-io / kestra

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
https://kestra.io
Apache License 2.0

Add plugin for DataHub integration #3309

Open kriko opened 4 months ago

kriko commented 4 months ago

Feature description

Plugin request for discussion.

We are implementing DataHub for metadata cataloging and data lineage. DataHub supports a variety of sources for automated metadata collection (ingestion): several different databases, dbt, and many other tools (check out the DataHub demo). DataHub also has integration support for Airflow.

Integrating Kestra with DataHub would allow existing Kestra users to easily adopt DataHub as their metadata catalog, and could help existing DataHub ecosystem users replace their Airflow with a modern orchestration solution: Kestra. ;)

The integration could have several versions or milestones. Please comment and suggest how deep the integration should go; so far I have the following proposals.

DataHub plugin for Kestra - execute metadata ingestion pipelines

The first level of integration could be the simplest: executing DataHub ingestion recipes from Kestra, similar to how this is done with the Airflow integration.

The easiest implementation is running the DataHub ingestion Docker container with an attached recipe (see the Docker Compose example). A more elaborate deployment example is available in the DataHub Helm repository.
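
For context, a minimal sketch of what running the ingestion image directly could look like in Docker Compose (the `datahub` CLI entrypoint and the mount path are assumptions on my part):

  services:
    ingestion:
      image: linkedin/datahub-ingestion
      # assumes the image entrypoint is the `datahub` CLI
      command: ["ingest", "-c", "/recipe.yml"]
      volumes:
        - ./recipe.yml:/recipe.yml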

Kestra's implementation could be built on top of executing that Docker container with an ingestion configuration file. The orchestration task could look like this:

  - id: execute_ingestion
    type: io.kestra.plugin.datahub.Ingestion
    # Default, but overridable
    # docker:
    #   image: linkedin/datahub-ingestion
    #   pullPolicy: IF_NOT_PRESENT
    env:
      # optional environment variables
    recipe:
      source:
        type: mysql
        config:
          # Coordinates
          host_port: <MYSQL HOST>:3306
          database: dbname
          # Credentials
          username: root
          password: "{{ secret('MYSQL_PASSWORD') }}"
      sink:
        type: datahub-rest
        config:
          server: http://<GMS_HOST>:8080

Most likely it would be nice to support the same optional settings as the Docker, Shell, and similar tasks.

As for the recipe, it could be either a string (free-text YAML) or an object adhering to the DataHub recipe structure. At the moment I haven't found a JSON Schema for the recipe.yaml structure that could be used for validation, though in the first stage it might not be necessary.
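
For illustration, the string variant could look like this (the multiline `recipe` property is just my sketch of the proposed plugin, not an existing API):

  - id: execute_ingestion
    type: io.kestra.plugin.datahub.Ingestion
    recipe: |
      source:
        type: mysql
        config:
          host_port: <MYSQL HOST>:3306
          database: dbname
          username: root
          password: "{{ secret('MYSQL_PASSWORD') }}"
      sink:
        type: datahub-rest
        config:
          server: http://<GMS_HOST>:8080

The string form would be trivial to pass through to the container unchanged; the object form would be nicer for templating individual fields, at the cost of needing a schema for validation.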

Kestra ingestion plugin for DataHub - collect data lineage from Kestra and catalog it in DataHub

This would be a more complicated task and I am not sure if it would be doable in this phase. However, the idea would be to create a plugin that publishes Kestra as a (metadata) source for DataHub, thus allowing the collection of data lineage (pipelines + metadata) into DataHub.

If Kestra is used as an EL(T) tool, then DataHub could collect (or Kestra push) its metadata to the data catalog. Ideas could be collected from how DataHub integrates with Airflow.

DataHub ingestion orchestration PoC - 1st step

As a first step, I am working on a templated subflow that simply executes the DataHub ingestion container (either via Kubernetes PodCreate or the Docker runner).
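
As a rough sketch of the Docker-runner variant, assuming the Shell Commands task with a Docker runner and that the image ships the `datahub` CLI (property names follow the current docs and may differ per version):

  id: datahub_ingestion
  namespace: company.datahub

  inputs:
    - id: recipe
      type: STRING
      description: DataHub ingestion recipe as YAML

  tasks:
    - id: ingest
      type: io.kestra.plugin.scripts.shell.Commands
      runner: DOCKER
      docker:
        image: linkedin/datahub-ingestion
      inputFiles:
        recipe.yml: "{{ inputs.recipe }}"
      commands:
        - datahub ingest -c recipe.yml

Passing the recipe as a flow input keeps the subflow reusable: each caller supplies its own recipe and the execution mechanics stay in one place.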

anna-geller commented 4 months ago

Thanks for adding the issue.

Both are feasible, fully agree that Ingestion is the easiest to start with.

For ingesting Kestra metadata, we can do that fairly easily in Kestra EE with Kafka, as we only need to consume data from the topic, transform it into DataHub's expected format, and send it regularly via a system flow. With Postgres it will be feasible, but a bit more complicated.
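
To make the idea concrete, a hedged sketch of such a system flow (the `kestra_execution` topic name, the exact task types, and the mapping of Kestra flows to DataHub DataFlow entities are all assumptions, not a confirmed design):

  id: kestra_metadata_to_datahub
  namespace: system

  tasks:
    - id: consume_executions
      type: io.kestra.plugin.kafka.Consume
      properties:
        bootstrap.servers: localhost:9092
      topic: kestra_execution
      keyDeserializer: STRING
      valueDeserializer: JSON
      maxDuration: PT30S

    - id: emit_to_datahub
      type: io.kestra.plugin.scripts.python.Script
      beforeCommands:
        - pip install acryl-datahub
      script: |
        from datahub.emitter.mce_builder import make_data_flow_urn
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DataFlowInfoClass

        # Register a Kestra flow as a DataFlow entity in DataHub
        # (flow id hardcoded here; a real flow would iterate over
        # the records consumed from the topic)
        emitter = DatahubRestEmitter("http://<GMS_HOST>:8080")
        urn = make_data_flow_urn(orchestrator="kestra", flow_id="my_flow", cluster="prod")
        emitter.emit(MetadataChangeProposalWrapper(
            entityUrn=urn,
            aspect=DataFlowInfoClass(name="my_flow"),
        ))

  triggers:
    - id: hourly
      type: io.kestra.plugin.core.trigger.Schedule
      cron: "0 * * * *"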

anna-geller commented 4 months ago

Can you say what your main initial goal is for the Kestra metadata integration?

I know that people often want to see data lineage and additionally track which pipeline touches a specific table/dataset. FWIW, you can already accomplish that in Kestra as well (and we plan to improve that, too).

kriko commented 4 months ago

For now, I am working on orchestrating ingestion tasks for DataHub.

In regard to getting data lineage from Kestra itself, I haven't put much thought into that yet. If there is interest in the community, I hope to see some ideas and feedback on that.