ministryofjustice / data-catalogue

Data catalogue • This repository is defined and managed in Terraform

Spike: investigate how to deploy custom connectors #19

Closed MatMoore closed 5 months ago

MatMoore commented 6 months ago

We've written custom connectors for:

- https://github.com/ministryofjustice/datahub-custom-api-source
- https://github.com/ministryofjustice/datahub-custom-domain-source

Currently we've only run them locally via the DataHub CLI, but if we want to keep using them we should work out how to deploy them, so that we can trigger them from the DataHub UI and schedule ingestions.
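For context, running one of them locally today looks roughly like this: a recipe that points at the custom source class, executed with the DataHub CLI. The source type, config keys and URLs below are illustrative, not the actual names from our packages.

```yaml
# recipe.yaml -- sketch only; source type and config keys are illustrative
source:
  # a custom source can be referenced by the fully-qualified path to its class (hypothetical name here)
  type: datahub_custom_api_source.source.CustomApiSource
  config:
    base_url: https://example.gov.uk/api
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

# run locally with:
#   pip install acryl-datahub ./datahub-custom-api-source
#   datahub ingest -c recipe.yaml
```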

MatMoore commented 5 months ago

It looks like if we want to include custom ingestion libraries, we would need to create our own GMS Docker image that extends the linkedin/datahub-gms one and installs our custom packages.

Then we can override datahub-gms.image.repository in our helm values.
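As a sketch, the override in our Helm values would look something like this (the image repository and tag are placeholders for whatever we'd publish):

```yaml
# values.yaml for the datahub chart -- sketch; repository/tag are placeholders
datahub-gms:
  image:
    repository: ghcr.io/ministryofjustice/datahub-gms-custom
    tag: "v0.12.x-moj1"
```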

Via Slack:

datahub-actions is the image created by the Acryl team that bundles the ingestion libraries and potentially some proprietary code; that's why you don't see its Dockerfile in the open source repo. If you run your ingestion through the UI, the datahub-actions image is responsible for the metadata ingestion. That said, when you have a lot of non-standard ingestion, i.e. custom sources or custom transformers, you can bundle these custom plugins plus the metadata ingestion library into a new image, then deploy and run it in your own way.

Alternatively, we could keep the build as it is and run the ingestion outside of DataHub via the CLI and GitHub Actions.

This seems like it would be the easier option, but at the cost of the ingestions not all being visible in one place.
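For comparison, the GitHub Actions route would be roughly a scheduled workflow that installs the custom packages and runs the CLI against the target instance. The workflow file, recipe path, schedule and secret names below are illustrative:

```yaml
# .github/workflows/ingest-custom-sources.yml -- sketch only
name: Ingest custom sources
on:
  schedule:
    - cron: "0 6 * * *"   # daily; schedule is illustrative
  workflow_dispatch: {}

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # install the DataHub CLI plus our custom source packages (refs are illustrative)
      - run: |
          pip install acryl-datahub \
            "git+https://github.com/ministryofjustice/datahub-custom-api-source.git" \
            "git+https://github.com/ministryofjustice/datahub-custom-domain-source.git"
      # the CLI picks up the GMS endpoint and token from these environment variables
      - run: datahub ingest -c ingestion/custom_api_recipe.yaml
        env:
          DATAHUB_GMS_URL: ${{ secrets.DATAHUB_GMS_URL }}
          DATAHUB_GMS_TOKEN: ${{ secrets.DATAHUB_GMS_TOKEN }}
```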

MatMoore commented 5 months ago

If we created our own build, I'm assuming the images we would need to extend would be the datahub-gms one and possibly the datahub-frontend one(?)

Relevant build commands:

For reference, here is the datahub-ingestion Dockerfile: https://github.com/datahub-project/datahub/blob/master/docker/datahub-ingestion/Dockerfile

I'm not sure how this relates to the GMS image, but there is this line that brings in the Python packages:

```dockerfile
RUN uv pip install --no-cache -e ".[base,datahub-rest,datahub-kafka,snowflake,bigquery,redshift,mysql,postgres,hive,clickhouse,glue,dbt,looker,lookml,tableau,powerbi,superset,datahub-business-glossary]"
```

Presumably we can add a similar line to install any custom packages.
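Something like the following is what I have in mind, assuming we extend the published ingestion image rather than rebuilding from source (the base image tag and package refs are illustrative):

```dockerfile
# Dockerfile -- sketch; base tag and package refs are illustrative
FROM acryldata/datahub-ingestion:head

# add our custom source plugins on top of the stock ingestion libraries
# (we may need a USER root line first, depending on how the base image is set up)
RUN pip install --no-cache-dir \
    "git+https://github.com/ministryofjustice/datahub-custom-api-source.git" \
    "git+https://github.com/ministryofjustice/datahub-custom-domain-source.git"
```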

MatMoore commented 5 months ago

Note: datahub docker images are built using python 3.10, whereas we are targeting 3.11 https://github.com/datahub-project/datahub/blob/08731055ba1df94a1f7e52b23c5d6e257b1f0c79/docker/datahub-ingestion-base/Dockerfile#L27

MatMoore commented 5 months ago

I'm not 100% clear on how the GMS interacts with the python ingestion code.

It seems that the first thing we would need to customise would be the ingestion-cron image, deployed in this chart https://github.com/acryldata/datahub-helm/tree/89c92c8ac73b4dc371d647216d60dff28cc7c9ae/charts/datahub/subcharts/datahub-ingestion-cron

Not sure if that is it, or if there are other things to modify.
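If we did go down this road, my understanding is the subchart is driven by values along these lines: point its image at our custom build and mount a recipe from a ConfigMap. The exact keys should be double-checked against the subchart's values.yaml; the names and schedule below are illustrative.

```yaml
# datahub-ingestion-cron values -- sketch; verify keys against the subchart's values.yaml
image:
  repository: ghcr.io/ministryofjustice/datahub-ingestion-custom
  tag: "v0.12.x-moj1"
crons:
  custom-api-source:
    schedule: "0 6 * * *"
    recipe:
      configmapName: datahub-ingestion-recipes
      fileName: custom_api_recipe.yml
```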

jemnery commented 5 months ago

Right now I'm leaning towards the GHA option:

Accepting that there's a drawback to having ingestions in two places, another downside is that the GitHub Actions setup could get messy or confusing when we consider which DataHub instance is the target for each ingestion.

MatMoore commented 5 months ago

Decision: we will use GitHub Actions to run any custom connectors.

Follow up task: https://github.com/ministryofjustice/find-moj-data/issues/291