ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform

Spike: how do we get more complete database level information? #651

Closed: seanprivett closed this issue 4 days ago

seanprivett commented 1 month ago

Context

We want to include a 'Database' level (implemented as a Container) in DataHub for all derived datasets coming from CaDeT. We want this so that we can add descriptive information to the Database entity as it will be useful for users (as demonstrated by the amount of Database-level data added by users to the Data Discovery Tool).

CaDeT does not build a 'Database' level, but it does define a name for this level (schema_name).

So far we have inferred the Database name from the dbt dataset path and created these at the point of ingestion. This means there are no definition source files for Databases, and so there is no 'source' for Database metadata (other than any metadata added to DataHub itself).

How do we go about creating these Database metadata sources?

One of the pieces of metadata we would like to include is the 'display name' of the database, as the current names generated from CaDeT use underscores.

All metadata we would like to include:

Should these be mandatory at the database level? If they are not assigned a domain, databases won't be discoverable from browse.

Is it safe to make assumptions based on table level metadata? E.g. do all tables in the same database have the same owner and domain?

LavMatt commented 1 week ago

Option 1 - Adding database level metadata to the dbt manifest

I initially explored whether there was a way to get database-level (schema, in dbt terms) metadata into the manifest.json file produced by dbt, which holds all the create-a-derived-table metadata and is used by DataHub for ingestion.

In short, there does not appear to be a way to do this.

I looked at adding the properties to the +meta key at the database level of the dbt_project.yml file, e.g.:

      prison_population_history:
        +meta:
          database_description: It's prison population data
          dc_owner: a.person@justice.gov.uk
          dc_slack_channel_name: "#ask-somebody-something"
          dc_slack_channel_url: https://moj.enterprise.slack.com/archives/1234
          dc_readable_name: Prison population history

But these properties are then inherited and set for every table in the database, and so appear on every table node in the manifest. This would result in a lot of additional custom properties on tables in DataHub, and it wouldn't populate anything for the database entities we've created in DataHub.
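For anyone who wants to see the inheritance behaviour for themselves, a rough check like the one below (not part of the proposal) shows the duplication. It assumes the standard dbt manifest layout, where "nodes" is keyed by unique_id and database-level +meta ends up merged onto each node (at node level and/or under config, depending on dbt version).

    # Rough check: count how many model nodes carry the inherited dc_* meta keys.
    import json
    from collections import Counter

    with open("target/manifest.json") as f:
        manifest = json.load(f)

    counts = Counter()
    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue
        # meta may appear at node level and/or under config, depending on dbt version
        meta = {**(node.get("meta") or {}), **(node.get("config", {}).get("meta") or {})}
        if "dc_owner" in meta:
            counts[node["schema"]] += 1

    # Every model in the schema repeats the same database-level metadata.
    for schema, n in counts.items():
        print(f"{schema}: {n} models carrying the inherited dc_* meta keys")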

LavMatt commented 1 week ago

Option 2 - Database metadata collection process outside of dbt but within CaDeT

Since the above is not possible, I have worked up a proposal for collecting database metadata within the create-a-derived-table GitHub repo, but outside of any dbt functionality. It feels more appropriate for the database metadata to be part of create-a-derived-table than to live in an external source repository.

I've raised a PR in create-a-derived-table to better demonstrate the approach https://github.com/moj-analytical-services/create-a-derived-table/pull/2052

In summary, it proposes holding database metadata YAML files (one per database) in a dedicated folder, in the following format:

database_metadata:
  name: "example_derived"
  description: "This is an example description for an example database, it has some example tables"
  dc_owner: "some.person@justice.gov.uk"
  dc_slack_channel_name: "#ask-somebody-something"
  dc_slack_channel_url: "https://moj.enterprise.slack.com/archives/1234"
  dc_readable_name: "example database"

A script then runs as part of the GitHub Actions 'deploy dbt docs' workflow; it creates a consolidated JSON file containing all database metadata and saves it to S3 at the same location we pick up the manifest file from (so no extra permissions are required). A rough sketch of the consolidation step is below.
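This is a minimal sketch only; the actual implementation is in the linked create-a-derived-table PR. The folder, bucket and key names are illustrative rather than agreed values, and the output shape ({"databases": [...]}) is just one possible layout.

    # Sketch: consolidate per-database YAML files into one JSON file on S3.
    import json
    from pathlib import Path

    import boto3
    import yaml

    METADATA_DIR = Path("database_metadata")           # hypothetical folder of per-database YAML files
    BUCKET = "mojap-derived-tables"                     # hypothetical bucket; same location as manifest.json
    KEY = "prod/run_artefacts/database_metadata.json"   # hypothetical key alongside the manifest

    databases = []
    for path in sorted(METADATA_DIR.glob("*.yaml")):
        with open(path) as f:
            doc = yaml.safe_load(f)
        databases.append(doc["database_metadata"])

    body = json.dumps({"databases": databases}, indent=2)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=body.encode("utf-8"))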

We can then add a small amount of custom code to our CaDeT ingestion to pick up the database metadata contained in the new JSON file.
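For illustration only, that ingestion-side change could look something like the sketch below. It assumes the consolidated JSON shape from the previous sketch, uses the DataHub Python REST emitter, and leaves the Container URN lookup as a placeholder for whatever our existing CaDeT ingestion already does to build Container URNs; none of the names or endpoints here are agreed.

    # Sketch only: bucket, key, endpoint and the container-URN lookup are placeholders.
    import json

    import boto3
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ContainerPropertiesClass

    def container_urn_for_database(name: str) -> str:
        """Placeholder: reuse however the existing ingestion derives Container URNs."""
        raise NotImplementedError

    obj = boto3.client("s3").get_object(
        Bucket="mojap-derived-tables",                      # hypothetical, as above
        Key="prod/run_artefacts/database_metadata.json",
    )
    databases = json.loads(obj["Body"].read())["databases"]

    emitter = DatahubRestEmitter("https://datahub-gms.example.gov.uk")  # illustrative endpoint
    for db in databases:
        props = ContainerPropertiesClass(
            name=db.get("dc_readable_name", db["name"]),
            description=db.get("description"),
            customProperties={
                "dc_owner": db.get("dc_owner", ""),
                "dc_slack_channel_name": db.get("dc_slack_channel_name", ""),
                "dc_slack_channel_url": db.get("dc_slack_channel_url", ""),
            },
        )
        emitter.emit(
            MetadataChangeProposalWrapper(
                entityUrn=container_urn_for_database(db["name"]),
                aspect=props,
            )
        )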

LavMatt commented 1 week ago

I am going to join the analytics engineering practice forum on the morning of Thursday 5 September to discuss the proposal more widely with users of CaDeT.

LavMatt commented 1 week ago

Now meeting the analytics engineers on 10 September instead. Blocked until after that meeting, where the proposal will be shared.

LavMatt commented 5 days ago

After meeting with some of the analytics engineers and running through my Option 2 proposal, I've agreed a position.

To summarise, they are happy to proceed with Option 2 for database metadata collection and will begin populating some of the metadata for the CaDeT databases into the PR I have made.

There are the following caveats though:

I will make a ticket to update our ingestion code to use the database metadata that will be created and saved to S3.