ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
https://find-moj-data.service.justice.gov.uk/

Ingest curated layer databases and tables from cadet #862

Closed LavMatt closed 3 weeks ago

LavMatt commented 1 month ago

Curated databases are created in dbt via a GitHub Action https://github.com/moj-analytical-services/create-a-derived-table/blob/main/.github/workflows/deploy-curated-prod.yml and, more specifically, by a Python script.

These databases are what we had previously confused with sources, and they are the data we should be ingesting for source system metadata (e.g. nomis, oasys).

They are classed as models by dbt and appear as such in the manifest, but they do not have model configuration files that persist in the cadet repo (as other models do). The above action creates the models in memory via the script and templates, so they need to be tagged for display in the catalogue slightly differently.

Each curated database has a template yaml file, and it is in this file that the dc_display_in_catalogue tag should be defined, e.g. nomis.

Alternatively, the tag could be added within the dbt_project.yml file.

We should establish a process for adding the display tags to these "source" models created via the script and templates, in advance of asking DMET to agree which databases they would like to surface in find-moj-data.
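
As a hedged sketch of what that process might look like (the function name, the allow-list, and the yaml handling are illustrative assumptions, not the actual generate_models.py code), the generation script could inject the tag into each rendered template before compiling the model:

    import yaml  # pyyaml

    DISPLAY_TAG = "dc_display_in_catalogue"
    # Hypothetical allow-list of databases DMET agree to surface
    DATABASES_TO_SURFACE = {"oasys", "nomis_dbt"}

    def add_display_tag(rendered_yaml: str, database: str) -> str:
        """Add the catalogue display tag to a rendered model yaml template."""
        if database not in DATABASES_TO_SURFACE:
            return rendered_yaml
        doc = yaml.safe_load(rendered_yaml)
        for model in doc.get("models", []):
            config = model.setdefault("config", {})
            tags = config.setdefault("tags", [])
            if DISPLAY_TAG not in tags:
                tags.append(DISPLAY_TAG)
        return yaml.safe_dump(doc, sort_keys=False)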

MatMoore commented 1 month ago

Some further context on usage:

The curated models contain raw data that has been conformed to SCD2 (slowly changing dimensions, type 2), so that analysts can see a snapshot at any point in time. These tables are the ones that analysts actually use.

The dbt sources are the raw data prior to this transformation, which is not actually useful to anyone. (This wasn't always the case; the transform was at some point moved into dbt.)

MatMoore commented 1 month ago

I recommend doing this for a single model first, since it's hard to roll back if we ingest the wrong thing.

MatMoore commented 1 month ago

Expecting 1103 models from this. Of these, there are only 596 distinct table names, so we can expect some duplication / similar-looking entries.
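
For reference, a rough way to reproduce those counts from a compiled manifest (the manifest path, and narrowing to exactly the curated subset, are left as assumptions):

    import json

    with open("target/manifest.json") as f:
        manifest = json.load(f)

    # Narrow this filter to the curated model paths being ingested
    models = [
        node for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    ]
    distinct_names = {node["name"] for node in models}
    print(f"{len(models)} models, {len(distinct_names)} distinct names")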

They are organised as follows:

        models/courts/caseman/
        models/courts/caseman_preprocessed/
        models/courts/xhibit/
        models/opg/sirius/
        models/opg/sirius_preprocessed/
        models/prison/book_secure_move/
        models/prison/nomis_dbt/
        models/prison/nomis_preprocessed/
        models/prison/nomis_sensitive/
        models/probation/delius/
        models/probation/delius_historic_imputed/
        models/probation/delius_historic_unmodified/
        models/risk/oasys/

So our existing ingestion code should add domains and databases, and the models should inherit any metadata set at the domain level in dbt_project.yml.
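
To illustrate that mapping (the function and manifest field names are assumptions, not our actual ingestion code), the domain and database fall straight out of each model's file path:

    def domain_and_database(original_file_path: str) -> tuple[str, str]:
        """E.g. "models/prison/nomis_dbt/x.sql" -> ("prison", "nomis_dbt")."""
        # Path layout, as listed above: models/<domain>/<database>/<model>.sql
        _, domain, database, *_ = original_file_path.split("/")
        return domain, database

    assert domain_and_database("models/risk/oasys/oasys__x.sql") == ("risk", "oasys")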

I'm going to try adding the tags for oasys first and see if it comes through as expected.

MatMoore commented 1 month ago

Within these subdirectories we still don't want to ingest all of the models.

These generated models still represent multiple stages of transformation, and we only care about the result.

We have intermediate steps which we need to exclude, and there is a long list of these. For example, for oasys the template files are:

        oasys__{table_name}.sql
        oasys__{table_name}.yml
        oasys__{table_name}_ddv.sql
        oasys__{table_name}_hashes.sql
        oasys__{table_name}_wap.sql

Where oasys__{table_name} is the final result.
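
Any tagging or filtering logic therefore needs to match only the final models, e.g. by suffix (a minimal sketch, assuming the suffix list above is complete for oasys; other databases may differ):

    # Intermediate-stage suffixes taken from the oasys templates above
    INTERMEDIATE_SUFFIXES = ("_ddv", "_wap", "_hashes")

    def is_final_model(model_name: str) -> bool:
        """oasys__{table_name} -> True; oasys__{table_name}_ddv -> False."""
        return not model_name.endswith(INTERMEDIATE_SUFFIXES)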

Also, dbt resource paths don't let you include wildcards, so if we add the tags in dbt_project.yml, we will need to specify each and every one.

Here is an approach that modifies the generate_models.py script instead of dbt_project.yml: https://github.com/moj-analytical-services/create-a-derived-table/pull/2137/files

MatMoore commented 1 month ago

Further PRs:

MatMoore commented 4 weeks ago

Just waiting for this to be picked up by the scheduled ingestion now.