Closed by LavMatt 3 weeks ago
Some further context on usage:
The curated models contain raw data that has been conformed to SCD2 - so that analysts can see a snapshot at any point in time. These tables are the ones that analysts actually use.
The dbt sources are the raw data prior to this transformation, which isn't useful to anyone on its own. (This wasn't always the case: the transform was at some point moved into dbt.)
I recommend doing this for a single model first, since it's hard to roll back if we ingest the wrong thing.
Expecting 1103 models from this. Of these, there are only 596 distinct table names, so we can expect some duplication / similar-looking entries.
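As a sanity check once ingestion runs, these counts can be verified against the dbt manifest. A rough sketch — `count_models` is a hypothetical helper, though the `nodes` / `resource_type` / `name` fields are standard in dbt's manifest.json:

```python
import json


def count_models(manifest_path: str) -> tuple[int, int]:
    """Return (total models, distinct model names) from a dbt manifest.json."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    names = [
        node["name"]
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    ]
    # Duplicated table names across databases collapse in the set
    return len(names), len(set(names))
```

If the numbers come back as roughly 1103 and 596, the selection matches what we expect.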
They are organised as follows:
models/courts/caseman/
models/courts/caseman_preprocessed/
models/courts/xhibit/
models/opg/sirius/
models/opg/sirius_preprocessed/
models/prison/book_secure_move/
models/prison/nomis_dbt/
models/prison/nomis_preprocessed/
models/prison/nomis_sensitive/
models/probation/delius/
models/probation/delius_historic_imputed/
models/probation/delius_historic_unmodified/
models/risk/oasys/
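Since the layout above is consistently models/&lt;domain&gt;/&lt;database&gt;/, the ingestion code could derive both values straight from the model path. A minimal sketch (the function name is hypothetical, not part of the existing ingestion code):

```python
from pathlib import PurePosixPath


def domain_and_database(model_path: str) -> tuple[str, str]:
    """Split a path like models/prison/nomis_dbt/... into (domain, database)."""
    parts = PurePosixPath(model_path).parts
    if len(parts) < 3 or parts[0] != "models":
        raise ValueError(f"unexpected model path: {model_path}")
    return parts[1], parts[2]
```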
So our existing ingestion code should add domains and databases, and the models should inherit any metadata set on the domain level in dbt_project.yml.
I'm going to try adding the tags for oasys first and see if it comes through as expected.
Within these subdirectories we still don't want to ingest all of the models.
These generated models still represent multiple stages of transformation, and we only care about the result.
We have intermediate steps which we need to exclude, and there is a long list of these. For example, the template files for oasys are:
oasys__{table_name}.sql
oasys__{table_name}_ddv.sql
oasys__{table_name}_wap.sql
oasys__{table_name}_hashes.sql
oasys__{table_name}.yml
where oasys__{table_name} is the final result.
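One way to exclude the intermediate stages is to filter on their suffixes. A sketch under the assumption that the `_ddv` / `_wap` / `_hashes` suffixes listed above are the complete set for oasys (other source systems may use different ones):

```python
# Intermediate-stage suffixes observed for oasys; assumption that this
# list is complete and that other source systems may differ.
INTERMEDIATE_SUFFIXES = ("_ddv", "_wap", "_hashes")


def is_final_model(model_name: str) -> bool:
    """True for the final result (oasys__{table_name}), False for
    intermediate transformation stages."""
    return not model_name.endswith(INTERMEDIATE_SUFFIXES)
```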
Also, dbt resource paths don't support wildcards, so if we add the tags in dbt_project.yml we will need to specify each and every model.
Here is an approach that modifies the generate_models.py script instead of dbt_project.yml: https://github.com/moj-analytical-services/create-a-derived-table/pull/2137/files
Further PRs:
Just waiting for this to be picked up by the scheduled ingestion now.
Curated databases are created in dbt via a GitHub Action (https://github.com/moj-analytical-services/create-a-derived-table/blob/main/.github/workflows/deploy-curated-prod.yml), and more specifically by a Python script.
These databases are what we had previously confused with sources; they are the data we should actually be ingesting for source-system metadata (e.g. nomis, oasys).
They are classed as models by dbt and appear as such in the manifest, but unlike other models they do not have model configuration files that persist in the cadet repo. The above action creates the models in memory via the script and templates, so they need to be tagged for display in the catalogue slightly differently.
Each curated database has a template yaml file, and it is this file where the dc_display_in_catalogue tag should be defined, e.g. for nomis. Or it could potentially be added within the dbt_project.yml file.
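If the tag ends up being injected by the generation script rather than declared in a yaml file, it could be added to the in-memory model config before the template is rendered. A hedged sketch — `add_display_tag` is a hypothetical helper, not part of the existing script:

```python
def add_display_tag(model_config: dict) -> dict:
    """Return a copy of an in-memory model config with the
    dc_display_in_catalogue tag appended, preserving existing tags."""
    tags = list(model_config.get("tags", []))
    if "dc_display_in_catalogue" not in tags:
        tags.append("dc_display_in_catalogue")
    return {**model_config, "tags": tags}
```

Making it idempotent means the script can be re-run safely on configs that were already tagged.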
We should establish a process for adding the display tags to these "source" models created via the script and templates, ahead of asking DMET to agree which databases they would like to surface in find-moj-data.