snowplow / dbt-snowplow-web

A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables at varying levels of aggregation.

Implement dbt-expectations package to enable a more robust testing framework #116

Closed emielver closed 1 year ago

emielver commented 2 years ago

Describe the feature

We should leverage the dbt-expectations package to expand our suite of tests and to customise the rigour of our testing frameworks. The end goal is for tests to be a solid, customisable, and flexible part of the dbt packages we've created; currently they are very strict and not necessarily practical.

As an example, we can use the proportion_unique_values test on the event_id in page views, with a default proportion of 95% but allowing users to modify this value.
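For illustration, a rough sketch of what this could look like using dbt-expectations (the model and column names match the web package's page views table, but the var name and defaults here are just placeholders):

# page_views.yml (sketch only -- var name and defaults are placeholders)
models:
    - name: snowplow_web_page_views
      columns:
          - name: event_id
            tests:
                - dbt_expectations.expect_column_proportion_of_unique_values_to_be_between:
                      min_value: '{{ var("snowplow__test_unique_eventid_prop", 0.95) }}'
                      max_value: 1.0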

If there are more ideas, feel free to add them as comments to this issue!

jrpeck1989 commented 2 years ago

For example, in the ecom package you could imagine a place where users could set their expectations around their revenue or sales figures for a day, e.g. transaction revenue should be expected to fall between $5 and $500. This is probably somewhere we would let users set their own expectations rather than setting them for them.

Another ecom related expectation could be that all individual transactions have > 0 associated product IDs (thus checking for malformed transaction events).

Another expectation could be on the proportion of events that are flagged as robots or crawlers (either from the IAB enrichment or by checking the user agent), flagging an alert if more than 5% of events in a single day are reported as bots. This is an example of an expectation where we could be happy to set a general default for all users (whilst allowing them to customise it).
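To make these a bit more concrete, a rough sketch of what user-configurable ecom expectations might look like (the model, column, and var names here are illustrative, not actual package tables):

# sketch only -- model, column, and var names are illustrative
models:
    - name: snowplow_ecommerce_transactions
      columns:
          - name: transaction_revenue
            tests:
                - dbt_expectations.expect_column_values_to_be_between:
                      min_value: '{{ var("snowplow__test_min_transaction_revenue", 5) }}'
                      max_value: '{{ var("snowplow__test_max_transaction_revenue", 500) }}'
          - name: number_products
            tests:
                - dbt_expectations.expect_column_values_to_be_between:
                      min_value: 1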

rlh1994 commented 2 years ago

To enable this we'll need to bump the dbt-utils dependency used in dbt-snowplow-utils up to [">=0.9.0", "<1.0.0"], as there's a dependency chain of dbt-expectations -> dbt-date -> dbt-utils.

It looks like 0.9.0 doesn't have any breaking changes itself that we would have to adjust for, so we should be able to bump the upper limit for dbt-utils in dbt-snowplow-utils. However, it does deprecate the cross-db functions that we make use of (87 times). Most of them (excluding current_timestamp) are now available in dbt-core, but that would require at least dbt-core version 1.2.0 (currently we accept >=1.0.0, <2.0.0), and if we want to fully move away from deprecated functions then we'll need to increase the requirement again once the remaining cross-db macros are migrated into the core package.


Alternatively, we can use the older version of dbt-expectations for now, which at least lets us develop against it and unblocks this.
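As a sketch, the pins involved would look something like the below in packages.yml (the dbt_utils range is the one proposed above; the dbt_expectations range is purely illustrative and would need to match whichever release we settle on):

# packages.yml sketch -- dbt_expectations version range is illustrative only
packages:
    - package: dbt-labs/dbt_utils
      version: [">=0.9.0", "<1.0.0"]
    - package: calogica/dbt_expectations
      version: [">=0.7.0", "<0.9.0"]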

rlh1994 commented 1 year ago

tl;dr

Since dbt-expectations doesn't have a tolerance value for not-null tests, and all but one of our unique tests are on primary key columns, it doesn't make sense to add dbt-expectations in place of the standard tests dbt currently offers. However, we could use it in integration or alerting-type tests in the future.

Types of tests

Integration tests

Currently, all integration testing is done with the standard tests (unique/not null) and an actual vs expected comparison. We could utilise the functionality of dbt-expectations to add a large number of standard checks to ensure nothing is broken, without relying on having first generated an output dataset from the package itself. Things such as start_tstamp <= end_tstamp, both <= model_tstamp, uuid columns being 32(?) alpha-numeric characters long, etc. would ensure no core functionality is broken. Combining these with the actual vs expected comparison, and ensuring that a new test is added for each new feature and reported bug we find, would allow for more confidence in releases and less manual work to validate the expected results in the first place. End users would never interact with these.
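A rough sketch of the kind of checks this could cover, using dbt-expectations (the timestamp columns exist on the page views table; the UUID regex is an assumption about the stored format and would need confirming):

# integration test sketch -- the UUID regex/length is an assumption to confirm
models:
    - name: snowplow_web_page_views
      tests:
          - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
                column_A: end_tstamp
                column_B: start_tstamp
                or_equal: true
      columns:
          - name: page_view_id
            tests:
                - dbt_expectations.expect_column_values_to_match_regex:
                      regex: "^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$"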

Data quality tests

These are what already exist in the package: checks for nulls, uniqueness, etc. We could expand these to include certain regex matches using dbt-expectations; however, most of this should be captured by the upfront schema validation, and anything custom would need to have been implemented by the end user anyway. The real challenge with these is that even when they fail, the user is unlikely to be able to (or want to) edit the source data to address them. They may choose to exclude some records, but they shouldn't be editing the raw events table at source, which means that regardless of the outcome of these tests there is no action for anyone to take.
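If we did add any, a rough sketch of the shape they might take (the regex and severity are purely illustrative; page_url is an existing page views column):

# data quality sketch -- regex and severity are illustrative
models:
    - name: snowplow_web_page_views
      columns:
          - name: page_url
            tests:
                - dbt_expectations.expect_column_values_to_match_regex:
                      regex: "^https?://"
                      config:
                          severity: warn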

Alerting tests

These types of tests are designed to be run on aggregated data to look for abnormalities in behaviour/trends, and are designed for the analytics/BI engineer to have early warning of abnormalities in the data before e.g. a dashboard is published with the latest results. It is not there for them to address source data issues directly, but to trigger follow-up insight/reporting on the abnormalities and either identify the root cause or exclude the data from reporting.

Things like high bot traffic, high/low page views, abnormal session lengths, etc. could all be built to enable quick alerting on core business metrics without the end user having to build these manually. These could be implemented in one of two ways, discussed below.
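Before getting into the implementation options, a rough sketch of what one such alerting expectation could look like (engaged_time_in_s is an existing sessions column, but the thresholds and var names are placeholders rather than proposed defaults):

# alerting sketch -- thresholds and var names are placeholders
models:
    - name: snowplow_web_sessions
      columns:
          - name: engaged_time_in_s
            tests:
                - dbt_expectations.expect_column_mean_to_be_between:
                      min_value: '{{ var("snowplow__test_min_avg_engaged_time", 10) }}'
                      max_value: '{{ var("snowplow__test_max_avg_engaged_time", 600) }}'
                      tags: snowplow_alert_tests
                      config:
                          severity: warn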

Implementing alerting tests

Both options below would use user-defined variables to enable/disable and configure specific tests, would run via dbt test, and would have a custom tag to separate them from the data quality tests. Both can also make use of dbt-expectations, but may require custom SQL tests in some cases.

Option 1: raw tests

Tests are written directly onto the _this_run or derived tables, and are run only via a dbt test command.

Pros:

Cons:

Option 2: tests log table

A model is written that runs and unions all tests that would be run, but these are standalone SQL statements with results stored in a table of some fixed form, e.g. DATE/TEST/CONFIG/RESULT. dbt test would then be run very simply on this table, looking at any new records which have failed their test (might need a _this_run_tests table to only check the latest test run...). This means that all history is recorded, and the end user can build any alerting they want on top of the warehouse data (e.g. https://stackoverflow.com/questions/67706155/how-to-trigger-email-notification-to-group-of-users-from-bigquery-on-threshold).
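As a sketch, the dbt test layer on top of such a log table could then be as simple as the below (the model and column names are hypothetical):

# sketch only -- the log table model and column names are hypothetical
models:
    - name: snowplow_web_tests_this_run
      columns:
          - name: test_result
            tests:
                - accepted_values:
                      values: ['pass']
                      config:
                          severity: warn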

Pros:

Cons:

Tests config in YAML

If we do use dbt-expectations tests, or go with option 1 and build some custom tests, then the configuration can be set in the USER project yaml file (due to variable precedence issues in model yaml files), and the below would go into the individual table yamls:

# dbt_project.yml
...
vars:
    snowplow_web:
        snowplow__test_unique_eventid_prop: #1 for each test
            test__min_value: 0.99
            test__max_value: 1.0
            test__enabled: true

# users.yml
models:
    - name: snowplow_mobile_users
      columns:
          - name: device_user_id
            tests:
                - unique
                - not_null
                - dbt_expectations.expect_column_proportion_of_unique_values_to_be_between:
                      min_value: '{{ var("snowplow__test_unique_eventid_prop", {"test__min_value": 0.99})["test__min_value"] }}'
                      max_value: '{{ var("snowplow__test_unique_eventid_prop", {"test__max_value": 1.0})["test__max_value"] }}'
                      tags: snowplow_alert_tests
                      config:
                          enabled: '{{ var("snowplow__test_unique_eventid_prop", {"test__enabled": false})["test__enabled"] }}'

For the second option we could use the same structure in the project file, but the variables will be called in the model itself instead.

In both cases, these test variables would be defined in the documentation but not in the default web project yaml, as there are likely to be many of these and this would be for more advanced users, and we want to avoid bloating the variables section as much as possible.

Next steps

  1. No action to be taken to replace existing tests with dbt-expectations tests
  2. Agree whether there is value in adding a more complex suite of integration tests
  3. Agree whether there is value in adding alert-style testing, and which method to go with