tnightengale / dbt-activity-schema

A dbt-Core package for generating models from an activity stream.
GNU General Public License v3.0

Activity Registry #25

Open bcodell opened 1 year ago

bcodell commented 1 year ago

Overview

The activity API takes an activity_name as an argument, and in its current state, users can pass any arbitrary string to it and the query will execute successfully against the warehouse. The following shortcomings exist with this interface:

Proposal

To address these shortcomings, I'm proposing an Activity Registry, or in other words, a mechanism that puts guardrails in place to ensure that users can interface with individual activities in a fluid manner when creating datasets. This solution should contain the following features:

Optional features include:

Implementation

In order to achieve the above functionality, under the hood, dbt (specifically the graph) needs to be made aware of all of the activities that exist in a given project. Assuming a 1:1 mapping from activities to dbt models, there are multiple approaches to achieve this outcome:

Then, when creating dbt models with the dataset macro, users should be able to reference some object that is aware of the names of each activity for the activity stream being referenced in the dataset, and that object should allow them to browse and tab-complete activity names. This will likely require leveraging new/existing VS Code extensions - specifics tbd.

bcodell commented 1 year ago

@tnightengale looking for the following feedback:

tnightengale commented 1 year ago

@bcodell

  1. Absolutely I think this is worth pursuing 👍
  2. Overall I think you've covered the problem well.
  3. My initial thoughts on implementations are:

I think the materialization is an abuse of the abstraction in dbt, in order to add attributes to the graph. By overloading that abstraction, we would take away users' choice to use existing materializations they may need like "incremental". Therefore I think meta or tags config is a better option to assign properties to the node in the graph.

What I did when implementing this with other data teams is just to create a macro that acts as a dict, e.g.:

{% macro get_activity(activity, attr="text") %}

{#
    Define a map of activities to ensure that the text for each activity
    is identical everywhere an activity is used. Can also define other
    features of an activity, for example a list of columns to include in
    the json_metadata.
#}

{% set defined_activities = dict(
        signed_up={
            "text": "signed up",
            "json_metadata": json_metadata([
                "page_id"
            ])
        },
        bought_something={
            "text": "bought something",
            "json_metadata": json_metadata([
                "item_id",
                "category_id",
                "promo_code"
            ])
        }
    )
%}

{% do return(defined_activities[activity][attr]) %}

{% endmacro %}

Then folks just look the activity up through that macro. For example, in a model to create an activity:

select
    customer,
    ts,
    '{{ get_activity("signed_up") }}' as activity,
    {{ pack_json(get_activity("signed_up", "json_metadata")) }},

    ...

One limitation of this approach is that the users must implement this macro themselves; there's no "registration" interface.

An upside, however, is that it's easy to grok all the activities and their feature json columns in one location.

Ideally, I'd like to have a solution that allows for a registration interface, automatic schema checks on "activity" models, and most of all:

inferring activities based on the upstream dependencies of the activity stream being used in the dataset - a potentially clever hack that requires limited additional implementation and doesn't change the dev experience, but it assumes that all activity schema implementations will materialize an activity stream as a dbt model (which is optional per the Activity Schema v2 spec), and it makes downstream hints potentially more challenging, since this information will only be available after the project is parsed

So it feels like perhaps activity names and feature_json should be registered in a yml, either in a meta tag, or as vars in the dbt_project.yml.

Perhaps the registration key could be the model identifier:

# some yml
activities:
    activity__sign_up:
        text: "sign up"
        feature_json:
            - feat_1
            - feat_2
    activity__bought_something:
    ...

And we should provide macros like the ones above to look up the activity in the registry yml and pack the json. By including the key in the lookup, we should be able to apply schema checks to those models.
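
A minimal sketch of what such a lookup macro could look like, assuming the registry yml above is exposed as an `activities` var in dbt_project.yml (the macro name, var name, and keys here are illustrative assumptions, not part of the package):

{% macro lookup_activity(activity, attr="text") %}

{#
    Hypothetical sketch: read the registry yml shown above from a project var,
    so each activity is defined once and referenced everywhere else.
#}

{% set registry = var("activities", {}) %}
{% do return(registry[activity][attr]) %}

{% endmacro %}

A model or dataset would then call something like {{ lookup_activity("activity__sign_up", "feature_json") }} instead of repeating the raw strings.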

Thoughts?

bcodell commented 1 year ago

@tnightengale nice! Very different approach. The implementation makes sense, so I'm focusing this reply on areas where we disagree.

I think the materialization is an abuse of the abstraction in dbt, in order to add attributes to the graph. By overloading that abstraction, we would take away users' choice to use existing materializations they may need like "incremental".

My thought here is that all activity dbt models will be persisted as tables, and that the activity materialization could have a config parameter called materialization_strategy that defaults to table (or something more explicit like replace) but users could optionally specify incremental. My expectation is that incremental loads for activity models should follow fairly standard patterns (i.e. a self-reference to get the latest timestamp, which can then be referenced arbitrarily in the query via a macro). Then the materialization could have standard logic implemented based on whether the model should be built incrementally or not. It would look something like this:

activity.sql

{{ config(
    materialized='activity',
    text='activity name', -- optional - we can also add some string processing based on the model name
    stream=streams().stream_name, -- we'll likely need an interface for a stream registry as well to support multiple streams in a single project (saving that for a separate issue), but the stream should be explicitly stated in the config to support cases where a project has multiple Activity Schemas (e.g. customers, users) but doesn't materialize the streams.
    materialization_strategy='table', -- or 'incremental'
    features = {
        'feature_1': {'data_type': dbt.type_string()},
        'feature_2': {'data_type': dbt.type_int()},
        'feature_3': {'data_type': dbt.type_timestamp()},
        'feature_4': {'data_type': 'boolean'},
    }

)}}

with base_cte as (
    select
        id as entity_id,
        created_at as ts,
        a as feature_1,
        b as feature_2,
        c as feature_3,
        d as feature_4
    from {{ ref('stg_model') }}
    {% if is_incremental() %}
    where created_at > {{ max_activity_ts() }}
    {% endif %}
)

One note on the above - the example code would require an override of the is_incremental macro, but the code for it hasn't changed in over 2 years, so porting the logic for backwards compatibility and adding some additional logic to accommodate this use case seems straightforward and stable enough.
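
To make that concrete, here's a hedged sketch of what such an override might look like, mirroring the checks in dbt-core's built-in macro and adding one condition for the proposed activity materialization (the 'activity' materialization and its materialization_strategy config are assumptions from this proposal, not existing dbt features):

{% macro is_incremental() %}
    {#
        Sketch only: same checks as dbt-core's is_incremental, plus acceptance
        of models using the proposed 'activity' materialization when their
        materialization_strategy config is set to 'incremental'.
    #}
    {% if not execute %}
        {{ return(false) }}
    {% else %}
        {% set relation = adapter.get_relation(this.database, this.schema, this.table) %}
        {% set activity_incremental = (
            model.config.materialized == 'activity'
            and model.config.get('materialization_strategy') == 'incremental'
        ) %}
        {{ return(
            relation is not none
            and relation.type == 'table'
            and (model.config.materialized == 'incremental' or activity_incremental)
            and not should_full_refresh()
        ) }}
    {% endif %}
{% endmacro %}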

Then for the query itself, the final select statement for every activity should look roughly like the following to enforce schema consistency:

select
    -- port surrogate_key from dbt_utils to reduce dependency on other packages
    cast({{ dbt_activity_schema.surrogate_key(surrogate_key_fields) }} as {{dbt.type_string()}}) as activity_id
    , cast({{ project_config.entity_id }} as {{dbt.type_string()}}) as {{ entity_id }}
    , cast('{{config.text}}' as {{dbt.type_string()}}) as activity_name
    , ts
    , {{ dbt_activity_schema.pack_json(config.features.keys()) }} as feature_json
from {{cte}}

With the materialization approach, this could be appended in one of two ways:

  1. injected into the sql context variable in the materialization logic itself
  2. users call a macro like {{ dbt_activity_schema.build_activity() }} as the last line of each activity model query (and with an activity materialization, we can actually enforce usage of the build_activity macro as a method of schema enforcement)
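
To illustrate the second option, here's a rough sketch of what a build_activity macro could render, pulling the activity text and features from the calling model's config(); the argument names and the simplified entity_id handling are assumptions:

{% macro build_activity(base_cte='base_cte', surrogate_key_fields=['entity_id', 'activity_name', 'ts']) %}
{#
    Sketch only: renders the standard final select shown above so every
    activity model produces the same column set.
#}
select
    cast({{ dbt_activity_schema.surrogate_key(surrogate_key_fields) }} as {{ dbt.type_string() }}) as activity_id
    , cast(entity_id as {{ dbt.type_string() }}) as entity_id
    , cast('{{ config.get("text") }}' as {{ dbt.type_string() }}) as activity_name
    , ts
    , {{ dbt_activity_schema.pack_json(config.get("features", {}).keys()) }} as feature_json
from {{ base_cte }}
{% endmacro %}

The last line of each activity model would then just be {{ dbt_activity_schema.build_activity() }}, which also gives us a single place to enforce the schema contract.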

For this to work, users would also need to specify:

Then, the registry interface would be a macro that no dbt developer needs to touch:

{% macro activity_registry(stream) %}
{% if execute %}
{% set activities = graph.nodes.values()
    | selectattr("config.materialized", "equalto", "activity")
    | selectattr("config.stream", "equalto", stream)
    | list %}
{% do some_preprocessing... %}
{% do return(preprocessed_activities) %}
{% endif %}
{% endmacro %}

The returned value is a dictionary of dictionaries, where each top-level key is the model name of an activity model that feeds into the stream. And activities could be accessed like so:

-- specify an activity
{{ activity_registry(stream).activity_name }}
-- specify a feature
{{ activity_registry(stream).activity_name.features.feature_1 }}

The end result (i.e. how activities and features are referenced in datasets) is similar to yours, but I personally prefer an approach like this for a few reasons:

One limitation of this approach is that the users must implement this macro themselves; there's no "registration" interface.

To clarify, if we were to go with the yaml-centric approach, this would get resolved by having users define their activity metadata in the yaml and the macro could parse that yaml without users needing to register again in the macro, right? Or would they need to do both? If it's the latter, then I'd advocate away from this approach, as that redundant effort will prove to be tedious during development.

tnightengale commented 1 year ago

To clarify, if we were to go with the yaml-centric approach, this would get resolved by having users define their activity metadata in the yaml and the macro could parse that yaml without users needing to register again in the macro, right? Or would they need to do both? If it's the latter, then I'd advocate away from this approach, as that redundant effort will prove to be tedious during development.

Yes, the macros would just parse the yml. I like this approach because it allows folks to be flexible about how they want to list their activities: they could do it all in one yml or in a separate yml for each activity model.

The config() arg in a model is just another ingress to the same data that can be supplied via yml. So really, if we implement the macro to fetch the config, based on the meta field, then we allow users any of the following options to configure their activities:

  1. one large yml
  2. many yml
  3. model specific config()

We also inherit all the hierarchy that dbt does with paths and configs. For those reasons, I can't support the materialization approach. I think we can gain all the benefits by just using the meta keyword instead, without the complexity of creating a new materialization class, which actually isn't a materialization at all. It's just a way to pass configs.
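
Concretely, the registry lookup from earlier could filter on the meta field instead of a custom materialization; a hedged sketch (the `activity` meta key and its shape are assumptions):

{% macro activities_from_meta(stream) %}
{#
    Sketch only: collect every model that declares an `activity` entry in its
    meta config, regardless of whether that config came from dbt_project.yml,
    a schema yml, or the model's own config() call.
#}
{% if execute %}
    {% set registry = {} %}
    {% for node in graph.nodes.values() %}
        {% set activity_meta = node.config.meta.get("activity") if node.config.meta else none %}
        {% if activity_meta and activity_meta.get("stream") == stream %}
            {% do registry.update({node.name: activity_meta}) %}
        {% endif %}
    {% endfor %}
    {% do return(registry) %}
{% endif %}
{% endmacro %}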

Schema verification is easy: just check the config of registered activities, and run a singular test, which can be included in the package, against each of those models. I have a private macro that already does this on a client's project.
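
For illustration only (this is not that private macro), one way to sketch the check is a generic test, shipped with the package and applied to each registered activity model, that introspects the materialized relation and fails when the canonical activity columns are missing:

{% test is_activity(model) %}
{#
    Sketch only: return one row per missing column so the test fails whenever
    a registered activity model drifts from the expected schema. The required
    column names are assumptions based on the select statement above.
#}
{% if execute %}
    {% set required = ["activity_id", "entity_id", "activity_name", "ts", "feature_json"] %}
    {% set actual = adapter.get_columns_in_relation(model) | map(attribute="name") | map("lower") | list %}
    {% set missing = required | reject("in", actual) | list %}
    {% if missing %}
        {% for column in missing %}
        select '{{ column }}' as missing_column
        {% if not loop.last %}union all{% endif %}
        {% endfor %}
    {% else %}
        select cast(null as {{ dbt.type_string() }}) as missing_column limit 0
    {% endif %}
{% endif %}
{% endtest %}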

Finally, I do like the convenience function: {{ dbt_activity_schema.build_activity() }} - but this can just be added explicitly at the bottom of the model. I don't think we need to have it injected via a materialization. It will be awesome because it will just render everything as needed, based on the config().

bcodell commented 1 year ago

I'm fine to concede on the custom materialization as the solution to this - I think I was conflating this work with a personal disdain for the ergonomics of having to maintain relevant metadata for a single model across multiple files and file types. But that's a problem that needs to be solved by dbt, not by a dbt package. (Plus I know there are open tickets to discuss moving model config into the corresponding sql file, so I can wait for that to be implemented.)

For the registry syntax, I agree that the key should be the model identifier. We'll also need users to specify the data type registered for each feature, which will be used when unpacking json during dataset compilation. We should also require users to specify which activity stream/schema the activity is associated with. (I'm working on issues for how to support multiple Activity Schemas in a single project and a feature for building Activity Streams, and this tag will be beneficial for those implementations.)
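
For example, a registry entry carrying both pieces of information could look like this in a model's config() (the key names are assumptions and would apply equally in yml form):

{{ config(
    meta={
        "activity": {
            "stream": "customer_stream",
            "text": "sign up",
            "features": {
                "feat_1": {"data_type": "string"},
                "feat_2": {"data_type": "int"},
            },
        }
    }
) }}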

For schema verification, I'd personally prefer for the model to fail to materialize rather than run a test after the fact. Maybe I'm thinking about this wrong, but most dbt DAG runs I've seen materialize all models and then run all tests, so having the model fail to materialize prevents data that violates the schema contract from being made available downstream (e.g. in the stream itself or in datasets). This will be especially pertinent for developers who want to leverage incremental models, as those developers will need to re-run the entire pipeline if a schema test fails. I think this can effectively be solved simply by using the {{ dbt_activity_schema.build_activity() }} convenience function. But would you mind sharing the private macro you have here? It'd be helpful to see a concrete example of how you're thinking about this part of the implementation.