marcua / datools

Other
115 stars 4 forks source link

dbt diff() macro #42

Open dbeatty10 opened 2 years ago

dbeatty10 commented 2 years ago

I'm interested in prototyping the following proposal.

Goal

Enable the DIFF operator within a dbt project.

Implementation proposal

Potential syntax

with
this_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-28' and '2018-09-04'

),

last_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-21' and '2018-08-28'

)

{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}

Examples of dbt macros

These two dbt packages contain macros that might be useful for inspiration:

For example:

marcua commented 2 years ago

Thank you so much for your interest @dbeatty10! This is a really exciting proposal, as I'd love to see the library hooked into the existing data ecosystem. I can think of a few considerations. Hopefully this long-ish response comes off as excited and welcoming rather than off-putting or overwhelming. After you read it, let me know where/how I can help!

Separating query generation from query execution

Right now the explanations.diff function does two things: 1) Goes through the process of generating a query, 2) Executes the query to return potential explanations.

In the code, the crossover from step 1 to step 2 is here: https://github.com/marcua/datools/blob/595d19cf04923465fb3e527a6505b0775f3ce62f/datools/explanations.py#L271

In your proposal, you'd like to use part 1 without relying on part 2. Since part 1 isn't exposed as a public API just yet, one way to accomplish that would be to extract diff_query (all the part 1 stuff) from diff, and then both the dbt wrapper and diff can call it for execution purposes.

Query generation executes queries as well

The explanations.diff query doesn't just "statically compile" SQL---it has to execute queries in order to know how to parametrize the SQL it's generating. This is a bit tougher to solve than the previous issue. Here are a few examples:

The first issue seems easier to solve --- we can turn the support query into a subquery, and even benefit from one less round-trip to the database. The second issue will require a little bit of thought. Can dbt handle "I need to run a query inside this macro," or should we brainstorm ways to push the bucketing/rewrite logic into the query as well?

If neither queries-inside-macros nor pushing down the queries works, we can still explore an alternative where bucketing is its own macro distinct from diffing, but I'd love to avoid changing the diff API if at all possible.

Generating SQL beyond datools.diff

While diff is certainly the most exciting part of datools, there are other query generation goodies. For example, as part of making diff work on postgres, I've been implementing GROUPING SETS support for databases that don't implement grouping sets natively (e.g., SQLite, Redshift) and ones that do (e.g., DuckDB, Postgres). You could imagine dbt macros for that as well!