Open dbeatty10 opened 2 years ago
Thank you so much for your interest @dbeatty10! This is a really exciting proposal, as I'd love to see the library hooked into the existing data ecosystem. I can think of a few considerations. Hopefully this long-ish response comes off as excited and welcoming rather than off-putting or overwhelming. After you read it, let me know where/how I can help!
Right now the explanations.diff
function does two things: 1) Goes through the process of generating a query, 2) Executes the query to return potential explanations.
In the code, the crossover from step 1 to step 2 is here: https://github.com/marcua/datools/blob/595d19cf04923465fb3e527a6505b0775f3ce62f/datools/explanations.py#L271
In your proposal, you'd like to use part 1 without relying on part 2. Since part 1 isn't exposed as a public API just yet, one way to accomplish that would be to extract diff_query
(all the part 1 stuff) from diff
, and then both the dbt
wrapper and diff
can call it for execution purposes.
The explanations.diff
query doesn't just "statically compile" SQL---it has to execute queries in order to know how to parametrize the SQL it's generating. This is a bit tougher to solve than the previous issue. Here are a few examples:
The first issue seems easier to solve --- we can turn the support query into a subquery, and even benefit from one less round-trip to the database. The second issue will require a little bit of thought. Can dbt
handle "I need to run a query inside this macro," or should we brainstorm ways to push the bucketing/rewrite logic into the query as well?
If neither queries-inside-macros nor pushing down the queries works, we can still explore an alternative where bucketing is its own macro distinct from diffing, but I'd love to avoid changing the diff
API if at all possible.
datools.diff
While diff
is certainly the most exciting part of datools
, there are other query generation goodies. For example, as part of making diff work on postgres, I've been implementing GROUPING SETS
support for databases that don't implement grouping sets natively (e.g., SQLite, Redshift) and ones that do (e.g., DuckDB, Postgres). You could imagine dbt
macros for that as well!
I'm interested in prototyping the following proposal.
Goal
Enable the
DIFF
operator within a dbt project.Implementation proposal
diff()
macro as the main interfacePotential syntax
Examples of dbt macros
These two dbt packages contain macros that might be useful for inspiration:
For example: