grafana / dbt_leaner_query

MIT License

Feat: enable LeanerQuery to get BQ audit data from multiple projects #19

Open joenelson1 opened 3 months ago

joenelson1 commented 3 months ago

This update introduces the ability for LeanerQuery to query multiple BQ audit log datasets across projects and combine them into a single view for easy analysis. Things done in this PR:

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.

bobsamuels commented 3 months ago

@joenelson1 My concern about having a script that needs to be executed manually is that every environment that uses this package would need to run it, right? Any orchestration tool that installs dbt and its dependencies as part of pipeline execution (i.e., on a clean image) would need to execute the script in addition to calling dbt deps. I think our own Prefect flows will fail unless we run the Python script as part of the flow(s), right?

duncan771 commented 3 months ago

@joenelson1 Pulled this PR to run it locally. Found a few things:

joenelson1 commented 3 months ago

> @joenelson1 Pulled this PR to run it locally. Found a few things:
>
>   • We will need to add _is_relation and _is_ephemeral helper functions to the union_relations macro and remove the usage of any dbt_utils calls in that macro
>   • It looks like the two cloudaudit_googleapis_com_data_access models that I tested (Global and Reporting) can't be unioned correctly, because the Global version has an extra field in the protopayload_auditlog struct. Not sure of the best way to handle this part.

Thanks! I added the existing helper functions to the macro. I see the other problem you mentioned... not sure how to solve it yet. One other thing I noticed while looking at this: some columns appear in a different order in each dataset (resource, for instance). Either union_relations isn't going to work, or I'll have to define explicitly which fields we're receiving.
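One way around both the column-ordering and the extra-struct-field issues might be to skip positional unioning entirely and instead select an explicit, shared column list (including only the struct subfields common to every dataset) from each relation. A rough sketch, not part of this PR; the source names and the selected columns are purely illustrative:

```sql
-- Hypothetical sketch: union audit datasets by naming columns explicitly,
-- so differing column order between datasets doesn't matter, and select
-- struct subfields by name so an extra field in one dataset's
-- protopayload_auditlog struct doesn't break the union.
{% set audit_relations = [
    source('global_audit', 'cloudaudit_googleapis_com_data_access'),
    source('reporting_audit', 'cloudaudit_googleapis_com_data_access')
] %}

{% for relation in audit_relations %}
select
    timestamp,
    '{{ relation }}' as source_relation,
    protopayload_auditlog.methodName     as method_name,
    protopayload_auditlog.resourceName   as resource_name
from {{ relation }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```

Since each SELECT lists the same columns in the same order, BigQuery's positional UNION ALL lines up correctly regardless of how the underlying tables are laid out.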

joenelson1 commented 3 months ago

> @joenelson1 My concern about having a script that needs to be executed manually is that every environment that uses this package would need to run it, right? Any orchestration tool that installs dbt and its dependencies as part of pipeline execution (i.e., on a clean image) would need to execute the script in addition to calling dbt deps. I think our own Prefect flows will fail unless we run the Python script as part of the flow(s), right?

I think you're right. I am not sure how to automatically run a script when a package is installed, and the things I'm doing in the script can't really be done via Jinja. I'll see what I can come up with...

duncan771 commented 2 months ago

@joenelson1 I'm still poking around in this. I think we might be able to use codegen along with dbt's on-run-start hooks configured in the dbt_project.yml file, at least in theory. It would require some testing for sure.
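For reference, the on-run-start idea would presumably amount to something like the following in dbt_project.yml. This is a hedged sketch, not code from this PR; the macro name and package namespace are hypothetical:

```yaml
# dbt_project.yml (sketch): invoke a setup macro before every dbt run,
# so package consumers don't need to execute a separate manual script.
# `leaner_query.generate_audit_union_view` is a hypothetical macro name.
on-run-start:
  - "{{ leaner_query.generate_audit_union_view() }}"
```

Because on-run-start hooks fire on every invocation, this would keep clean-image orchestration (like the Prefect flows mentioned above) working with a plain dbt deps && dbt run, with no extra Python step in the pipeline.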