bastienboutonnet commented 2 years ago

Context

Users who use dbt have spent time crafting tests in their transformations. Because we don’t want users to repeat themselves to benefit from Soda Cloud’s visibility and incident management we plan to:

Ingest dbt test results into Soda Cloud.
Provide clear indications of whether the tests are maintained in soda-sql or ingested from dbt
Allow collaboration and reporting to be done on those results just as if they were maintained from soda-sql

User Flow

Users should be able to call a command such as soda ingest --tool dbt, provide soda-sql with paths to the dbt artifacts, and have their tests results show up in the Soda Cloud UI. We will have to be clear in our documentation that ingestion should be orchestrated once dbt has been invoked and the transformations have completed. Soda-sql (at least in this first version) will not integrate with the core dbt API. This means it will not orchestrate a dbt run or be called by dbt. Those scenarios may be revisited in further iterations of our integration.

We may also consider creating a new soda command such as soda capture which would run soda scan and follow it with soda ingest or a flag on soda scan such as --ingest dbt. Up for discussion and TBD once minimal isolated approach is working. This is sugar coating at this point.

Overall Approach

In this v1, we want to keep it simple. dbt generates two valuable artifacts: manifest.json and run_results.json. When it comes to test results, we can parse the results from the run_results.json and use the info in the manifest.json to gain knowledge about the table (dataset in Soda's language) that it maps on to.

Once we have parsed all the info we need (see below) we can compose a TestObject similar (but not exactly) to the soda-sql TestObject that is then exposed to the Soda Cloud backend API. The UI will show the ingested tests on the dataset view, those tests will be clearly identified as external and display a dbt logo or whatever UI choice makes sense.

Specific Requirements

We will want to parse and represent the following information from dbt test results:

test_run_at: timestamp of when the test was run.
test_result: whether the test is passing or failing.
severity: if this info is available (I know it is for freshness but maybe not for all).
info that allows us to generate a fully qualified table name:
- database
- schema
- table
tool_name: in this case it would be dbt (this information will be used by the Soda Cloud UI to display the test provenance and govern available attributes and details)
compiled_sql: dbt should give us access to the compiled sql snippet that it ran for the test. We may want to parse this as we can show it in the Soda Cloud UI and it is valuable information that users can use as a starting point for investigation and issue resolution. If users have enabled store_failures we should be able to obtain the select statement where they can visualise those failures. If this is possible we could consider adding those to the requirements. More info on this can be found here: https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures

Concious Limitations

v1 will only support "generic tests" : this means dbt builtins as well as the ones implemented via macros and added to models or columns of those models.
v1 will not support "bespoke tests" unless they are following the best practice of {{ ref() }} in their declaration and in that case the mapping should be available in a structured way just like any other tests (manifest.json and `run_results.json)
v1 will not attempt to capture results from store_failures (this may be something we offer as a stretch goal as if fits our failed_rows model (see last point of Specific Requirements
v1 will be limited to ingesting results from dbt-core (OSS). A future iteration will know how to talk to the dbt Cloud API to leverage similar information.

JCZuurmond commented 2 years ago

@bastienboutonnet : thanks for doing this write up!

My plan is as follows:

Add an ingest command to the cli
Parse the manifest.json and run_results.json to get the required test and model info from the dbt artifacts.
Map the parsed dbt tests and models to a Soda test result
- test_run_at <- TimingInfo
- test_result <- TestStatus
- database and schema <- HasRelationMetadata.
- table <- ParsedNodeMandatory.alias
- compiled_sql <- RemoteCompileResultMixin
- severity <- TestConfig.Severity
Tests and models are matched as follows:
1. Get tests from run_results
2. Search for test in manifest using the unique_id
3. Get the models from the depends_on
4. Look for models in the manifest using the unique_id
Flush the test results to the Soda cloud

bastienboutonnet commented 2 years ago

Hey @JCZuurmond, I think your plan sounds really good to me!

Where I see there will be a little back and forth might be around the mapping work for two reasons:

we'll want to double check we're passing the right info (for example ParsedNodeMandatory.alias sounds strange to me that it's an alias, but it may just be strange naming. As far as I know the manifest.json has fields for database, schema and another one for the actual table name but since you're using dbt internals to parse it it may end up there. I haven't gone deep into their objects. But if you checked then its fine.
the TestResult that is currently implemented in soda, needs a scan time. We're discussion with @dirkgroenen whether we will open a new set of APIs on the backend that relax that contract but there are quite a few things that the backend needs to figure out with this as it forces the backend to come up with a new way to manage alert notifications lifecycle.
That being said, for now, we can totally pretend that we will work against the TestResults contract so that you don't have to worry about this for the time being.

bastienboutonnet commented 2 years ago

Quick update on one of the bullet points I made earlier:

the TestResult that is currently implemented in soda, needs a scan time. We're discussion with @dirkgroenen whether we will open a new set of APIs on the backend that relax that contract but there are quite a few things that the backend needs to figure out with this as it forces the backend to come up with a new way to manage alert notifications lifecycle.

The backend won't commit, at this point, to have those new API endpoints ready for this version. So we'll have to make do with the current limitations. This means:

When we're about to compose a TestResult object, we will need to come up with a scan_time to satisfy the requirements of the backend API contract.

vijaykiran commented 2 years ago

Is there some kind of time available in manifest that we can use?

bastienboutonnet commented 2 years ago

We could use the time of the run from the run_results.json I believe yes.

JCZuurmond commented 2 years ago

Note: we do not use the time mentioned in the run_results. The time mentioned there is per test, where we want one per table. The current implementation inserts the time for when the dbt ingest is executed.

bastienboutonnet commented 2 years ago

I think for now, that will be a caveat we have to live with...

We'll have to make sure to document it quite well. I'll add that to my list when I talk with @janet-can for documenting this feature.

I think in a second iteration @vijaykiran we might want to have a tighter integration indeed.

I'm copying one of @JCZuurmond 's message from the community slack here for posterity:

An alternative, interesting approach would be to replace the dbt test functionality with the soda test functionality (similar to this repo https://github.com/dmateusp/dbt-opentelemetry). The way-of-working I am envisioning is that a user install dbt-soda and all test will be executed trough the Soda engine instead of dbt. The benefit of this tight integration is that it resolves the issue with executing SQL twice and delays between execution of SQL and uploading it to the Soda cloud (see issue for more detailed explanation). The big downside is that it requires some monkey patching of the dbt test functionality, which might be hard to maintain, depending on how stable dbt is.

An alternative to the alternative is to implement the issue Bastien created, and that we find a way to add a post hook to the dbt test that uploads the results to the Soda cloud. I created an issue for this: https://github.com/dbt-labs/dbt-core/issues/4333. It will not solve the running SQL twice issue, but it minimizes the delay mentioned before and reduces the complexity of uploading the dbt artifacts to the Soda cloud.

janet-can commented 2 years ago

Good to know, @bastienboutonnet. This is now on my radar for documentation.

sodadata / soda-sql

Allow users to ingest dbt test results into Soda Cloud via soda-sql #160

Context

User Flow

Overall Approach

Specific Requirements

Concious Limitations